mixed encodings - how to manage (MySQL Database forums, Database Forums category)
|
|||
|
I have a table in MySQL that is presently entirely encoded with latin1 charset. Several of the varchar fields are used as CSS styling values and, as such, do not need to be encoded with utf8. However, two of the fields become visible content in the HTML page, and I want to change the encoding for those two fields to utf8.

What kind of header() and meta tag specifiers should I use in a mixed encoding situation like that? If I specify "UTF-8" will the browser be able to automatically distinguish the situations that require reading one byte (latin1) from those requiring more bytes (utf8)?
|
|||
|
On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
wrote:
>[snip]
>What kind of header() and meta tag specifiers should I use in a mixed
>encoding situation like that? If I specify "UTF-8" will the browser
>be able to automatically distinguish the situations that require
>reading one byte (latin1) from those requiring more bytes (utf8)?

Using UTF-8 won't hurt.
For the ASCII part (0-9, A-Z, a-z, almost all punctuation) the bytes are exactly the same.
Only other characters need a sequence of 2, 3 or occasionally 4 bytes.

Mixed encoding within a web page isn't really possible.
The whole page uses whatever is declared in the HTTP Content-Type header or in the meta tag.
You can't switch to another encoding halfway.

Again, the single-byte ASCII characters are the same in ISO-8859-1 and UTF-8; only the encoding of the other characters is different.
You have to choose one and stick to it. If ISO-8859-1 misses a few symbols you need, you'd better use UTF-8 everywhere. That way you cover many, many more symbols than in any 1 byte encoding.

(Anyone please correct me if I'm wrong).
--
( Kees ) c[_]
I'll never own an AM radio... What good is a radio that won't work after noon? (#529)
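
To make the byte-level claim above concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself uses PHP). The sample characters are arbitrary picks, not taken from the posts. It compares how many bytes each character takes in ISO-8859-1 and in UTF-8.

```python
# Compare the encoded length of a character in ISO-8859-1 (latin1) and UTF-8.
def encoded_lengths(ch):
    """Return (latin1_length, utf8_length).

    latin1_length is None when the character does not exist in ISO-8859-1.
    """
    try:
        latin1_len = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        latin1_len = None
    return latin1_len, len(ch.encode("utf-8"))


# Illustrative samples: an ASCII letter, an accented letter,
# the euro sign (not in ISO-8859-1) and an emoji.
for ch in ["A", "é", "€", "😀"]:
    print(repr(ch), encoded_lengths(ch))
```

Only the ASCII letter has the same length (and the same byte) in both encodings; the accented letter is one byte in ISO-8859-1 but two in UTF-8.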
|
|||
|
On Sun, 20 Jan 2008 19:51:45 +0100, Kees Nuyt <k.nuyt@nospam.demon.nl>
wrote:
>[snip]
>You have to choose one and stick to it. If ISO-8859-1
>misses a few symbols you need, you'd better use UTF-8
>everywhere. That way you cover many many more symbols than
>in any 1 byte encoding.
>
>(Anyone please correct me if I'm wrong).

This is what I thought was likely, and it certainly makes sense. However, what threw me was how a browser set to read an HTML file as utf8 would be able to recognize a 1 byte character when it is also expecting 2, 3 or even 4 byte characters.

Maybe I don't have to worry about this and can proceed on faith alone, but it would be both interesting and reassuring to know the answer. Does anyone know how this is accomplished?
|
|||
|
Kees Nuyt wrote:
> [snip]
> Using UTF-8 won't hurt.
> For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
> the bytes are exactly the same.
> Only other characters need a sequence of 2, 3 or occasionally 4 bytes.

The characters with a value of 127 and lower are the same; everything else is different.

> Mixed encoding within a web page isn't really possible.
> The whole page uses whatever is declared in the HTTP Content-Type
> header or in the meta tag.
> You can't switch to another encoding halfway.

If the data in the database is mixed, then the charset has to be unified before it is injected into the "page", or else characters may not be displayed correctly. Trying to show ISO-8859 characters as UTF-8 will result in quite a few question marks, and the other way around will result in strange characters.

--
//Aho
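
The two failure modes described above can be reproduced with a short Python sketch (Python is used for illustration only, and the word "café" is an arbitrary sample). It decodes bytes with the wrong charset in both directions.

```python
text = "café"
latin1_bytes = text.encode("latin-1")  # the é is the single byte 0xE9
utf8_bytes = text.encode("utf-8")      # the é is the two bytes 0xC3 0xA9

# ISO-8859-1 bytes read as UTF-8: 0xE9 announces a multi-byte sequence
# that never arrives, so the decoder substitutes U+FFFD, which browsers
# typically render as a question mark symbol.
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as ISO-8859-1: every byte becomes its own character,
# so the two-byte é turns into two "strange" characters.
utf8_read_as_latin1 = utf8_bytes.decode("latin-1")

print(latin1_read_as_utf8)
print(utf8_read_as_latin1)
```

Neither direction raises an error in a browser; the damage only shows up as wrong characters on the page, which is why the charset has to be unified before output.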
|
|||
|
firewoodtim@yahoo.com wrote:
> This is what I thought was likely, and it certainly makes sense.
> However, what threw me was concern about how a browser set to read
> an html file using utf8 would be able to recognize a 1 byte character,

1 byte characters have a byte value of 127 or less.

> when it was also expecting 2 or 3 or even 4 byte characters as well.

Those start with a byte value of 128 or higher.

--
//Aho
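
As a rough illustration of the rule above, the following Python sketch shows how a decoder can tell from the first byte alone how long a character is. The function names are hypothetical, and the sketch assumes well-formed input: it does not validate continuation bytes the way a real decoder would.

```python
def utf8_sequence_length(lead_byte):
    """Return the number of bytes in a UTF-8 sequence, given its first byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII, one byte
    if 0xC2 <= lead_byte <= 0xDF:
        return 2  # 110xxxxx: start of a 2-byte sequence
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # 1110xxxx: start of a 3-byte sequence
    if 0xF0 <= lead_byte <= 0xF4:
        return 4  # 11110xxx: start of a 4-byte sequence
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def split_utf8(data):
    """Split well-formed UTF-8 bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_utf8("aé€".encode("utf-8")))
```

Bytes in the range 0x80 to 0xBF are continuation bytes and can never start a character, which is why the function rejects them as lead bytes.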
|
|||
|
On Sun, 20 Jan 2008 20:23:01 +0100, "J.O. Aho" <user@example.net>
wrote:
>firewoodtim@yahoo.com wrote:
>
>> This is what I thought was likely, and it certainly makes sense.
>> However, what threw me was concern about how a browser set to read
>> an html file using utf8 would be able to recognize a 1 byte character,
>
>1 byte characters have a byte value of 127 or less.
>
>> when it was also expecting 2 or 3 or even 4 byte characters as well.
>
>Those start with a byte value of 128 or higher.

So just to be sure, let me see if I understand this correctly.

I have a PHP script running that takes latin1 data from a variety of MySQL columns and uses them as CSS values in "style" attributes. These do not have to be converted to utf8 encoding, because latin1 values are a subset of utf8 and any conversion would result in the same 1 byte characters that I started with anyway.

The utf8 fields that will be displayed visibly in the browser, rather than used as HTML styling values, will be correctly interpreted by the browser: if there are any non-latin1 characters, the browser will recognize them as utf8 from their byte values (128 or higher) and display them correctly.

This means that if I specify utf8 using the header() function and again in the meta tags, and then use latin1 and utf8 characters in the script to form the styling and visible content respectively, everything should turn out OK.

Am I right?
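
One caveat on the "subset" assumption above: only the ASCII range (bytes 0 to 127) of latin1 is byte-identical in UTF-8. A minimal Python sketch of that check follows; the function name and the sample column values are hypothetical, not taken from the poster's table.

```python
def is_safe_without_conversion(latin1_bytes):
    """True if latin1 data can be sent unchanged on a UTF-8 page.

    That holds only when every byte is ASCII (below 0x80); any higher
    byte is a latin1 character whose UTF-8 encoding is different.
    """
    return all(b < 0x80 for b in latin1_bytes)


# Hypothetical column values: CSS values are normally pure ASCII.
css_values = [b"#ff0000", b"12px", b"Arial, sans-serif"]
print([is_safe_without_conversion(v) for v in css_values])

# A latin1 value with an accented character does need converting.
print(is_safe_without_conversion("café".encode("latin-1")))
```

So typical CSS values pass through unchanged, but any latin1 column that can hold accented characters still has to be converted before it is written into a UTF-8 page.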
|
|||
|
On Sun, 20 Jan 2008 20:10:28 +0100, "J.O. Aho"
<user@example.net> wrote:
>Kees Nuyt wrote:
>> [snip]
>> Mixed encoding within a web page isn't really possible.
>> The whole page uses whatever is declared in the HTTP Content-Type
>> header or in the meta tag.
>> You can't switch to another encoding halfway.
>
>If the data in the database is mixed, then the charset has to be unified
>before it is injected into the "page", or else characters may not be displayed
>correctly. Trying to show ISO-8859 characters as UTF-8 will result in quite
>a few question marks, and the other way around will result in strange characters.

I agree.
--
( Kees ) c[_]
There is only one boss, the customer. And he can fire everybody in the company from the chairman on down, simply by spending his money somewhere else. (Sam Walton) (#35)
|
|||
|
On Sun, 20 Jan 2008 14:08:23 -0500, firewoodtim@yahoo.com
wrote:
>[snip]
>This is what I thought was likely, and it certainly makes sense.
>However, what threw me was concern about how a browser set to read
>an html file using utf8 would be able to recognize a 1 byte character,
>when it was also expecting 2 or 3 or even 4 byte characters as well.
>Maybe I don't have to worry about this and can proceed on faith alone,
>but it would be both interesting and reassuring to know the answer.
>Does anyone know the way this is accomplished?

Here is how UTF-8 works:
http://en.wikipedia.org/wiki/UTF-8#Description
--
( Kees ) c[_]
Why is the alphabet in that order? Is it because of that song? (#495)
|
|||
|
Kees Nuyt wrote:
> [snip]
> Using UTF-8 won't hurt.

I strongly disagree.

utf8 adds significant overhead to string processing, and in many cases MySQL has to reserve 3 bytes per utf8 character (e.g. for index or record size).

The right approach is to use the "minimal" charset that can encode the characters in question. So in the aforementioned table only the two fields should be changed to use the utf8 encoding.

> Mixed encoding within a web page isn't really possible.

Right.

MySQL solved this problem with the introduction of the client-related charset settings in 4.1. Now you can store data in e.g. latin1, while all data the client sends to or retrieves from the database can be e.g. utf8 (and will automagically be converted).

So the final suggestion is:
- declare utf8 encoding in the HTTP header
- use 'SET NAMES utf8' to declare all client <-> database traffic to use the utf8 encoding
- store data in the appropriate encoding, use utf8 only where necessary

XL
--
Axel Schwenke, Support Engineer, MySQL AB
Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
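
The suggestion above can be pictured with a small Python model. This is only a sketch of the idea, not MySQL's implementation: the function names are hypothetical, and Python's "latin-1" codec stands in for MySQL's latin1 charset, which is an approximation.

```python
def convert_for_client(stored, column_charset, client_charset="utf-8"):
    """Model of the server-side conversion: bytes stored in the column's
    charset are re-encoded into the charset the client asked for."""
    return stored.decode(column_charset).encode(client_charset)


def worst_case_bytes(char_length, max_bytes_per_char):
    """Space reserved for a fixed worst case, e.g. a 50-character column."""
    return char_length * max_bytes_per_char


# A latin1 column and a utf8 column both reach the client as UTF-8.
from_latin1 = convert_for_client("café".encode("latin-1"), "latin-1")
from_utf8 = convert_for_client("café".encode("utf-8"), "utf-8")
print(from_latin1 == from_utf8)

# 50 characters: 1 byte each in latin1 versus up to 3 each in MySQL's utf8.
print(worst_case_bytes(50, 1), worst_case_bytes(50, 3))
```

Because the conversion happens per column, the page only ever sees one encoding, while the storage cost of utf8 is paid only by the columns that need it.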
|
|||
|
On Mon, 21 Jan 2008 18:19:52 +0100, Axel Schwenke
<axel.schwenke@gmx.de> wrote:
>[snip]
>So the final suggestion is:
>- declare utf8 encoding in the HTTP header
>- use 'SET NAMES utf8' to declare all client <-> database traffic to use the
>utf8 encoding
>- store data in the appropriate encoding, use utf8 only where necessary
>
>XL

So, using this suggestion, I could just do the following (using the same table and fields as in my original description):

1. ALTER TABLE table_name MODIFY varchar_column VARCHAR(50) CHARACTER SET utf8 NOT NULL;

2. ALTER TABLE table_name MODIFY text_column TEXT CHARACTER SET utf8 NOT NULL;

I issue a similar ALTER command for each column in each table that I want to convert to utf8. Steps 1 & 2 will take care of converting the data from latin1 to utf8 and will set the charset of those columns only to utf8, leaving the table default intact at latin1.

3. In my PHP file, right after connecting to the db, add the line:
mysql_query("SET NAMES 'utf8'");

4. Send the HTTP header using the line:
header('Content-Type: text/html; charset=UTF-8');

5. Set a meta tag in all my scripts to:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Will that take care of it correctly? Are there any other steps to take?
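
Steps 4 and 5 only work if the header, the meta tag and the actual bytes all agree. The following Python sketch (illustrative only; the thread's own code is PHP, and all function names here are hypothetical) derives all three from a single charset constant so they cannot drift apart.

```python
import html

CHARSET = "UTF-8"  # single source of truth for header, meta tag and body


def content_type_header(charset=CHARSET):
    """HTTP header line announcing the page encoding."""
    return f"Content-Type: text/html; charset={charset}"


def meta_tag(charset=CHARSET):
    """Matching meta tag for the document head."""
    return f'<meta http-equiv="Content-Type" content="text/html; charset={charset}" />'


def render_page(visible_text, charset=CHARSET):
    """Return (header, body_bytes), with the body encoded in the declared charset."""
    page = (
        f"<html><head>{meta_tag(charset)}</head>"
        f"<body><p>{html.escape(visible_text)}</p></body></html>"
    )
    return content_type_header(charset), page.encode(charset)


header, body = render_page("café")
print(header)
print(body)
```

If the body bytes were produced in a different encoding than the one declared, the browser would show the wrong characters without reporting any error, so it is worth checking a page containing a non-ASCII character after the change.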