mixed encodings - how to manage

#1
01-20-2008
firewoodtim@yahoo.com

I have a table in MySQL that is presently entirely encoded with latin1
charset. Several of the varchar fields are used as CSS styling values
and, as such, do not need to be encoded with utf8. However, two of
the fields become visible content in the HTML page, and I want to
change the encoding for those two fields to utf8.

What kind of header() and meta tag specifiers should I use in a mixed
encoding situation like that? If I specify "UTF-8" will the browser
be able to automatically distinguish the situations that require
reading one byte (latin1) from those requiring more bytes (utf8)?

#2
01-20-2008
Kees Nuyt

On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
wrote:

>I have a table in MySQL that is presently entirely encoded with latin1
>charset. Several of the varchar fields are used as CSS styling values
>and, as such, do not need to be encoded with utf8. However, two of
>the fields become visible content in the HTML page, and I want to
>change the encoding for those two fields to utf8.
>
>What kind of header() and meta tag specifiers should I use in a mixed
>encoding situation like that? If I specify "UTF-8" will the browser
>be able to automatically distinguish the situations that require
>reading one byte (latin1) from those requiring more bytes (utf8)?


Using UTF-8 won't hurt.
For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
the bytes are exactly the same.
Only special characters need a 2- or sometimes 3-byte
sequence.

Mixed encoding within a web page isn't really possible.
The whole page uses whatever is declared in the encoding
header, the DOCUMENT TYPE or in the meta header.
You can't switch to another encoding halfway.

Again, the single-byte non-diacritic characters are mostly
the same between ISO-8859-1 and UTF-8; only the encoding
of the special ones is different.
You have to choose one and stick to it. If ISO-8859-1
lacks a few symbols you need, you'd better use UTF-8
everywhere. That way you cover many, many more symbols than
in any 1-byte encoding.

(Anyone please correct me if I'm wrong).
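
To make the byte-level claim concrete, here is a minimal Python sketch (Python is used only for illustration and is not part of the original thread; the sample strings and the helper name `encodings` are made up). It encodes the same text in ISO-8859-1 and in UTF-8 so the results can be compared.

```python
# Compare how ISO-8859-1 (latin1) and UTF-8 encode the same text.
# The sample strings below are illustrative only.
samples = [
    "color: #ff0000",   # pure ASCII, e.g. a CSS value
    "caf\u00e9",        # "cafe" with e-acute (U+00E9), present in latin1
    "\u20ac",           # euro sign (U+20AC), not present in ISO-8859-1
]


def encodings(text):
    """Return (latin1_bytes_or_None, utf8_bytes) for a string."""
    try:
        latin1 = text.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None  # the character does not exist in ISO-8859-1
    return latin1, text.encode("utf-8")


for s in samples:
    latin1, utf8 = encodings(s)
    print(repr(s), "latin1:", latin1, "utf8:", utf8)
```

The pure ASCII string comes out byte-for-byte identical in both encodings, the accented letter takes one byte in latin1 and two in UTF-8, and the euro sign cannot be encoded in ISO-8859-1 at all.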
--
( Kees
)
c[_] I'll never own an AM radio...
What good is a radio that won't work after noon? (#529)

#3
01-20-2008
firewoodtim@yahoo.com

On Sun, 20 Jan 2008 19:51:45 +0100, Kees Nuyt <k.nuyt@nospam.demon.nl>
wrote:

>[snip]

This is what I thought was likely, and it certainly makes sense.
However, what threw me was concern about how a browser set to read
an html file using utf8 would be able to recognize a 1 byte character,
when it was also expecting 2 or 3 or even 4 byte characters as well.
Maybe I don't have to worry about this and can proceed on faith alone,
but it would be both interesting and reassuring to know the answer.
Does anyone know the way this is accomplished?

#4
01-20-2008
J.O. Aho

Kees Nuyt wrote:
> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
> wrote:
>
>> I have a table in MySQL that is presently entirely encoded with latin1
>> charset. Several of the varchar fields are used as CSS styling values
>> and, as such, do not need to be encoded with utf8. However, two of
>> the fields become visible content in the HTML page, and I want to
>> change the encoding for those two fields to utf8.
>>
>> What kind of header() and meta tag specifiers should I use in a mixed
>> encoding situation like that? If I specify "UTF-8" will the browser
>> be able to automatically distinguish the situations that require
>> reading one byte (latin1) from those requiring more bytes (utf8)?

>
> Using UTF-8 won't hurt.
> For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
> the bytes are exactly the same.
> Only special characters need a 2- or sometimes 3-byte
> sequence.


The characters with an ASCII value of 127 or lower are the same; everything
else is different.


> Mixed encoding within a web page isn't really possible.
> The whole page uses whatever is declared in the encoding
> header, the DOCUMENT TYPE or in the meta header.
> You can't switch to another encoding halfway.


If the data in the database is mixed, then the charset has to be unified
before it is injected into the "page", or else characters may not be displayed
correctly. Trying to show iso-8859 characters as UTF-8 will result in quite
a few question marks, and the other way around will result in strange characters.
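
Both failure modes can be reproduced with a short Python sketch (illustrative only; the sample word is an assumption, and a browser's exact rendering may differ from Python's lenient decoder).

```python
# One word, encoded two ways ("cafe" with e-acute, U+00E9).
text = "caf\u00e9"

latin1_bytes = text.encode("latin-1")   # 4 bytes: c, a, f, 0xE9
utf8_bytes = text.encode("utf-8")       # 5 bytes: c, a, f, 0xC3, 0xA9

# latin1 bytes read as UTF-8: 0xE9 looks like the start of a 3-byte
# sequence that never arrives, so a lenient decoder substitutes the
# replacement character U+FFFD (often drawn as a question mark).
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as latin1: every byte is a valid latin1 character,
# so the two-byte sequence shows up as two unrelated characters.
as_latin1 = utf8_bytes.decode("latin-1")

print(ascii(as_utf8))    # escaped so the output is terminal-independent
print(ascii(as_latin1))
```

This is why the data has to be converted to a single encoding before it is written into the page: the browser applies one decoder to the whole byte stream.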



--

//Aho

#5
01-20-2008
J.O. Aho

firewoodtim@yahoo.com wrote:

> This is what I thought was likely, and it certainly makes sense.
> However, what threw me was concern about how a browser set to read
> an html file using utf8 would be able to recognize a 1 byte character,


1-byte characters have a byte value of 127 or less.

> when it was also expecting 2 or 3 or even 4 byte characters as well.


Those start with a byte value of 128 or higher.
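
A sketch of how a decoder can tell the cases apart, written in Python purely for illustration (the function names are made up, and the code deliberately skips validation of continuation bytes, so it is not a complete UTF-8 decoder). The first byte of each character determines how many bytes belong to it.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte uses."""
    if lead_byte < 0x80:             # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead_byte <= 0xDF:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead_byte <= 0xEF:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead_byte <= 0xF4:    # 11110xxx: four-byte sequence
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never occur.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


def split_characters(data):
    """Split UTF-8 bytes into one byte string per character.

    Simplified: continuation bytes are not checked.
    """
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# "a", e-acute (U+00E9) and the euro sign (U+20AC) in one string.
print(split_characters("a\u00e9\u20ac".encode("utf-8")))
```

Because a byte below 128 never appears inside a multi-byte sequence, ASCII text and multi-byte characters can be mixed freely in one UTF-8 stream without ambiguity.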


--

//Aho

#6
01-20-2008
firewoodtim@yahoo.com

On Sun, 20 Jan 2008 20:23:01 +0100, "J.O. Aho" <user@example.net>
wrote:

>firewoodtim@yahoo.com wrote:
>
>> This is what I thought was likely, and it certainly makes sense.
>> However, what threw me was concern about how a browser set to read
>> an html file using utf8 would be able to recognize a 1 byte character,

>
>1-byte characters have a byte value of 127 or less.
>
>> when it was also expecting 2 or 3 or even 4 byte characters as well.

>
>Those start with a byte value of 128 or higher.


So just to be sure, let me see if I understand this correctly.

I have a PHP script running that takes latin1 data from a variety of
MySQL columns and uses them as CSS values in "style" attributes. These
do not have to be converted to utf8 encoding, because latin1 values
are a subset of utf8 and any conversion would result in the same 1
byte characters that I started with anyway.

In the case of the utf8 fields that will be displayed visibly in the
browser, not used as html styling values, these will be correctly
interpreted by the browser, since if there are any non-latin1
characters, they will be recognized by the browser as utf8 from their
encoding values (128 or higher) and displayed correctly.

This means that if I specify utf8 using the header() function and
again in the meta tags, and then use latin1 and utf8 characters in the
script to form the styling and visible content respectively,
everything should turn out OK.

Am I right?

#7
01-20-2008
Kees Nuyt

On Sun, 20 Jan 2008 20:10:28 +0100, "J.O. Aho"
<user@example.net> wrote:

>Kees Nuyt wrote:
>> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>> wrote:
>>


[snip]

>> Using UTF-8 won't hurt.
>> For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
>> the bytes are exactly the same.
>> Only special characters need a 2- or sometimes 3-byte
>> sequence.

>
>The characters with an ASCII value of 127 or lower are the same; everything
>else is different.
>
>
>> Mixed encoding within a web page isn't really possible.
>> The whole page uses whatever is declared in the encoding
>> header, the DOCUMENT TYPE or in the meta header.
>> You can't switch to another encoding halfway.

>
>If the data in the database is mixed, then the charset has to be unified
>before it is injected into the "page", or else characters may not be displayed
>correctly. Trying to show iso-8859 characters as UTF-8 will result in quite
>a few question marks, and the other way around will result in strange characters.


I agree.
--
( Kees
)
c[_] There is only one boss, the customer. And he can fire
everybody in the company from the chairman on down,
simply by spending his money somewhere else. (Sam Walton) (#35)

#8
01-20-2008
Kees Nuyt

On Sun, 20 Jan 2008 14:08:23 -0500, firewoodtim@yahoo.com
wrote:

>On Sun, 20 Jan 2008 19:51:45 +0100, Kees Nuyt <k.nuyt@nospam.demon.nl>
>wrote:
>
>>[snip]
>
>This is what I thought was likely, and it certainly makes sense.
>However, what threw me was concern about how a browser set to read
>an html file using utf8 would be able to recognize a 1 byte character,
>when it was also expecting 2 or 3 or even 4 byte characters as well.
>Maybe I don't have to worry about this and can proceed on faith alone,
>but it would be both interesting and reassuring to know the answer.
>Does anyone know the way this is accomplished?


Here is how UTF-8 works:
http://en.wikipedia.org/wiki/UTF-8#Description
--
( Kees
)
c[_] Why is the alphabet in that order? Is it because of that song? (#495)

#9
01-21-2008
Axel Schwenke

Kees Nuyt wrote:
> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
> wrote:
>
>> I have a table in MySQL that is presently entirely encoded with latin1
>> charset. Several of the varchar fields are used as CSS styling values
>> and, as such, do not need to be encoded with utf8. However, two of
>> the fields become visible content in the HTML page, and I want to
>> change the encoding for those two fields to utf8.
>>
>> What kind of header() and meta tag specifiers should I use in a mixed
>> encoding situation like that? If I specify "UTF-8" will the browser
>> be able to automatically distinguish the situations that require
>> reading one byte (latin1) from those requiring more bytes (utf8)?

>
> Using UTF-8 won't hurt.


I strongly disagree.

utf8 adds significant overhead to string processing, and in many cases MySQL
has to reserve 3 bytes per utf8 character (e.g. for index or record size).

The right approach is to use the "minimal" charset that can encode the
characters in question. So in the aforementioned table, only the two fields
should be changed to use the utf8 encoding.
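
The size argument can be put into numbers with a small Python sketch (illustrative only; the 50-character length is borrowed from the VARCHAR(50) example later in the thread, and the helper name is made up). It shows the worst-case number of bytes that has to be allowed for per column, assuming 1 byte per character for latin1 and up to 3 for MySQL's utf8.

```python
# Worst-case bytes per character for the two charsets discussed here.
# MySQL's "utf8" charset stores at most 3 bytes per character.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}


def reserved_bytes(char_length, charset):
    """Worst-case byte size of a column holding char_length characters."""
    return char_length * MAX_BYTES_PER_CHAR[charset]


for charset in ("latin1", "utf8"):
    print(charset, reserved_bytes(50, charset))
```

The actual bytes stored for ASCII-only values are the same in both charsets; the difference is in the space that has to be reserved wherever a fixed worst-case size is needed.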

> Mixed encoding within a web page isn't really possible.


Right.

MySQL solved this problem with the introduction of the client-related
charset settings in 4.1. Now you can store data in e.g. latin1, but all data
the client sends to or retrieves from the database can be e.g. utf8 (and
will automagically be converted).

So the final suggestion is:
- declare utf8 encoding in the HTTP header
- use 'SET NAMES utf8' to declare all client <-> database traffic to use the
utf8 encoding
- store data in the appropriate encoding, use utf8 only where necessary
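
What the automatic conversion amounts to can be sketched conceptually in Python. This is only a model of the idea, not MySQL's implementation: the function name is hypothetical, and the mapping from MySQL charset names to Python codecs is an assumption (MySQL's latin1 is close to Windows-1252, which Python calls `cp1252`).

```python
# Assumed mapping from MySQL charset names to Python codec names.
PYTHON_CODEC = {"latin1": "cp1252", "utf8": "utf-8"}


def convert_for_client(stored, column_charset, client_charset):
    """Re-encode a stored column value into the connection's charset.

    Conceptual model of what the server does once the client has
    declared its charset (e.g. with SET NAMES utf8).
    """
    text = stored.decode(PYTHON_CODEC[column_charset])
    return text.encode(PYTHON_CODEC[client_charset])


# A latin1 column value containing e-acute (byte 0xE9) ...
stored_value = b"caf\xe9"
# ... reaches a utf8 client as the two-byte sequence 0xC3 0xA9.
print(convert_for_client(stored_value, "latin1", "utf8"))
```

With the connection declared as utf8, the application only ever sees UTF-8 bytes, regardless of which charset each column uses for storage.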


XL
--
Axel Schwenke, Support Engineer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

#10
01-21-2008
firewoodtim@yahoo.com

On Mon, 21 Jan 2008 18:19:52 +0100, Axel Schwenke
<axel.schwenke@gmx.de> wrote:

>[snip]
>
>So the final suggestion is:
>- declare utf8 encoding in the HTTP header
>- use 'SET NAMES utf8' to declare all client <-> database traffic to use the
>utf8 encoding
>- store data in the appropriate encoding, use utf8 only where necessary
>
>
>XL


So, using this suggestion, I could just do the following (using the
same table and fields as in my original description):
1. ALTER TABLE table_name MODIFY varchar_column VARCHAR(50)
CHARACTER SET utf8 NOT NULL;
2. ALTER TABLE table_name MODIFY text_column TEXT
CHARACTER SET utf8 NOT NULL;
I issue a similar ALTER command for each column in each table that I
want to convert to utf8.

Steps 1 & 2 will take care of converting the data from latin1 to utf8
and will set the charset for those columns only to utf8,
leaving the table's default encoding intact at latin1.

3. In my PHP file, right after connecting to the db, add the line:
mysql_query("SET NAMES 'utf8'");
4. Set the HTTP header using the line:
header('Content-Type: text/html; charset=UTF-8');
5. Set a meta tag in all my scripts to:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Will that take care of it correctly? Are there any other steps to
take?
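
The page-side half of these steps (header, meta tag, and a single output encoding) can be sketched in Python for illustration. This is not the PHP from the thread: the function name and sample values are hypothetical, and the inputs are assumed to be already-decoded text, as they would be when the database connection delivers everything in one charset.

```python
from html import escape


def build_response(style_value, visible_text):
    """Return (headers, body_bytes) for a page declared and encoded as UTF-8.

    Both inputs are text strings; the whole page is encoded exactly once,
    so CSS values and visible content end up in the same encoding.
    """
    html = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
        "</head><body>"
        '<p style="{style}">{text}</p>'
        "</body></html>"
    ).format(style=escape(style_value, quote=True), text=escape(visible_text))
    body = html.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


# A CSS value (ASCII only) and visible content with a non-ASCII character.
headers, body = build_response("color: red", "caf\u00e9")
print(headers)
print(body)
```

The declared charset in the header and meta tag matches the encoding actually applied to the bytes, which is the property the five steps are meant to guarantee.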