Thread: mb_convert_encoding converting to ASCII instead of UTF-8 (PHP General forum, PHP Programming Forums category)
|
|||
|
I've run into a problem where mb_convert_encoding seems to be converting to ASCII, even though I'm telling it to convert to UTF-8. This is with PHP version 4.3.11.

I had been asking it to convert from "auto" to UTF-8, so at first I thought maybe "auto" was not the right choice. So I called mb_detect_encoding to see the format of what I was trying to convert; it said it was already UTF-8 (before I did the conversion). So then I thought maybe I got the "from" and "to" parameters backwards (although I was confident I was following the documentation), so I changed mb_convert_encoding to use "UTF-8" as /both/ the from and to.

It still converts to ASCII.

I understand that, given that it's already UTF-8, I don't need to convert it to UTF-8. But other things that I receive might /not/ be UTF-8, so I am still concerned with this.

Sample code:

<html><head><title>Minnie</title></head><body><p>
<?php
$x = $_REQUEST['Minnie'];
echo $x . ' ... ' . mb_detect_encoding ( $x ) . '<br/>';
$x = mb_convert_encoding ( $x, "UTF-8", "UTF-8" );
echo $x . ' ... ' . mb_detect_encoding ( $x ) . '<br/>';
?>
</p></body></html>

Output, when called with URL parameter "Minnie=Miñoso":

Miñoso ... UTF-8
Mioso ... ASCII

Then I changed the "from" so that I could try converting from something other than UTF-8:

$x = mb_convert_encoding ( $x, "UTF-8", mb_detect_encoding ( $x ) );

And now, output when called with "Minnie=Mouse":

Mouse ... ASCII
Mouse ... ASCII

Does anyone have any idea what's going on here? Am I doing something wrong?

Thanks in advance for any help.
|
|||
|
A little additional info: The "ASCII to ASCII" case for "Minnie=Mouse" is merely because the UTF-8 encoding for "Mouse" is the same as the ASCII encoding for "Mouse", and mb_detect_encoding is matching on ASCII before UTF-8. So that's not an issue.

But, the "UTF-8 to ASCII" case for "Minnie=Miñoso" is still (seemingly) screwy.
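The point about "Mouse" can be checked directly. The following is a minimal sketch, assuming a PHP version newer than the 4.3.11 used in the thread (mb_check_encoding is not available there); the variable names are illustrative only.

```php
<?php
// Pure-ASCII text has the same bytes in ASCII and in UTF-8.
$s = "Mouse";

$validAscii = mb_check_encoding($s, "ASCII");
$validUtf8  = mb_check_encoding($s, "UTF-8");

// Converting from ASCII to UTF-8 therefore leaves the bytes untouched.
$converted = mb_convert_encoding($s, "UTF-8", "ASCII");
$unchanged = ($converted === $s);

var_dump($validAscii, $validUtf8, $unchanged);
echo bin2hex($s), "\n"; // 4d6f757365, every byte is below 0x80
```

Because both answers are valid for such a string, which label a detector reports for it says nothing about whether a conversion went wrong.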
|
|||
|
OK, now the problem seems to be not that mb_convert_encoding is encoding incorrectly, it's that mb_detect_encoding is detecting incorrectly. It's claiming that the raw string as received from the browser is UTF-8, where in reality it seems to be ISO-8859-1. Sample code:

<html><head><title>Minnie</title></head><body><p>
<?php
function output ( $label, $x ) {
    echo $label . ': ' . $x . ' ... ' . mb_detect_encoding ( $x ) . '<br/>';
}

$x = $_REQUEST['Minnie'];
output ( "Raw", $x );
output ( "Convert from detected",
    mb_convert_encoding ( $x, "UTF-8", mb_detect_encoding ( $x ) ) );
output ( "Convert from ISO",
    mb_convert_encoding ( $x, "UTF-8", "ISO-8859-1" ) );
?>
</p></body></html>

Output for "Minnie=Mi%F1oso":

Raw: Mi?oso ... UTF-8
Convert from detected: Mioso ... ASCII
Convert from ISO: Miñoso ... UTF-8
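Since detection turned out to be the unreliable step, one option is to validate rather than guess. The sketch below is only an illustration: to_utf8 is a hypothetical helper (not part of mbstring), it assumes PHP 7 or later, and it assumes that any input which is not valid UTF-8 is ISO-8859-1, which matches this thread but may not hold for other inputs.

```php
<?php
/**
 * Hypothetical helper: return $raw as UTF-8.
 * Assumption: input that is not valid UTF-8 is in $fallback (ISO-8859-1 here).
 */
function to_utf8(string $raw, string $fallback = "ISO-8859-1"): string
{
    // Validation is a yes/no question about the bytes, not a guess.
    if (mb_check_encoding($raw, "UTF-8")) {
        return $raw;
    }
    return mb_convert_encoding($raw, "UTF-8", $fallback);
}

$fromLatin1 = to_utf8("Mi\xF1oso");     // ñ as the single byte 0xF1
$fromUtf8   = to_utf8("Mi\xC3\xB1oso"); // ñ already encoded as UTF-8

echo bin2hex($fromLatin1), "\n"; // 4d69c3b16f736f
echo bin2hex($fromUtf8), "\n";   // 4d69c3b16f736f
```

Both inputs end up as the same UTF-8 bytes, and already-valid UTF-8 is passed through untouched.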
|
|||
|
And the culprit is that mb_detect_order() wasn't set up to handle ISO-8859-1. It was "ASCII, UTF-8". Changing it to "ASCII, UTF-8, ISO-8859-1" makes everything work as expected.
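The effect of the detection order can be reproduced with the byte from the "Mi%F1oso" test. This is a sketch under assumptions: it targets PHP 7 or later rather than 4.3.11, and it passes true for the strict argument of mb_detect_encoding, which the thread's code did not do.

```php
<?php
// ñ sent as the single ISO-8859-1 byte 0xF1, as in "Minnie=Mi%F1oso".
$raw = "Mi\xF1oso";

// Strict detection with only ASCII and UTF-8 as candidates finds no match.
$narrow = mb_detect_encoding($raw, ["ASCII", "UTF-8"], true);

// Adding ISO-8859-1 to the detection order gives it a valid candidate.
mb_detect_order(["ASCII", "UTF-8", "ISO-8859-1"]);
$wide = mb_detect_encoding($raw, mb_detect_order(), true);

$utf8 = mb_convert_encoding($raw, "UTF-8", $wide);

var_dump($narrow);         // bool(false)
echo $wide, "\n";          // ISO-8859-1
echo bin2hex($utf8), "\n"; // 4d69c3b16f736f
```

One caveat: every byte sequence is valid ISO-8859-1, so placing it last in the order makes it a catch-all. Input in some other single-byte encoding would be labelled ISO-8859-1 as well.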
|
|||
|
At 11:28 AM -0400 4/23/08, Robert William Vesterman wrote:
>A little additional info: The "ASCII to ASCII" case for "Minnie=Mouse" is merely because the UTF-8 encoding for "Mouse" is the same as the ASCII encoding for "Mouse", and mb_detect_encoding is matching on ASCII before UTF-8. So that's not an issue.
>
>But, the "UTF-8 to ASCII" case for "Minnie=Miñoso" is still (seemingly) screwy.

Going for "UTF-8 to ASCII" is not going to work. The ASCII to UTF-8 works because ASCII is contained within UTF-8. But the reverse is not true. Not all of UTF-8 is contained within ASCII.

For example, the character (code-point) ñ does not appear in ASCII, so that doesn't work.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
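To make the subset relationship concrete, here is a small sketch (assuming PHP 7 or later; the variable names are illustrative) that lists the bytes of "Miñoso" in UTF-8 that fall outside the 7-bit ASCII range.

```php
<?php
$name = "Mi\xC3\xB1oso"; // "Miñoso" encoded as UTF-8

// Collect every byte that falls outside the 7-bit ASCII range.
$nonAscii = [];
foreach (str_split($name) as $byte) {
    if (ord($byte) > 0x7F) {
        $nonAscii[] = bin2hex($byte);
    }
}

$isAscii = mb_check_encoding($name, "ASCII");
$isUtf8  = mb_check_encoding($name, "UTF-8");

echo implode(" ", $nonAscii), "\n"; // c3 b1
var_dump($isAscii, $isUtf8);        // bool(false), bool(true)
```

The two bytes c3 b1 are the UTF-8 encoding of ñ; neither has a meaning in ASCII, which is why the string is valid UTF-8 but not valid ASCII.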
|
|||
|
I wasn't saying I was /telling/ it to go from UTF-8 to ASCII. I was saying it /was/ going from UTF-8 to ASCII, despite the fact that I was telling it to go from UTF-8 to UTF-8.

And as noted previously in this thread, it turned out to be because mb_detect_encoding was /mistakenly/ detecting it as UTF-8 in the first place. It was actually ISO-8859-1, not UTF-8. So when I told it to convert from UTF-8 (which mb_detect_encoding said it was), mb_convert_encoding ran into a byte that is not valid UTF-8 (the ñ, sent as the single ISO-8859-1 byte 0xF1), and so threw it away. The generated output was therefore all straight ASCII characters, which mb_detect_encoding therefore said was ASCII.
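The chain of events described here can be walked through step by step. The sketch below simulates the discarded byte by hand with str_replace rather than calling mb_convert_encoding, because what the converter does with an invalid byte may differ between PHP versions; it assumes PHP 7 or later.

```php
<?php
$raw = "Mi\xF1oso"; // what the browser sent: ISO-8859-1, not UTF-8

// Step 1: the input is not valid UTF-8, so "UTF-8" was a misdetection.
$rawIsUtf8 = mb_check_encoding($raw, "UTF-8");

// Step 2: simulate the converter discarding the offending byte.
// (Done by hand here; this is not a call into mb_convert_encoding.)
$lossy = str_replace("\xF1", "", $raw);

// Step 3: what is left is plain ASCII, hence the "ASCII" label.
$lossyIsAscii = mb_check_encoding($lossy, "ASCII");

var_dump($rawIsUtf8);    // bool(false)
echo $lossy, "\n";       // Mioso
var_dump($lossyIsAscii); // bool(true)
```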