mb_convert_encoding converting to ASCII instead of UTF-8

#1
04-23-2008
Robert William Vesterman

I've run into a problem where mb_convert_encoding seems to be converting
to ASCII, even though I'm telling it to convert to UTF-8. This is with
PHP version 4.3.11.

I had been asking it to convert from "auto" to UTF-8, so at first I
thought maybe "auto" was not the right choice. So I called
"mb_detect_encoding" to see the format of what I was trying to convert;
it said it was already UTF-8 (before I did the conversion).

So then I thought maybe I got the "from" and "to" parameters backwards
(although I was confident I was following the documentation), so I
changed mb_convert_encoding to use "UTF-8" as /both/ the from and to.

It still converts to ASCII.

I understand that, given that it's already UTF-8, I don't need to
convert it to UTF-8. But other things that I receive might /not/ be
UTF-8, so I am still concerned with this.

Sample code:

<html><head><title>Minnie</title></head><body><p>
<?php
$x = $_REQUEST['Minnie'];
echo $x . ' ... ' . mb_detect_encoding ( $x ) . '<br/>';
$x = mb_convert_encoding ( $x, "UTF-8", "UTF-8" );
echo $x . ' ... ' . mb_detect_encoding ( $x ) . '<br/>';
?>
</p></body></html>

Output, when called with URL parameter "Minnie=Miñoso":

Miñoso ... UTF-8
Mioso ... ASCII

Then I changed the "from" so that I could try converting from something
other than UTF-8:

$x = mb_convert_encoding ( $x, "UTF-8", mb_detect_encoding ( $x ) );

And now, output when called with "Minnie=Mouse":

Mouse ... ASCII
Mouse ... ASCII

Does anyone have any idea what's going on here? Am I doing something wrong?

Thanks in advance for any help.

#2
04-23-2008
Robert William Vesterman

A little additional info: The "ASCII to ASCII" case for "Minnie=Mouse"
is merely because the UTF-8 encoding for "Mouse" is the same as the
ASCII encoding for "Mouse", and mb_detect_encoding is matching on ASCII
before UTF-8. So that's not an issue.

But, the "UTF-8 to ASCII" case for "Minnie=Miñoso" is still (seemingly)
screwy.
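
To make the byte-level point concrete, here is a minimal sketch (not from the original post). It assumes a PHP build with mbstring enabled; the variable names are only illustrative. It shows that "Mouse" has the same bytes in ASCII and UTF-8, while "ñ" is one byte in ISO-8859-1 and two bytes in UTF-8.

```php
<?php
// "Mouse" contains only bytes below 0x80, so its ASCII and UTF-8 forms
// are the same byte sequence.
$ascii = "Mouse";
$utf8  = mb_convert_encoding($ascii, "UTF-8", "ASCII");
var_dump($ascii === $utf8);

// "ñ" is the single byte 0xF1 in ISO-8859-1 but the two bytes 0xC3 0xB1
// in UTF-8, so the two encodings of "Miñoso" differ.
$latin1 = "Mi\xF1oso";
$asUtf8 = mb_convert_encoding($latin1, "UTF-8", "ISO-8859-1");
echo bin2hex($latin1), "\n"; // 4d69f16f736f
echo bin2hex($asUtf8), "\n"; // 4d69c3b16f736f
```

This is why a detector that tries ASCII first reports "Mouse" as ASCII: nothing in the bytes distinguishes the two.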

#3
04-23-2008
Robert William Vesterman

OK, now the problem seems to be not that mb_convert_encoding is encoding
incorrectly, but that mb_detect_encoding is detecting incorrectly.
It's claiming that the raw string as received from the browser is UTF-8,
whereas in reality it seems to be ISO-8859-1. Sample code:

<html><head><title>Minnie</title></head><body><p>
<?php
function output ( $label, $x ) {
    echo $label . ': ' . $x . ' ... ' . mb_detect_encoding ( $x ) .
        '<br/>';
}

$x = $_REQUEST['Minnie'];
output ( "Raw", $x );
output ( "Convert from detected",
    mb_convert_encoding ( $x, "UTF-8", mb_detect_encoding ( $x ) ) );
output ( "Convert from ISO",
    mb_convert_encoding ( $x, "UTF-8", "ISO-8859-1" ) );
?>
</p></body></html>

Output for "Minnie=Mi%F1oso":

Raw: Mi?oso ... UTF-8
Convert from detected: Mioso ... ASCII
Convert from ISO: Miñoso ... UTF-8
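
One way to confirm that the raw input really is not UTF-8 is to validate it rather than detect it. The following is a sketch only: it assumes a PHP version that provides mb_check_encoding (as far as I know, that function is not available on 4.3.11), and it hard-codes the bytes that "Minnie=Mi%F1oso" decodes to.

```php
<?php
// Bytes PHP receives for "Minnie=Mi%F1oso": the ñ is the single byte 0xF1.
$raw = "Mi\xF1oso";

// 0xF1 would have to start a multi-byte UTF-8 sequence, but the next byte
// ("o", 0x6F) is not a continuation byte, so the string is not valid UTF-8.
$isUtf8 = mb_check_encoding($raw, "UTF-8");

// Every byte value is defined in ISO-8859-1, so this check passes.
$isLatin1 = mb_check_encoding($raw, "ISO-8859-1");

var_dump($isUtf8);   // bool(false)
var_dump($isLatin1); // bool(true)
```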

#4
04-23-2008
Robert William Vesterman

And the culprit is that mb_detect_order() wasn't set up to handle
ISO-8859-1. It was "ASCII, UTF-8". Changing it to "ASCII, UTF-8,
ISO-8859-1" makes everything work as expected.
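
As a rough illustration of the effect of the candidate list, here is a sketch that passes the list explicitly instead of relying on the configured default. Using strict mode (the third argument) is my own choice, not something from the post above, and the variable names are illustrative. It assumes a reasonably current PHP with mbstring.

```php
<?php
// Raw bytes PHP receives for "Minnie=Mi%F1oso" (ISO-8859-1 "Miñoso").
$raw = "Mi\xF1oso";

// Candidate list without ISO-8859-1: in strict mode nothing matches.
$narrow = mb_detect_encoding($raw, array("ASCII", "UTF-8"), true);

// Candidate list with ISO-8859-1 added as the last resort.
$wide = mb_detect_encoding($raw, array("ASCII", "UTF-8", "ISO-8859-1"), true);

var_dump($narrow); // bool(false)
var_dump($wide);   // string(10) "ISO-8859-1"

// The same list can be made the default for later calls:
mb_detect_order("ASCII, UTF-8, ISO-8859-1");
```

Because any byte sequence is valid ISO-8859-1, it only makes sense as a final fallback; detection cannot tell it apart from other single-byte encodings.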

#5
04-23-2008
tedd

At 11:28 AM -0400 4/23/08, Robert William Vesterman wrote:
>A little additional info: The "ASCII to ASCII"
>case for "Minnie=Mouse" is merely because the
>UTF-8 encoding for "Mouse" is the same as the
>ASCII encoding for "Mouse", and
>mb_detect_encoding is matching on ASCII before
>UTF-8. So that's not an issue.
>
>But, the "UTF-8 to ASCII" case for
>"Minnie=Miñoso" is still (seemingly) screwy.


Going from UTF-8 to ASCII is not going to work.
The ASCII to UTF-8 conversion works because ASCII
is a subset of UTF-8. But the reverse is not
true. Not all of UTF-8 is contained within ASCII.

For example, the character (code point) ñ does
not appear in ASCII, so that doesn't work.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
#6
04-23-2008
Robert William Vesterman

I wasn't saying I was /telling/ it to go from UTF-8 to ASCII. I was
saying it /was/ going from UTF-8 to ASCII, despite the fact that I was
telling it to go from UTF-8 to UTF-8.

And as noted previously in this thread, it turned out to be because
mb_detect_encoding was /mistakenly/ detecting it as UTF-8 in the first
place. It was actually ISO-8859-1, not UTF-8. So when I told it to
convert from UTF-8 (which mb_detect_encoding said it was),
mb_convert_encoding ran into a byte that is not valid UTF-8 (the ñ, sent
as the single ISO-8859-1 byte 0xF1), and so threw it away. The generated
output was therefore all straight ASCII characters, which
mb_detect_encoding therefore said was ASCII.
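
For input that may or may not already be UTF-8, one possible approach is to validate first and convert only on failure. This is a sketch under assumptions, not the fix used in this thread: to_utf8 is a hypothetical helper name, the ISO-8859-1 fallback is a guess about what the clients send, and mb_check_encoding requires a newer PHP than 4.3.11.

```php
<?php
// Hypothetical helper (not from the thread): return $s as UTF-8.
// Valid UTF-8 is passed through untouched; anything else is assumed to be
// in $fallback (ISO-8859-1 by default) and converted from that.
function to_utf8($s, $fallback = "ISO-8859-1") {
    if (mb_check_encoding($s, "UTF-8")) {
        return $s;
    }
    return mb_convert_encoding($s, "UTF-8", $fallback);
}

echo bin2hex(to_utf8("Mi\xF1oso")), "\n";     // 4d69c3b16f736f
echo bin2hex(to_utf8("Mi\xC3\xB1oso")), "\n"; // 4d69c3b16f736f
echo to_utf8("Mouse"), "\n";                  // Mouse
```

Validating avoids the failure mode described above, where bytes that are wrongly labeled UTF-8 get dropped during conversion.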
