fread and UTF-16 encodings

#1
05-11-2008
Chuck Anderson

I am having trouble reading files that are using UTF-16 encoding.

I noticed this when I started trying to read an xml file produced by
Winamp. And now I see it in the id3tags of files created by Winamp as well.

When I use fread to get the file contents (a playlist), I get this:

ÿþ<�?�x�m�l� �v�e�r�s�i�o�n�=�"�1�.�0ï¿ ½"� �e�n�c�o�d�
i�n�g�=�"�U�T�F�-�1�6�"�?�>�<�p�l�a�y�l�iï¿ ½s�t�s� �p�l
�a�y�l�i�s�t�s�=�"�7�"�>ï¿ ½<�p�l�a�y�l�i�s�t� �f�i�l�e�
n�a�m�e�=�"�p�l�f�2�2�C�Dï ¿½.�m�3�u�8�"� ....

(I have wrapped that text manually)

When I pass that string to xml_parse, however, it properly decodes it
and gives me this:

<PLAYLISTS PLAYLISTS="7"><PLAYLIST FILENAME="plf22CD.m3u8 ....

How do I detect when file data is encoded like this, and then how should
I work with it?

I have found utf16_decode on the php.net site, but when I use that
function I get an empty string.

--
*****************************
Chuck Anderson • Boulder, CO
http://www.CycleTourist.com
Nothing he's got he really needs
Twenty first century schizoid man.
***********************************

#2
05-11-2008
Chuck Anderson

Chuck Anderson wrote:
> I have found utf16_decode on the php.net site, but when I use that
> function I get an empty string.
>


I forgot to include the reference to the utf16_decode function I mentioned.

-- http://us.php.net/manual/en/function...code.php#49185


#3
05-12-2008
Dikkie Dik

Chuck Anderson wrote:
> I am having trouble reading files that are using UTF-16 encoding.
>
> I noticed this when I started trying to read an xml file produced by
> Winamp. And now I see it in the id3tags of files created by Winamp as
> well.


OK. Is this application-related or a general question? If
application-related, you will have to learn the encoding from the
application's documentation itself or just by trying.

> When I use fread to get the file contents (a playlist), I get this:
>
> ÿþ<�?�x�m�l� �v�e�r�s�i�o�n�=�"�1�.�0ï¿ ½"� �e�n�c�o�d�
> i�n�g�=�"�U�T�F�-�1�6�"�?�>�<�p�l�a�y�l�iï¿ ½s�t�s� �p�l
> �a�y�l�i�s�t�s�=�"�7�"�>ï¿ ½<�p�l�a�y�l�i�s�t� �f�i�l�e�
> n�a�m�e�=�"�p�l�f�2�2�C�Dï ¿½.�m�3�u�8�"� ....
>
> (I have wrapped that text manually)
>
> When I pass that string to xml_parse, however, it properly decodes it
> and gives me this:
>
> <PLAYLISTS PLAYLISTS="7"><PLAYLIST FILENAME="plf22CD.m3u8 ....
>
> How do I detect when file data is encoded like this, and then how should
> I work with it?


If you get the file from the internet or serve it yourself, the encoding
is in the Content-Type header. You have just discovered how stupid it is
to use a meta tag for that: you can't read the encoding because it is
encoded in the unknown encoding! It is similar to locking the key to a
safe inside it. But you'd be amazed by how many applications lock away
the key...

Anyway, there are two utf-16 encodings: Big Endian and Little Endian
(often abbreviated to utf-16 BE and utf-16 LE). The difference is in the
order of the byte pairs.

You should be able to convert them with the mb_string functions.
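
To make the byte-order difference concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The sample string "Hi" is only an illustration; the point is that the same text has its byte pairs swapped between the two encodings, and that naming the byte order explicitly lets mb_convert_encoding decode either one.

```php
<?php
// The same two characters, "Hi", encoded both ways.
$le = "H\x00i\x00"; // UTF-16LE: low byte of each pair first
$be = "\x00H\x00i"; // UTF-16BE: high byte of each pair first

// Show the raw bytes so the swapped pairs are visible.
echo bin2hex($le), "\n"; // 48006900
echo bin2hex($be), "\n"; // 00480069

// Naming the byte order explicitly means mbstring does not have to guess.
$fromLe = mb_convert_encoding($le, 'UTF-8', 'UTF-16LE');
$fromBe = mb_convert_encoding($be, 'UTF-8', 'UTF-16BE');
```

Both $fromLe and $fromBe end up holding the plain string "Hi".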

> I have found utf16_decode on the php.net site, but when I use that
> function I get an empty string.
>


I could not find that function, so is it defined and is your error
handling ignoring undefined functions?

Good luck!
#4
05-13-2008
Chuck Anderson

Dikkie Dik wrote:
> Chuck Anderson wrote:
>
>> I am having trouble reading files that are using UTF-16 encoding.
>>
>> I noticed this when I started trying to read an xml file produced by
>> Winamp. And now I see it in the id3tags of files created by Winamp as
>> well.
>>

>
> OK. Is this application-related or a general question? If
> application-related, you will have to learn the encoding from the
> application's documentation itself or just by trying.
>


It's a general question. I've never before opened a text file with PHP
that I could not read.

>
>> When I use fread to get the file contents (a playlist), I get this:
>>
>> ÿþ<�?�x�m�l� �v�e�r�s�i�o�n�=�"�1�.�0ï¿ ½"� �e�n�c�o�d�
>> i�n�g�=�"�U�T�F�-�1�6�"�?�>�<�p�l�a�y�l�iï¿ ½s�t�s� �p�l
>> �a�y�l�i�s�t�s�=�"�7�"�>ï¿ ½<�p�l�a�y�l�i�s�t� �f�i�l�e�
>> n�a�m�e�=�"�p�l�f�2�2�C�Dï ¿½.�m�3�u�8�"� ....
>>
>> (I have wrapped that text manually)
>>
>> When I pass that string to xml_parse, however, it properly decodes it
>> and gives me this:
>>
>> <PLAYLISTS PLAYLISTS="7"><PLAYLIST FILENAME="plf22CD.m3u8 ....
>>
>> How do I detect when file data is encoded like this, and then how should
>> I work with it?
>>

>
> If you get the file from the internet or serve it yourself, the encoding
> is in the Content-Type header.


It is a text file created on my WindowsXP PC by Winamp, as are the mp3
files.

> You have just discovered how stupid it is
> to use a meta tag for that: you can't read the encoding because it is
> encoded in the unknown encoding! It is similar to locking the key to a
> safe inside it. But you'd be amazed by how many applications lock away
> the key...
>
> Anyway, there are two utf-16 encodings: Big Endian and Little Endian
> (often abbreviated to utf-16 BE and utf-16 LE). The difference is in the
> order of the byte pairs.
>
> You should be able to convert them with the mb_string functions.
>


Okay, I'm starting to make some sense of this with the mbstring functions,
although the output of some of them is puzzling: mb_detect_encoding says a
UTF-16LE string is ISO-8859-1, and mb_check_encoding reports it as
UTF-16BE (if that is checked first) instead of UTF-16LE.

mb_convert_encoding does a pretty good job (but does not leave off the
BOM - byte order mark - ... the first two bytes).

I have only run quick tests this evening on all of this, but that is
what I am seeing.

It appears that iconv actually does the conversion best (skips over the
BOM).
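
A minimal sketch of one way to deal with the leftover BOM, assuming the mbstring extension and a little-endian file like the Winamp one. Whether the BOM survives a conversion can depend on the function and the encoding name used, so the sketch simply removes it afterwards if it is present. The sample bytes are made up for illustration.

```php
<?php
// BOM + "Hi" in UTF-16LE, standing in for the start of a Winamp file.
$raw = "\xFF\xFE" . "H\x00i\x00";

// Decode with an explicit byte order; the BOM may come through as U+FEFF.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

// U+FEFF is the three bytes EF BB BF in UTF-8; drop it if it is there.
if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
    $utf8 = substr($utf8, 3);
}
```

After this, $utf8 holds just the text, with no byte order mark in front.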

>
>> I have found utf16_decode on the php.net site, but when I use that
>> function I get an empty string.
>>


D'oh! A classic idiot mistake. I could not see the string because it
was all XML and I needed to display it with htmlentities(). The
function actually works fine.
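
For anyone hitting the same thing, a small sketch of the display problem: a browser treats the decoded XML as markup and shows nothing, so the angle brackets have to be escaped first. htmlspecialchars() is used here as the narrower sibling of htmlentities(); the sample string is taken from the playlist output above.

```php
<?php
// Decoded XML, as it comes back from the conversion.
$xml = '<PLAYLISTS PLAYLISTS="7">';

// Escape the markup characters so a browser shows them as text.
$visible = htmlspecialchars($xml, ENT_QUOTES, 'UTF-8');

echo $visible, "\n"; // &lt;PLAYLISTS PLAYLISTS=&quot;7&quot;&gt;
```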

>>

>
> I could not find that function, so is it defined and is your error
> handling ignoring undefined functions?
>


I posted a link to it in a followup. It was contributed by a
contributor to the php.net discussions. Here is the link again:
http://us.php.net/manual/en/function...code.php#49185

This is not needed, however, if you have the mbstring extension (or
iconv, which is included in PHP 5).

> Good luck!
>


I can see I have a lot to learn, but I think I'm going to be able to
detect encoding with mbstring functions, and so far, it appears that
iconv will do the best job of converting.

Thanks for the pointers!

(Tonight I looked through the getid3_lib, http://getid3.sourceforge.net,
which I use in my PHP mp3 music catalog. It displays the ID3 tag
information from these Winamp-modified tags properly, and I wanted to
see how it does it. I found that there is a lot of code in there to
determine and convert between character encodings. It is all starting
to make sense to me now.)


#5
3 Weeks Ago
AnrDaemon

Greetings, Chuck Anderson.
In reply to Your message dated Tuesday, May 13, 2008, 08:39:14,

> Okay, I'm starting to make some sense of this with the mbstring functions,
> although the output of some of them is puzzling: mb_detect_encoding says a
> UTF-16LE string is ISO-8859-1, and mb_check_encoding reports it as
> UTF-16BE (if that is checked first) instead of UTF-16LE.


> mb_convert_encoding does a pretty good job (but does not leave off the
> BOM - byte order mark - ... the first two bytes).


These functions work on the assumption that you've supplied a raw string,
not an encoded file.
So if the data looks UTF-encoded, you must strip the BOM yourself first.

General hint: if you're working with files, it is much easier to detect
the encoding than when working with user-supplied data.
Check the first 3 bytes (3 because of a possible UTF-8 mark).
If they are (0xFE 0xFF ...) or (0xFF 0xFE ...), the file is Unicode UTF-16
or a binary file. Strip the first two bytes from the string and use
mb_detect_encoding.
If they are (0xEF 0xBB 0xBF), the file is most likely UTF-8.
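
The check described above could be sketched roughly like this. The helper name sniff_bom and its return shape are made up for the example, and it assumes the mbstring extension for the conversion step. Note that it does not try to tell UTF-16LE apart from UTF-32LE, whose BOM starts with the same two bytes.

```php
<?php
// Hypothetical helper: returns [encoding name, BOM length in bytes],
// or [null, 0] when the data does not start with a BOM it knows.
function sniff_bom(string $head): array
{
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        return ['UTF-8', 3];
    }
    if (strncmp($head, "\xFF\xFE", 2) === 0) {
        return ['UTF-16LE', 2];
    }
    if (strncmp($head, "\xFE\xFF", 2) === 0) {
        return ['UTF-16BE', 2];
    }
    return [null, 0];
}

// Usage sketch; in practice $data would come from fread() or
// file_get_contents(). These sample bytes are BOM + "Hi" in UTF-16LE.
$data = "\xFF\xFE" . "H\x00i\x00";
[$encoding, $bomLength] = sniff_bom($data);

// Strip the BOM, then convert only when an encoding was recognized.
$text = $encoding === null
    ? $data
    : mb_convert_encoding(substr($data, $bomLength), 'UTF-8', $encoding);
```

When no BOM is found the data is passed through untouched, which leaves room for a fallback such as mb_detect_encoding.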

> I have only run quick tests this evening on all of this, but that is
> what I am seeing.


> It appears that iconv actually does the conversion best (skips over the
> BOM).


>> I could not find that function, so is it defined and is your error
>> handling ignoring undefined functions?
>>


> I posted a link to it in a followup. It was contributed by a
> contributor to the php.net discussions. Here is the link again:
> http://us.php.net/manual/en/function...code.php#49185


> This is not needed, however, if you have the mbstring extension (or
> iconv, which is included in PHP 5).


Actually, I'm using the mbstring extension in a much deeper way for my shell
scripts.
I just told PHP to encode all my output into the OEM codepage (cp866 for me),
so I can use Windows encoding in all my scripts and still see correct national
characters whenever output goes to STDOUT.
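
A rough sketch of the conversion step behind that setup, assuming the mbstring extension. The helper name to_oem is made up, and Windows-1251 as the source encoding is an assumption for the example; substitute whatever Windows codepage the scripts are actually written in.

```php
<?php
// Hypothetical helper: re-encode text for a console that uses the cp866
// OEM codepage. The Windows-1251 source encoding is an assumption.
function to_oem(string $text): string
{
    return mb_convert_encoding($text, 'CP866', 'Windows-1251');
}

// The Cyrillic capital letter Pe is the single byte 0xCF in Windows-1251.
$oem = to_oem("\xCF");

// Print the converted byte in hex so it can be compared with the input.
echo bin2hex($oem), "\n";
```

To apply this to everything a script prints, a function like this could be passed to ob_start() as the output callback.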


--
Sincerely Yours, AnrDaemon <anrdaemon@freemail.ru>
