05-13-2008
Chuck Anderson
 
Re: fread and UTF-16 encodings

Dikkie Dik wrote:
> Chuck Anderson wrote:
>
>> I am having trouble reading files that are using UTF-16 encoding.
>>
>> I noticed this when I started trying to read an xml file produced by
>> Winamp. And now I see it in the id3tags of files created by Winamp as
>> well.
>>

>
> OK. Is this application-related or a general question? If
> application-related, you will have to learn the encoding from the
> application's documentation itself or just by trying.
>


It's a general question. Until now I had never opened a text file with
PHP that I could not read.

>
>> When I use fread to get the file contents (a playlist), I get this:
>>
>> ÿþ<�?�x�m�l� �v�e�r�s�i�o�n�=�"�1�.�0�"� �e�n�c�o�d�
>> i�n�g�=�"�U�T�F�-�1�6�"�?�>�<�p�l�a�y�l�i�s�t�s� �p�l
>> �a�y�l�i�s�t�s�=�"�7�"�>�<�p�l�a�y�l�i�s�t� �f�i�l�e�
>> n�a�m�e�=�"�p�l�f�2�2�C�D�.�m�3�u�8�"� ....
>>
>> (I have wrapped that text manually)
>>
>> When I pass that string to xml_parse, however, it properly decodes it
>> and gives me this:
>>
>> <PLAYLISTS PLAYLISTS="7"><PLAYLIST FILENAME="plf22CD.m3u8 ....
>>
>> How do I detect when file data is encoded like this, and then how should
>> I work with it?
>>

>
> If you get the file from the internet or serve it yourself, the encoding
> is in the Content-Type header.


It is a text file created on my Windows XP PC by Winamp, as are the mp3
files.

> You have just discovered how stupid it is
> to use a meta tag for that: you can't read the encoding because it is
> encoded in the unknown encoding! It is similar to locking the key to a
> safe inside it. But you'd be amazed by how many applications lock away
> the key...
>
> Anyway, there are two utf-16 encodings: Big Endian and Little Endian
> (often abbreviated to utf-16 BE and utf-16 LE). The difference is in the
> order of the byte pairs.
>
> You should be able to convert them with the mb_string functions.
>


Okay, I'm starting to make some sense of this with the mbstring
functions, although the output of some of them is puzzling:
mb_detect_encoding says a UTF-16LE string is ISO-8859-1, and
mb_check_encoding accepts it as UTF-16BE (if that is checked first)
instead of UTF-16LE.
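
Since the detection functions disagree, checking the byte order mark directly is one alternative. The "ÿþ" at the start of the fread output above is the byte pair FF FE (the UTF-16LE BOM) displayed as Latin-1. Here is a minimal sketch of that check, assuming PHP 7.1 or later for the type declarations; detect_bom_encoding is a hypothetical helper name, not a built-in function:

```php
<?php
// Hypothetical helper: guess the encoding from a leading byte order mark.
// Returns null when no BOM is present; the caller then needs some other
// source of knowledge about the file.
function detect_bom_encoding(string $data): ?string
{
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($data, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';   // low byte first
    }
    if (strncmp($data, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';   // high byte first
    }
    return null;
}

// BOM followed by the characters "<" and "x" as UTF-16LE byte pairs.
$sample = "\xFF\xFE<\x00x\x00";
echo detect_bom_encoding($sample), "\n"; // prints UTF-16LE
```

This only works for files that actually start with a BOM. It also ignores UTF-32, whose little-endian BOM begins with the same two bytes.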

mb_convert_encoding does a pretty good job, but it does not strip the
BOM (byte order mark), which is the first two bytes of the file.
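
One way around the leftover BOM is to cut the two bytes off before converting. This is a rough sketch, assuming the mbstring extension is loaded and the data is known to be UTF-16LE; the sample string is made up for illustration:

```php
<?php
// Illustrative input: a UTF-16LE BOM followed by the text "hi".
$raw = "\xFF\xFE" . "h\x00i\x00";

// Drop the two BOM bytes, if present, so they do not end up in the output.
if (substr($raw, 0, 2) === "\xFF\xFE") {
    $raw = substr($raw, 2);
}

// Argument order is: string, target encoding, source encoding.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
echo $utf8, "\n"; // prints hi
```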

I have only run quick tests this evening on all of this, but that is
what I am seeing.

It appears that iconv actually does the conversion best (skips over the
BOM).
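
For comparison, the iconv call looks like this. It is a sketch under the assumption that the iconv extension is available. Passing the generic name 'UTF-16' as the source lets the underlying iconv library use the BOM to pick the byte order, but whether the BOM is consumed can depend on which iconv implementation PHP was built against, so the result is worth checking on your own system:

```php
<?php
// Illustrative input: a UTF-16LE BOM followed by the text "hi".
$raw = "\xFF\xFE" . "h\x00i\x00";

// Argument order is: source encoding, target encoding, string.
// iconv returns false on failure.
$utf8 = iconv('UTF-16', 'UTF-8', $raw);

// Dump the result as hex so that any surviving BOM bytes are visible.
var_dump($utf8 === false ? false : bin2hex($utf8));
```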

>
>> I have found utf16_decode on the php.net site, but when I use that
>> function I get an empty string.
>>


D'oh! A classic idiot mistake. I could not see the string because it
was all XML and I needed to display it with htmlentities(). The
function actually works fine.
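
For anyone who hits the same thing: the browser treats the decoded XML as markup and shows nothing, so it has to be escaped before being echoed into an HTML page. A small sketch with a made-up XML string (htmlspecialchars is used here; htmlentities works as well):

```php
<?php
// Illustrative XML, loosely modelled on the playlist file described above.
$xml = '<playlists playlists="7"><playlist filename="plf22CD.m3u8"/></playlists>';

// Escape <, >, & and quotes so the tags appear as text instead of
// being interpreted by the browser.
$escaped = htmlspecialchars($xml, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";
```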

>>

>
> I could not find that function, so is it defined and is your error
> handling ignoring undefined functions?
>


I posted a link to it in a followup. It came from a contributor to the
php.net discussions. Here is the link again:
http://us.php.net/manual/en/function...code.php#49185

This is not needed, however, if you have the mbstring extension (or
iconv, which is included in PHP 5).

> Good luck!
>


I can see I have a lot to learn, but I think I'm going to be able to
detect encoding with mbstring functions, and so far, it appears that
iconv will do the best job of converting.

Thanks for the pointers!

(Tonight I looked through getid3_lib, http://getid3.sourceforge.net,
which I use in my PHP mp3 music catalog. It displays the id3 tag
information from these Winamp-modified tags properly, and I wanted to
see how it does it. I found that there is a lot of code in there to
determine and convert between character encodings. It is all starting
to make sense to me now.)

--
*****************************
Chuck Anderson • Boulder, CO
http://www.CycleTourist.com
Nothing he's got he really needs
Twenty first century schizoid man.
***********************************
