Greetings, Chuck Anderson.
In reply to Your message dated Tuesday, May 13, 2008, 08:39:14,
> Okay, I'm starting to make some sense of this with the mbstring functions,
> although the output of some functions is confusing: mb_detect_encoding says
> a UTF-16LE string is ISO-8859-1, and mb_check_encoding determines it is
> UTF-16BE (if that is checked first) instead of UTF-16LE.
> mb_convert_encoding does a pretty good job (but does not leave off the
> BOM - byte order mark - ... the first two bytes).
These functions assume that you have supplied a raw string, not the contents
of an encoded file.
So if the data looks UTF-encoded, you must strip the BOM yourself first.
General hint: when you are working with files, it is much easier to detect
the encoding than when you are working with user-supplied data.
Check the first 3 bytes (3 because a UTF-8 mark is 3 bytes long).
If they start with (0xFE 0xFF ...) or (0xFF 0xFE ...) - the file is UTF-16
(big-endian or little-endian respectively) or a binary file. Strip the first
two bytes from the string and then use mb_detect_encoding.
If they are (0xEF 0xBB 0xBF) - the file is very likely UTF-8.
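
The byte check described above could be sketched roughly like this. The
function names sniff_bom() and to_utf8() and the ISO-8859-1 fallback are
my own assumptions for illustration, not part of mbstring, and the syntax
assumes PHP 7.1 or later:

```php
<?php
// Hypothetical helper: look at the leading bytes and return
// [encoding name or null, length of the BOM in bytes].
function sniff_bom(string $data): array
{
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        return ['UTF-8', 3];
    }
    if (strncmp($data, "\xFE\xFF", 2) === 0) {
        return ['UTF-16BE', 2];
    }
    if (strncmp($data, "\xFF\xFE", 2) === 0) {
        return ['UTF-16LE', 2];
    }
    return [null, 0]; // no BOM found, encoding unknown
}

// Hypothetical helper: strip the BOM (if any) and convert to UTF-8.
// $fallback is an assumed default for data without a BOM.
function to_utf8(string $data, string $fallback = 'ISO-8859-1'): string
{
    [$encoding, $bomLength] = sniff_bom($data);
    $body = substr($data, $bomLength);
    return mb_convert_encoding($body, 'UTF-8', $encoding ?? $fallback);
}
```

For a file you would call something like to_utf8(file_get_contents($path)).
Note that a binary file can start with the same bytes by coincidence, so
treat the result as a good guess, not a proof.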
> I have only run quick tests this evening on all of this, but that is
> what I am seeing.
> It appears that iconv actually does the conversion best (skips over the
> BOM).
>> I could not find that function, so is it defined and is your error
>> handling ignoring undefined functions?
>>
> I posted a link to it in a followup. It was contributed by a
> contributor to the php.net discussions. Here is the link again:
> http://us.php.net/manual/en/function...code.php#49185
> This is not needed, however, if you have the mbstring extension (or
> iconv, which is included in PHP 5).
Actually, I'm using the mbstring extension much more heavily in my shell
scripts.
I just told PHP to encode all my output into the OEM codepage (cp866 for me),
so I can use the Windows encoding in all my scripts and still see correct
national characters whenever output goes to STDOUT.
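
A minimal sketch of that kind of conversion, in case it helps. It assumes
the scripts are written in Windows-1251 (the exact Windows codepage is not
stated above), and to_console() is a hypothetical name; this is one possible
way to do it, not necessarily the same setup:

```php
<?php
// Hypothetical helper: convert text from the (assumed) script encoding
// to the console's OEM codepage before it is written out.
function to_console(
    string $text,
    string $from = 'Windows-1251', // assumption: encoding of the script files
    string $to = 'CP866'           // OEM codepage named in the message
): string {
    return mb_convert_encoding($text, $to, $from);
}
```

To apply it to everything a script prints, one option is to put
ob_start(function ($buffer) { return to_console($buffer); }); at the top of
the script, so every echo goes through the converter on its way to STDOUT.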
--
Sincerely Yours, AnrDaemon <anrdaemon@freemail.ru>