How to take a string and weed out characters that are not UTF-8? (PHP Language forum, PHP Programming Forums)
What I need to do is find out which characters in a string are not supported by the UTF-8 encoding.

The problem arises when someone logs in and uses my PHP script to create a weblog post. They are presented with a form that has a textarea. If they type their words in directly and hit submit, all is fine. But if they write their entry in WordPerfect or Microsoft Word or some such and then copy and paste it, they may bring strange characters into their post. HTML is forgiving and sends out the wrongly encoded characters, which show up on screen as garbage. I've decided that I don't care about this issue; I don't mind garbage characters showing on HTML pages.

XML is less forgiving, and because of that I cannot get my RSS output to work. Again, I don't mind garbage characters, but XML is strict: if the parser runs into a character that is not in the encoding declared at the top, it dies.

So, given a string, I have to go through it and find everything that is not in the UTF-8 encoding. Then I need to turn those characters into something harmless, maybe an ASCII question mark, or anything else that is in the UTF-8 encoding. But how is this done? Given a string, how does one go through it and find all the characters that are not UTF-8? Clearly the RSS readers do this easily enough, since they reject my RSS feeds on those grounds, but how do I do it too?

I had to give up on the character encoding issue for a few months, but I'm back at it now. I think I understand the problem I face a little more clearly now.
This was a good essay:
http://www.joelonsoftware.com/articles/Unicode.html

This was also good:
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

This page has some interesting demos:
http://www1.tip.nl/~t876506/UnicodeDisplay.html

Doing what is suggested here sounds nice:
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

That is the part that speaks of "More than one 8-bit repertoire, but predominantly Latin text". But how does one find out what a character is when one doesn't know the encoding?
Simon Stienen had some great advice in the following post. Yet even after I did as he said and looked in Wikipedia, I'm still unclear on how to determine that something is certainly not UTF-8.

http://groups-beta.google.com/group/...8b9bef7877408d

Simon Stienen, Sep 29 2004, 7:37 pm:

How validation is done: Take the string. If there is no character 0x80 to 0xFF, it doesn't matter whether you define this text as UTF-8 or any ISO encoding, since the first 128 characters all have the same bit sequence in these encodings. However, if there actually *are* characters with a value of 128 or higher, check whether the given sequence would be a valid UTF-8 sequence (see UTF-8 in Wikipedia for this). If this and every other sequence is valid UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence of extended ASCII/ANSI characters, too. It's impossible to be sure about that.
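The procedure quoted above can be sketched in PHP roughly as follows. This is only an illustration, not code from the thread: the function names `is_ascii_only` and `looks_like_utf8` are made up here, and the byte rules are the standard UTF-8 ones (a lead byte that announces the sequence length, continuation bytes of the form 10xxxxxx, no overlong forms, no surrogates, nothing above U+10FFFF). A false result means the string is certainly not UTF-8; a true result only means it might be, as the quoted post points out.

```php
<?php
// Hypothetical helper (not from the thread): true if the string has no
// byte in the range 0x80-0xFF, so the choice of encoding does not matter.
function is_ascii_only(string $s): bool
{
    return preg_match('/[\x80-\xFF]/', $s) === 0;
}

// Hypothetical helper: true if every byte sequence in $s is well-formed
// UTF-8. False means the string is definitely NOT valid UTF-8.
function looks_like_utf8(string $s): bool
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            continue;                          // plain ASCII byte
        } elseif (($b & 0xE0) === 0xC0) {      // 110xxxxx: 2-byte sequence
            $extra = 1; $cp = $b & 0x1F; $min = 0x80;
        } elseif (($b & 0xF0) === 0xE0) {      // 1110xxxx: 3-byte sequence
            $extra = 2; $cp = $b & 0x0F; $min = 0x800;
        } elseif (($b & 0xF8) === 0xF0) {      // 11110xxx: 4-byte sequence
            $extra = 3; $cp = $b & 0x07; $min = 0x10000;
        } else {
            return false;                      // stray continuation or invalid lead byte
        }
        if ($i + $extra >= $len) {
            return false;                      // sequence is cut off at end of string
        }
        for ($j = 1; $j <= $extra; $j++) {
            $c = ord($s[$i + $j]);
            if (($c & 0xC0) !== 0x80) {
                return false;                  // continuation byte must be 10xxxxxx
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
        if ($cp < $min || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
            return false;
        }
        $i += $extra;                          // skip the continuation bytes
    }
    return true;
}
```

For example, `looks_like_utf8("caf\xE9")` returns false, because a lone 0xE9 (an "é" in ISO-8859-1) announces a three-byte sequence that never arrives, while `looks_like_utf8("caf\xC3\xA9")` returns true.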
Never mind. This seems to have solved my problems:
http://uk.php.net/manual/en/function...t-encoding.php
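The URL above is truncated, so which function it points to is not shown. As one possible approach, and only as a rough sketch that assumes the mbstring extension is loaded, something along these lines would turn anything that is not valid UTF-8 into a harmless character before the text goes into the RSS feed. The function name `scrub_to_utf8` is invented for this example.

```php
<?php
// Hypothetical helper, not from the thread. Assumes the mbstring
// extension is available (mb_check_encoding, mb_convert_encoding).
function scrub_to_utf8(string $s): string
{
    // Already well-formed UTF-8: return it unchanged.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Converting from UTF-8 to UTF-8 makes mbstring swap each malformed
    // byte sequence for its substitute character ('?' unless the
    // mbstring.substitute_character setting has been changed).
    return mb_convert_encoding($s, 'UTF-8', 'UTF-8');
}
```

If the pasted text is known to come from Word on Windows, passing 'Windows-1252' as the source encoding instead might keep characters such as curly quotes rather than replacing them, but that is a guess about the input, which is exactly the uncertainty the quoted post describes.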