how to take a string and weed out characters that are not UTF-8?


Go Back   Usenet Forums > PHP Programming Forums > PHP Language

  #1 (permalink)  
Old 05-28-2005
lkrubner@geocities.com
 

What I need to do is find out which characters in a string are not
valid in the UTF-8 encoding. The problem arises when someone logs
in and uses my PHP script to create a weblog post. They are presented
with a form that has a textarea. If they type in words and then hit
submit, then all is fine. But if they write their entry in WordPerfect
or Microsoft Word or some such, and copy and paste it, then they might
be bringing strange characters into their post.

HTML is forgiving and sends out the wrongly encoded characters, which
show up on the screen as garbage characters. I've decided that I don't
care about this issue. I don't mind garbage characters showing on HTML
pages.

XML is less forgiving, and because of that, I cannot get my RSS output
to work. Again, I don't mind garbage characters, but XML is strict and
if a parser runs into a character that is not in the encoding that is
declared at the top, then it dies.

So, given a string, I have to go through it and find everything that
is not valid UTF-8. Then I need to turn those characters into something
harmless that is valid UTF-8, maybe an ASCII question mark.

But how is this done? Given a string, how does one go through it and
find all the characters that are not UTF-8? Clearly, the RSS readers do
this easily enough, since they reject my RSS feeds on that ground, but
how do I do it too?
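
As a starting point, here is a minimal sketch of the detection half of
the question. It assumes the mbstring extension is loaded and PHP 7 or
later; is_valid_utf8 is a hypothetical helper name, not something from
the original script. It only reports whether the whole string is
well-formed UTF-8, it does not say where the problem is.

```php
<?php
// Hypothetical helper: true when $text is well-formed UTF-8.
// Relies on mb_check_encoding() from the mbstring extension.
function is_valid_utf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

$typed  = "plain ASCII text";  // ASCII is always valid UTF-8
$pasted = "caf\xE9";           // "café" as single-byte Latin-1/Windows-1252;
                               // a lone 0xE9 byte is not valid UTF-8

var_dump(is_valid_utf8($typed));   // bool(true)
var_dump(is_valid_utf8($pasted));  // bool(false)
```

A check like this could run on the submitted textarea content before
it is stored, so the RSS output only ever sees strings that passed.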





I had to give up on the character encoding issue for a few months, but
I'm back at it now. I think I understand the problem I face a little
more clearly now.


This was a good essay:

http://www.joelonsoftware.com/articles/Unicode.html


This was also good:

http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html


This page has some interesting demos:

http://www1.tip.nl/~t876506/UnicodeDisplay.html




Doing what is suggested here sounds nice:

http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

I mean the part where it speaks of "More than one 8-bit repertoire, but
predominantly Latin text". But how does one find out what a character
is when you don't know the encoding?

  #2 (permalink)  
Old 05-28-2005
lkrubner@geocities.com
 

Simon Stienen had some great advice in the following post. Yet even
after I did as he said and looked in Wikipedia, I'm still unclear on
how to determine that something is certainly not UTF-8.


http://groups-beta.google.com/group/...8b9bef7877408d

Simon Stienen Sep 29 2004, 7:37 pm
How validation is done:
Take the string. If there is no character 0x80 to 0xFF, it doesn't
matter, whether you define this text as UTF-8 or any ISO encoding,
since the first 128 characters all have the same bit sequence in these
encodings. However, if there actually *are* characters with a value of
128 or higher, check, whether the given sequence would be a valid UTF-8
sequence (see UTF-8 in Wikipedia for this). If this and every other
sequence is valid UTF-8, the string itself *might* be UTF-8. Of course
it could be a sequence of extended ASCII/ANSI characters, too. It's
impossible to be sure about that.
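
The validation described above can be sketched by hand as a byte
scanner. This is only an illustration, assuming PHP 7 or later;
replace_invalid_utf8 is a hypothetical name. It copies ASCII bytes and
well-formed multi-byte sequences through unchanged and swaps every
other byte for a replacement (an ASCII question mark by default). It
cannot tell what the bad bytes were meant to be, only that they are
not UTF-8.

```php
<?php
// Hypothetical helper: replace each byte that is not part of a
// well-formed UTF-8 sequence with $replacement.
function replace_invalid_utf8(string $bytes, string $replacement = '?'): string
{
    $out = '';
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {                       // ASCII: same in UTF-8 and ISO encodings
            $out .= $bytes[$i];
            $i++;
            continue;
        }
        // Lead byte decides how many continuation bytes must follow
        // and the smallest code point that length may encode.
        if ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1; $cp = $b & 0x1F; $min = 0x80;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2; $cp = $b & 0x0F; $min = 0x800;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3; $cp = $b & 0x07; $min = 0x10000;
        } else {                               // stray continuation byte or illegal lead
            $out .= $replacement;
            $i++;
            continue;
        }
        $ok = ($i + $need < $len);             // enough bytes left in the string?
        for ($k = 1; $ok && $k <= $need; $k++) {
            $c = ord($bytes[$i + $k]);
            if (($c & 0xC0) !== 0x80) {        // continuation bytes look like 10xxxxxx
                $ok = false;
            } else {
                $cp = ($cp << 6) | ($c & 0x3F);
            }
        }
        // Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
        if ($ok && ($cp < $min || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF))) {
            $ok = false;
        }
        if ($ok) {
            $out .= substr($bytes, $i, $need + 1);
            $i += $need + 1;
        } else {
            $out .= $replacement;              // drop only the offending byte, then rescan
            $i++;
        }
    }
    return $out;
}

// Windows-1252 bytes for an accented letter and curly quotes
echo replace_invalid_utf8("caf\xE9 \x93ok\x94"), "\n";  // caf? ?ok?
```

As the quoted post warns, passing this check only means the string
*might* be UTF-8; text in a single-byte encoding can occasionally form
valid UTF-8 sequences by accident.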

  #3 (permalink)  
Old 05-28-2005
lkrubner@geocities.com
 

Never mind. This seems to have solved my problems:

http://uk.php.net/manual/en/function...t-encoding.php
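
The link above is truncated, so it is not clear exactly which function
is meant. Assuming it is one of the mbstring encoding functions, one
possible approach looks like the sketch below. The function name
to_utf8 and the guess that pasted text is Windows-1252 are my
assumptions, not something stated in the thread; it needs the mbstring
extension and PHP 7 or later.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise treat the
// bytes as a legacy single-byte encoding and convert them to UTF-8.
// The Windows-1252 default is a guess that suits text pasted from
// Word on Windows; it is not detected from the data.
function to_utf8(string $text, string $assumedLegacy = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;                          // already well-formed UTF-8
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedLegacy);
}

$clean = to_utf8("caf\xE9");                   // Windows-1252 bytes in
var_dump(mb_check_encoding($clean, 'UTF-8'));  // bool(true)
```

Unlike replacing bad bytes with question marks, this keeps the
accented letters and curly quotes readable, but only if the guess
about the source encoding is right.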
