Replace special characters by non-special characters

#1
11-05-2004
Pikkel

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

#2
11-05-2004
Michael Fesser

.oO(Pikkel)

>I'm looking for a way to replace special characters with characters
>without accents, cedillas, etc.


Maybe strtr()?

Micha

#3
11-05-2004
CJ Llewellyn

"Pikkel" <pikkel@de.wop> wrote in message
news:418beb8b$0$14941$e4fe514c@news.xs4all.nl...
> I'm looking for a way to replace special characters with characters
> without accents, cedillas, etc.


http://uk.php.net/manual/en/function...ecialchars.php




#4
11-06-2004
Pikkel

CJ Llewellyn wrote:

> "Pikkel" <pikkel@de.wop> wrote in message
> news:418beb8b$0$14941$e4fe514c@news.xs4all.nl...
>
>>I'm looking for a way to replace special characters with characters
>>without accents, cedillas, etc.

>
>
> http://uk.php.net/manual/en/function...ecialchars.php
>


Thanks for your tip, but I'm not looking for HTML entity replacement but
character replacement: á --> a

#5
11-06-2004
Pikkel

Michael Fesser wrote:

> .oO(Pikkel)
>
>
>>I'm looking for a way to replace special characters with characters
>>without accents, cedillas, etc.

>
>
> Maybe strtr()?
>
> Micha



With that function I would have to list every replacement myself.
I was looking for a complete strip function (accents, cedillas, umlauts, etc.).

#6
11-06-2004
Andy Hassall

On Fri, 05 Nov 2004 22:08:03 +0100, Pikkel <pikkel@de.wop> wrote:

>I'm looking for a way to replace special characters with characters
>without accents, cedillas, etc.


In what character set encoding? If it's a small one, e.g. iso-8859-15, just
list all the accented/non-accented pairs and run it through strtr.

If it's a Unicode variant, it's a bit more of a challenge...
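
A minimal sketch of the strtr approach, assuming the input really is iso-8859-15 (the function name is made up for illustration, and the table only covers the common accented Latin letters, so extend it as needed):

```php
<?php
// Hypothetical helper: strips accents from ISO-8859-15 text using a
// hand-written table. The pairs listed are an illustrative selection,
// not a complete list.
function strip_accents_latin9(string $text): string
{
    // With three arguments, strtr() replaces each byte in $from by the
    // byte at the same position in $to, so both must be the same length.
    $from = "\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF"
          . "\xD1\xD2\xD3\xD4\xD5\xD6\xD9\xDA\xDB\xDC\xDD"
          . "\xE0\xE1\xE2\xE3\xE4\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF"
          . "\xF1\xF2\xF3\xF4\xF5\xF6\xF9\xFA\xFB\xFC\xFD\xFF";
    $to   = "AAAAAACEEEEIIII"
          . "NOOOOOUUUUY"
          . "aaaaaaceeeeiiii"
          . "nooooouuuuyy";

    return strtr($text, $from, $to);
}

// "\xE9" is e-acute and "\xE8" is e-grave in ISO-8859-15.
echo strip_accents_latin9("caf\xE9 cr\xE8me"), "\n"; // cafe creme
```

The bytes are written as \x escapes so the example does not depend on the encoding the script file itself is saved in.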

--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

#7
11-06-2004
lawrence

Andy Hassall <andy@andyh.co.uk> wrote in message news:<tc4oo0pq1fcqitce1mk5c9d7g2srubsola@4ax.com>...
> On Fri, 05 Nov 2004 22:08:03 +0100, Pikkel <pikkel@de.wop> wrote:
>
> >I'm looking for a way to replace special characters with characters
> >without accents, cedillas, etc.
>
> In what character set encoding? If it's a small one, e.g. iso-8859-15, just
> list all the accented/non-accented pairs and run it through strtr.
>
> If it's a Unicode variant, it's a bit more of a challenge...


I'm possibly beating this subject to death, but I've yet to think of
an answer to the problem. If a user copies text from an iso-8859-15
page, pastes it into the textarea of a form, and submits it to a CMS
which then sends it out as UTF-8, one gets garbage characters, as one
can see on this page:

http://www.krubner.com/index.php?pageId=33396

So I'm wondering if there is a way to cycle through and find quote
marks and such that are unique to iso-8859-15?

#8
11-06-2004
Andy Hassall

On 6 Nov 2004 01:19:52 -0800, lkrubner@geocities.com (lawrence) wrote:

>I'm possibly beating this subject to death, but I've yet to think of
>an answer to the problem. If a user copies text from an iso-8859-15
>page, pastes it into the textarea of a form, and submits it to a CMS
>which then sends it out as UTF-8, one gets garbage characters, as one
>can see on this page:
>
>http://www.krubner.com/index.php?pageId=33396


There's probably a bit more to it than that, such as the encoding of the page
containing the form in the first place. If you just dump out ISO-8859-15
encoded data and pretend it's UTF-8, of course it won't work, except for the
shared ASCII (top bit not set, i.e. <= 127) representations between the two
encodings. I can't remember quite where you got to from the previous threads on
this subject though.

>So I'm wondering if there is a way to cycle through and find quote
>marks and such that are unique to iso-8859-15?


If it's between ISO-8859-15 and UTF-8, there are no characters unique to
ISO-8859-15, since UTF-8 encodes all those characters and more. The byte
values differ for every character above 127, but that's a different question.
The Euro, for example, is the same character in both, but is the single byte
0xA4 in ISO-8859-15 and the three bytes 0xE2 0x82 0xAC in UTF-8.
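
To see the "same character, different encoding" point for yourself, here is a small sketch. It assumes the mbstring extension is available; the variable names are only illustrative:

```php
<?php
// ISO-8859-15 stores the Euro sign as the single byte 0xA4.
$latin9Euro = "\xA4";

// Re-encode the same character as UTF-8
// (mb_convert_encoding() needs the mbstring extension).
$utf8Euro = mb_convert_encoding($latin9Euro, 'UTF-8', 'ISO-8859-15');

echo bin2hex($latin9Euro), "\n"; // a4
echo bin2hex($utf8Euro), "\n";   // e282ac
```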

But anyway, it seems to me that the simple approach is just:

(1) Present the form in UTF-8 in the first place.
(2) The user copies content from one site, in whatever encoding. Their browser
places it on the clipboard in some OS-native encoding which is hopefully
irrelevant.
(3) The user pastes it into the UTF-8 form. The browser converts the characters
into the appropriate encoding.
(4) Post the data; since the source form is UTF-8, the data is sent in UTF-8,
and you're done.
(5) You can then just reject anything that comes in as malformed UTF-8 from the
previous step.
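
Step (5) could look something like the sketch below. It assumes the mbstring extension is loaded; the helper name is hypothetical, and what you do on rejection (error message, redisplay the form) is up to you:

```php
<?php
// Hypothetical helper: returns the submitted text if it is well-formed
// UTF-8, or null so the caller can reject the submission.
function accept_utf8_or_null(string $input): ?string
{
    // mb_check_encoding() requires the mbstring extension.
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

// "caf\xC3\xA9" is "café" in UTF-8; "caf\xE9" is the same word in
// ISO-8859-15 and is malformed when read as UTF-8.
var_dump(accept_utf8_or_null("caf\xC3\xA9") !== null); // bool(true)
var_dump(accept_utf8_or_null("caf\xE9") !== null);     // bool(false)
```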

Consider:

Two scripts, one to output iso-8859-15 and the other Codepage 1252 (with the
dread Smart Quotes and all):

<?php header('Content-Type: text/html; charset=ISO-8859-15'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Characters to copy</title>
</head>
<body>
<pre>
<?php
// Print every printable ISO-8859-15 character, 16 per line.
$n = 0;
for ($i = 32; $i <= 255; $i++)
{
    // Skip DEL (127) and the non-printable range 128-159.
    if ($i >= 127 && $i <= 159)
        continue;

    print htmlspecialchars(chr($i), ENT_COMPAT, 'ISO-8859-15');
    if ($n++ % 16 == 15) print "\n";
}
?>
</pre>
</body>
</html>


<?php header('Content-Type: text/html; charset=Windows-1252'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Characters to copy</title>
</head>
<body>
<pre>
<?php
// Print every byte value from 32 upwards as Codepage 1252, 16 per line.
// The 128-159 range is deliberately included: that is where the smart
// quotes live.
$n = 0;
for ($i = 32; $i <= 255; $i++)
{
    print htmlspecialchars(chr($i), ENT_COMPAT, 'cp1252');
    if ($n++ % 16 == 15) print "\n";
}
?>
</pre>
</body>
</html>


Then utf8form.php, put text in, print back out encoded as utf-8:

<?php header('Content-type: text/html; charset=utf-8'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Outputting</title>
</head>
<body>
<pre>
<?php
// Echo the submitted text back, escaped for HTML output.
if (isset($_POST['text']))
{
    print htmlspecialchars($_POST['text'], ENT_COMPAT, 'UTF-8');
}
?>
</pre>

<form method="post" action="utf8form.php" accept-charset="utf-8">
<textarea name="text"></textarea>
<input type="submit">
</form>

</body>
</html>


In Firefox and IE6, this appears to work for me: I copied all of the output
from each of the first two pages (iso-8859-15 and Codepage 1252), pasted it
into the form page and submitted it. The output is the same set of
characters, but UTF-8 encoded.

Also worked from other character set encodings; found a page encoded in
Shift-JIS and repeated the steps. The output looked the same to me (although I
can't read Japanese).



OK - that's the purist approach, when all the tools in the chain are
apparently handling encodings properly.

But are you after some more pragmatic approach, something like:

"The data my users send is probably iso8859-1, iso8859-15, codepage 1252, or
maybe utf-8, but it's likely been copied and mangled between applications so I
can't reliably tell which. How do I clean this data up in a reasonable way so
it can be converted to UTF8 for presentation on a UTF8 encoded page?"

If all the data has values <=127 then it's easy - that's all plain ASCII which
is a common subset of all four character sets.

You can at least rule out UTF-8 by using the functions posted in previous
threads that look for malformed UTF-8. If there are characters >127 and it
all validates as UTF-8, then the more such characters there are, the more
likely it is to be UTF-8, but it's still not certain.

If it fails validation, you've narrowed it down to one of the three
single-byte character sets.

Then the major differences are:

Codepage 1252 has printable characters in the range 128-159 (with a few
gaps) whereas the iso8859 encodings only have non-printable characters there. So
if there's data in this range, odds are it's Codepage 1252 - so you can convert
it to UTF-8 from there.

This range holds the angled "smart" quotes, and the em-dash, which are the
characters that cause the most trouble. So alternatively, you could convert
them to plain quotes and dashes if you wanted.
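
A rough sketch of that alternative, assuming the input has already been identified as Codepage 1252. The function name is invented, and the list of bytes is just the usual troublemakers rather than the whole 128-159 range:

```php
<?php
// Hypothetical helper: replaces the Codepage 1252 "smart" punctuation
// bytes with plain ASCII equivalents. Only sensible on cp1252 input.
function plain_punctuation_cp1252(string $bytes): string
{
    // With an array argument, strtr() maps each key to its value.
    return strtr($bytes, [
        "\x91" => "'",   // left single quote
        "\x92" => "'",   // right single quote / apostrophe
        "\x93" => '"',   // left double quote
        "\x94" => '"',   // right double quote
        "\x96" => '-',   // en dash
        "\x97" => '-',   // em dash
        "\x85" => '...', // ellipsis
    ]);
}

echo plain_punctuation_cp1252("\x93Hi\x94 \x97 it\x92s"), "\n"; // "Hi" - it's
```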

If there are no characters in that range, then you haven't ruled out 1252, but
the rest of the encoding is pretty similar between 1252, iso8859-1 and
iso8859-15.

See http://en.wikipedia.org/wiki/ISO_8859-15 for the differences between -1
and -15. The character most worth worrying about is the Euro, which is at 0xA4
in iso8859-15 and somewhere else again in 1252 (0x80, in the 128-159 range).
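
Putting the heuristic above together, a best-guess converter might look like this sketch. It assumes mbstring is available; the function name is hypothetical, and falling back to ISO-8859-15 rather than ISO-8859-1 when nothing else matches is my own assumption, so pick whichever your users are more likely to send:

```php
<?php
// Hypothetical helper: guesses the encoding of raw form input and
// returns it as UTF-8. A heuristic, so it can guess wrong.
function to_utf8_best_guess(string $bytes): string
{
    // No byte above 127: plain ASCII, identical in all four encodings.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 0) {
        return $bytes;
    }

    // High bytes that form well-formed UTF-8: very likely UTF-8 already.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Bytes in 128-159 are printable only in Codepage 1252.
    // Otherwise fall back to ISO-8859-15 (an assumption, see above).
    $source = preg_match('/[\x80-\x9F]/', $bytes) === 1
        ? 'Windows-1252'
        : 'ISO-8859-15';

    return mb_convert_encoding($bytes, 'UTF-8', $source);
}

// cp1252 smart quotes become U+201C and U+201D.
echo bin2hex(to_utf8_best_guess("\x93x\x94")), "\n"; // e2809c78e2809d
```

Bytes that Codepage 1252 leaves undefined are not handled specially here; how they come out depends on the conversion library.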

Is this any help?

--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

#9
11-06-2004
Pikkel

Andy Hassall wrote:

> [...]
>
> Is this any help?


It's useful information and I'll remember this. Thank you.
But it doesn't answer my question of whether there is a function which
converts characters with accents, umlauts and so on to characters without.

#10
11-07-2004
Andy Hassall

On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pikkel@de.wop> wrote:

>It's useful information and I'll remember this. Thank you.
>But it doesn't answer my question of whether there is a function which
>converts characters with accents, umlauts and so on to characters without.


True, it's drifted a bit to answer lawrence's questions.

As far as your question goes - no, there isn't a built-in function; you'd have
to write one. In order to do so, you have to be a lot more specific about the
character encodings you're using, which characters you want to convert to what,
and exactly what "and so on" means in your last sentence.
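
Just to show the shape such a function could take, here is a sketch that assumes the text is UTF-8 and that only the handful of characters listed matter to you. The function name and the choice of replacements (e.g. "ss" for the sharp s) are illustrative, not a standard:

```php
<?php
// Hypothetical helper: strips accents from UTF-8 text using a
// hand-written table of multi-byte sequences. Deliberately incomplete.
function strip_accents_utf8(string $text): string
{
    static $map = [
        "\xC3\xA0" => 'a',  // a grave
        "\xC3\xA1" => 'a',  // a acute
        "\xC3\xA4" => 'a',  // a umlaut
        "\xC3\xA7" => 'c',  // c cedilla
        "\xC3\xA8" => 'e',  // e grave
        "\xC3\xA9" => 'e',  // e acute
        "\xC3\xB1" => 'n',  // n tilde
        "\xC3\xB6" => 'o',  // o umlaut
        "\xC3\xBC" => 'u',  // u umlaut
        "\xC3\x9F" => 'ss', // sharp s
    ];

    // The array form of strtr() matches whole byte sequences, so
    // multi-byte UTF-8 characters are replaced as units.
    return strtr($text, $map);
}

echo strip_accents_utf8("Stra\xC3\x9Fe, caf\xC3\xA9"), "\n"; // Strasse, cafe
```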

--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool