Html-encode all characters not in the current character set

  #1 (permalink)  
Old 04-26-2007
Timothy Madden
 

Hello

Is there a function that will allow me to
output text written in utf-8 (from db for example)
if my document has

Content-Type: text/html; charset=ISO-8859-1

I mean htmlspecialchars() and htmlentities() will only convert
characters that have an associated entity defined in HTML.
I would also like to translate all non-latin1 characters using
numeric references.

&#355; is for a Romanian letter, ţ, for example, and letter ţ
written in UTF-8 is not translated by htmlentities(), even if
I give the function the optional character-set argument, 'UTF-8'
(you can actually see the letter I typed if your system and your
news reader understand and can display ISO latin 2 characters,
encoded in utf-8).
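
A minimal reproduction of the behaviour described above, assuming PHP 7 or later and the HTML 4.01 entity table (the two sample characters are illustrative only):

```php
<?php
// Sample input (illustrative): "ţ" is U+0163 and "é" is U+00E9, both as UTF-8.
$t = "\u{0163}";
$e = "\u{00E9}";

// HTML 4.01 defines a named entity for "é" but none for "ţ",
// so htmlentities() should convert the first and pass the second through.
$flags = ENT_QUOTES | ENT_HTML401;
$encodedE = htmlentities($e, $flags, 'UTF-8');
$encodedT = htmlentities($t, $flags, 'UTF-8');

echo $encodedE, "\n";          // the named entity for "é"
echo bin2hex($encodedT), "\n"; // still the raw UTF-8 bytes of "ţ"
```

So the character-set argument only tells htmlentities() how to read the input; it does not make it emit numeric references.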

I mean HTML documents can use characters in the entire UNICODE
set, even if the document source is written in ASCII for example,
by encoding any non-ASCII character with HTML entities.

Is there a PHP function that will HTML-encode all non-ASCII
characters, or all non-Latin-1 characters, or all characters not in the
source character set?

Thank you,
Timothy Madden
  #2 (permalink)  
Old 04-26-2007
Vince Morgan
 

"Timothy Madden" <terminatorul@gmail.com> wrote in message
news:4630999f$0$90266$14726298@news.sunsite.dk...
> Hello
>
> Is there a function that will allow me to
> output text written in utf-8 (from db for example)
> if my document has
>
> Content-Type: text/html; charset=ISO-8859-1
>
> I mean htmlspecialchars() and htmlentities() will only convert
> characters that have an associated entity defined in HTML.
> I would also like to translate all non-latin1 characters using
> numeric references.
>
> &#355; is for a Romanian letter, ţ, for example, and letter ţ
> written in UTF-8 is not translated by htmlentities(), even if
> I give the function the optional character-set argument, 'UTF-8'
> (you can actually see the letter I typed if your system and your
> news reader understand and can display ISO latin 2 characters,
> encoded in utf-8).
>
> I mean HTML documents can use characters in the entire UNICODE
> set, even if the document source is written in ASCII for example,
> by encoding any non-ASCII character with HTML entities.
>
> Is there in PHP a function that will encode in HTML all non-ASCII
> characters, or all non-latin1 characters, or all characters not in the
> source character set ?


You may find the following link useful:
http://au.php.net/utf8-decode
HTH
Vince


  #3 (permalink)  
Old 04-26-2007
shimmyshack
 

On Apr 26, 1:23 pm, Timothy Madden <terminato...@gmail.com> wrote:
> Hello
>
> Is there a function that will allow me to
> output text written in utf-8 (from db for example)
> if my document has
>
> Content-Type: text/html; charset=ISO-8859-1
>
> I mean htmlspecialchars() and htmlentities() will only convert
> characters that have an associated entity defined in HTML.
> I would also like to translate all non-latin1 characters using
> numeric references.
>
> &#355; is for a Romanian letter, ţ, for example, and letter ţ
> written in UTF-8 is not translated by htmlentities(), even if
> I give the function the optional character-set argument, 'UTF-8'
> (you can actually see the letter I typed if your system and your
> news reader understand and can display ISO latin 2 characters,
> encoded in utf-8).
>
> I mean HTML documents can use characters in the entire UNICODE
> set, even if the document source is written in ASCII for example,
> by encoding any non-ASCII character with HTML entities.
>
> Is there in PHP a function that will encode in HTML all non-ASCII
> characters, or all non-latin1 characters, or all characters not in the
> source character set ?
>
> Thank you,
> Timothy Madden


also mb_convert_encoding()

  #4 (permalink)  
Old 04-27-2007
Willem Bogaerts
 

> Is there a function that will allow me to
> output text written in utf-8 (from db for example)
> if my document has
>
> Content-Type: text/html; charset=ISO-8859-1
>
> I mean htmlspecialchars() and htmlentities() will only convert
> characters that have an associated entity defined in HTML.
> I would also like to translate all non-latin1 characters using
> numeric references.


There are two terms of interest here: "character set" and "encoding".

ISO-8859-1 is an encoding that only covers a limited character set. So
there is no euro sign, for example. The bad thing about ISO-8859-1 is
that some programs silently replace it with cp-1252, which is similar
but not exactly the same (it does have a euro sign).
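
To illustrate the difference, a small sketch (assuming the mbstring extension is available; the single test byte 0x80 is just an example) that decodes the same byte under both encodings:

```php
<?php
// The same byte means different things in the two encodings.
$byte = "\x80";

// ISO-8859-1 maps 0x80 to the control character U+0080.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Windows-1252 (cp-1252) maps 0x80 to the euro sign, U+20AC.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c280   (UTF-8 for U+0080)
echo bin2hex($asCp1252), "\n"; // e282ac (UTF-8 for U+20AC)
```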


> &#355; is for a Romanian letter, ţ, for example, and letter ţ
> written in UTF-8 is not translated by htmlentities(), even if
> I give the function the optional character-set argument, 'UTF-8'
> (you can actually see the letter I typed if your system and your
> news reader understand and can display ISO latin 2 characters,
> encoded in utf-8).


So you want to encode characters that are NOT in the character set you
explicitly state. If you do want those characters, why do you state an
encoding that does not cover them? If you do want those characters, use
a character set that does have them (like unicode) and an encoding that
covers them (utf-8 is fairly common).

> I mean HTML documents can use characters in the entire UNICODE
> set, even if the document source is written in ASCII for example,
> by encoding any non-ASCII character with HTML entities.


Are you sure about that?

> Is there in PHP a function that will encode in HTML all non-ASCII
> characters, or all non-latin1 characters, or all characters not in the
> source character set ?


The htmlentities function does have an encoding parameter, but you have
already used that. As for numeric entities, I expect them to be
encoding-specific.

Best regards,
--
Willem Bogaerts

Application smith
Kratz B.V.
http://www.kratz.nl/
  #5 (permalink)  
Old 04-27-2007
Timothy Madden
 

Willem Bogaerts wrote:
>> Is there a function that will allow me to
>> output text written in utf-8 (from db for example)
>> if my document has
>>
>> Content-Type: text/html; charset=ISO-8859-1
>>
>> I mean htmlspecialchars() and htmlentities() will only convert
>> characters that have an associated entity defined in HTML.
>> I would also like to translate all non-latin1 characters using
>> numeric references.

>
> There are two terms of interest here: "character set" and "encoding".
>
> ISO-8859-1 is an encoding that only covers a limited character set. So
> there is no euro sign, for example. The bad thing about ISO-8859-1 is
> that some programs silently replace it with cp-1252, which is similar
> but not exactly the same (it does have a euro sign).
>
>
>> &#355; is for a Romanian letter, ţ, for example, and letter ţ
>> written in UTF-8 is not translated by htmlentities(), even if
>> I give the function the optional character-set argument, 'UTF-8'
>> (you can actually see the letter I typed if your system and your
>> news reader understand and can display ISO latin 2 characters,
>> encoded in utf-8).

>
> So you want to encode characters that are NOT in the character set you
> explicitly state. If you do want those characters, why do you state an
> encoding that does not cover them? If you do want those characters, use
> a character set that does have them (like unicode) and an encoding that
> covers them (utf-8 is fairly common).
>
>> I mean HTML documents can use characters in the entire UNICODE
>> set, even if the document source is written in ASCII for example,
>> by encoding any non-ASCII character with HTML entities.

>
> Are you sure about that?
>
>> Is there in PHP a function that will encode in HTML all non-ASCII
>> characters, or all non-latin1 characters, or all characters not in the
>> source character set ?

>
> The htmlentities function does have an encoding parameter, but you have
> already used that. As for numeric entities, I expect them to be
> encoding-specific.
>
> Best regards,


As far as I know, ISO-8859-1 is a set (of characters).

As you can see in the official HTML 4.01 specification
http://www.w3.org/TR/html401/charset.html#h-5.1
all HTML documents use the UCS defined by ISO 10646, which is
character-by-character equivalent to Unicode.

Numeric character references can be used whatever encoding
you choose for your document source, and they always refer to
characters in UCS by their code position.
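
A quick check of that claim, as a sketch assuming PHP 7.2 or later with mbstring (the variable names are mine): decoding the reference yields the character at code position 355, whether it is written in decimal or hexadecimal.

```php
<?php
// Decode a decimal and a hexadecimal reference to the same character.
$flags = ENT_QUOTES | ENT_HTML401;
$fromDecimal = html_entity_decode('&#355;', $flags, 'UTF-8');
$fromHex     = html_entity_decode('&#x163;', $flags, 'UTF-8');

// The number in the reference is the UCS code position, not a byte value
// in the document's own encoding.
$codePoint = mb_ord($fromDecimal, 'UTF-8');

echo bin2hex($fromDecimal), "\n"; // c5a3 (UTF-8 for U+0163)
echo $codePoint, "\n";            // 355
```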

Timothy Madden,
Romania
  #6 (permalink)  
Old 04-27-2007
Timothy Madden
 

shimmyshack wrote:
> On Apr 26, 1:23 pm, Timothy Madden <terminato...@gmail.com> wrote:
>> Hello
>>
>> Is there a function that will allow me to
>> output text written in utf-8 (from db for example)
>> if my document has
>>
>> Content-Type: text/html; charset=ISO-8859-1
>>
>> I mean htmlspecialchars() and htmlentities() will only convert
>> characters that have an associated entity defined in HTML.
>> I would also like to translate all non-latin1 characters using
>> numeric references.
>>

[...]
>> Thank you,
>> Timothy Madden

>
> also mb_convert_encoding()
>


Actually I think mb_encode_numericentity() is the function I need.

mb_convert_encoding() will just re-encode a string from one encoding
to another, but a Latin-1 source simply cannot include Latin-2
characters no matter what encoding I choose. I need numeric character
references defined by HTML for that.
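
A short sketch of what I mean, assuming mbstring and with the substitute character set explicitly to "?" so the result does not depend on local configuration:

```php
<?php
// Make the replacement for unrepresentable characters explicit: "?" (0x3F).
mb_substitute_character(0x3F);

$t = "\u{0163}"; // "ţ" as UTF-8

// Latin-1 has no slot for "ţ", so the character is lost in the conversion.
$latin1 = mb_convert_encoding($t, 'ISO-8859-1', 'UTF-8');

// Latin-2 does have it, at byte 0xFE.
$latin2 = mb_convert_encoding($t, 'ISO-8859-2', 'UTF-8');

echo $latin1, "\n";          // ?
echo bin2hex($latin2), "\n"; // fe
```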

Anyway I think mb_encode_numericentity() will work, I just need to know
how to create a map of code-point areas for it to work, and I don't
quite understand how such an area is defined.

Thank you,
Timothy Madden,
Romania
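
For reference, a minimal sketch of how such a map might look. The helper name html_for_latin1() is hypothetical, and the choice to escape markup characters first with htmlspecialchars() is an assumption about the intended use, not something stated in the thread. Each group of four integers in the map is: first code point, last code point, offset, mask. A character inside the range is replaced by a reference to (code point + offset) & mask.

```php
<?php
// Hypothetical helper: turn UTF-8 text into HTML that is safe to send
// in a page declared as charset=ISO-8859-1.
function html_for_latin1(string $utf8): string
{
    // One area covering everything above Latin-1 (U+0100 .. U+10FFFF).
    // Offset 0 and a mask wide enough for all code points leave the
    // numeric value unchanged.
    $convmap = [0x100, 0x10FFFF, 0, 0x1FFFFF];

    // Escape &, <, > and quotes first, so the "&" of the numeric
    // references added next is not escaped again.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

    // Replace every character in the mapped area with &#N;.
    $encoded = mb_encode_numericentity($escaped, $convmap, 'UTF-8');

    // What remains fits in Latin-1, so this conversion loses nothing.
    return mb_convert_encoding($encoded, 'ISO-8859-1', 'UTF-8');
}

// "<b>Mulţumesc</b> café" as UTF-8.
$result = html_for_latin1("<b>Mul\u{0163}umesc</b> caf\u{00E9}");
echo bin2hex($result), "\n";
```

Changing the first element of the map to 0x80 would instead encode every non-ASCII character, which makes the output independent of the declared charset.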