UTF-8 not decoding

  #1 (permalink)  
Old 08-12-2004
steve
 

Hi,
I am opening a stream that is UTF-8 encoded. I use fgetc to read the
stream, which is binary safe, and I append every character read to a string.

But when I look at the stream, I see some characters shown as a bunch of
"?" question marks, and utf8_decode has no effect on them
either.

How do you go about decoding UTF-8? Does appending the characters to the
string somehow mess it up? Please help. Running PHP 4.3.4 on Windows.

--
http://www.dbForumz.com/ This article was posted by author's request
Articles individually checked for conformance to usenet standards
Topic URL: http://www.dbForumz.com/PHP-UTF-deco...ict138860.html
Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbForumz.com/eform.php?p=464220
  #2 (permalink)  
Old 08-12-2004
Chung Leong
 

"steve" <UseLinkToEmail@dbForumz.com> wrote in message
news:411ab511$1_7@news.athenanews.com...
> Hi,
> I am opening a stream that is UTF encoded. I use fgetc to read the
> stream- which is binary safe. I add every character read to a string.
>
>
> But when I look at the stream, I see some characters with a bunch of
> "?" question markets, and then utf8_decode has no effect on it
> either.


Question marks mean that there are Unicode characters that aren't found
within the current code page. Basically the characters are there; they're
just displayed as ?s.

utf8_decode() does have an effect: it replaces characters outside of
ISO-8859-1 with question marks.
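
To make that concrete, here is a minimal sketch of the replacement behaviour. The sample strings are made up for illustration, and it assumes a PHP build where utf8_decode() is still available (it is deprecated as of PHP 8.2, so newer versions will emit a deprecation notice).

```php
<?php
// "é" (U+00E9) exists in ISO-8859-1, so it survives as the single byte 0xE9.
$latin = utf8_decode("caf\xC3\xA9");

// "€" (U+20AC) does not exist in ISO-8859-1, so it is replaced by "?".
$euro = utf8_decode("\xE2\x82\xAC");

echo bin2hex($latin), "\n"; // 636166e9
echo $euro, "\n";           // ?
```

So a "?" in the output can be a real substitution made by utf8_decode(), not just a display problem.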

> How do you go about decoding utf. Does adding the characters to the
> string somehow mess it up. Please help. Running 4.3.4 PHP on Win.


The question is, what do you mean by decoding UTF-8? Reading UTF-8 text with
fgetc is not a good idea: fgetc returns one byte at a time, and one Unicode
character can span multiple bytes.
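
A small sketch of the byte-versus-character point, assuming a PHP version that supports the php://memory stream wrapper (the sample text and variable names are only for illustration):

```php
<?php
// "café" in UTF-8: 4 characters, but 5 bytes, because "é" is 0xC3 0xA9.
$text = "caf\xC3\xA9";

// Put the text into an in-memory stream so it can be read with fgetc().
$stream = fopen('php://memory', 'r+');
fwrite($stream, $text);
rewind($stream);

$pieces = array(); // hex value of every unit fgetc() hands back
$joined = '';      // the units appended to a string, as in the question
while (($byte = fgetc($stream)) !== false) {
    $pieces[] = bin2hex($byte);
    $joined .= $byte;
}
fclose($stream);

echo implode(' ', $pieces), "\n";                        // 63 61 66 c3 a9
echo ($joined === $text) ? "intact\n" : "corrupted\n";   // intact
```

Appending the bytes to a string does not damage the data; the catch is only that a single fgetc() result may be half of a character, so it should not be inspected or converted on its own.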


  #3 (permalink)  
Old 08-12-2004
steve
 

"Chung Leong" wrote:
> Question marks mean that there are Unicode characters that aren’t found
> within the current code page. Basically the characters are there; they’re
> just displayed as ?s.
>
> utf8_decode() does have an effect: it replaces characters outside of
> ISO-8859-1 with question marks.
>
> The question is, what do you mean by decoding UTF-8? Using fgetc on UTF-8
> text is not a good idea, since one Unicode character can span multiple
> bytes.


Thanks, Chung. I am interested in decoding usenet message headers that
look like this:
"=?Utf-8?B?YmVsZGVyYXo=?="

  #4 (permalink)  
Old 08-12-2004
steve
 

"steve" wrote:
> Thanks, Chung. I am interested in decoding usenet message headers that
> look like this:
> "=?Utf-8?B?YmVsZGVyYXo=?="

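Ok, figured it out. Take a string like this:
$instr = "=?Utf-8?B?YmVsZGVyYXo=?=";

and feed it as the argument to this function:
function decode_subject( $instr ) {
    $enstr = $instr;
    // Decode one encoded word per pass until none are left.
    while( preg_match(
        '/^([^?]+)?=\?[^?]+\?(B|Q)\?([^?]+)=?=?\?=(.+)?$/i', $enstr,
        $match ) ) {
        // Text after the encoded word; the group is unset when nothing follows.
        $rest = isset($match[4]) ? $match[4] : '';
        if( $match[2] == 'b' || $match[2] == 'B' )
            $enstr = $match[1] . base64_decode( $match[3] ) . $rest;
        else
            // Q encoding: "_" stands for a space, the rest is quoted-printable.
            $enstr = $match[1] .
                quoted_printable_decode( str_replace( '_', ' ', $match[3] ) ) .
                $rest;
    }
    return( $enstr );
}

and it will return the decoded text ("belderaz" for the string above). The
result is in the character set named in the header, so it is only plain
ASCII when the encoded text was.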
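
For anyone puzzled by the header format itself: these are RFC 2047 "encoded words", laid out as =?charset?encoding?text?=, where the encoding is B (base64) or Q (a quoted-printable variant). Below is a minimal sketch that splits a single encoded word into those fields. parse_encoded_word() is a hypothetical helper written for this illustration, not part of PHP or of the newsreader, and it deliberately handles only one word with nothing around it.

```php
<?php
// Hypothetical helper: split one RFC 2047 encoded word into its fields and
// decode the payload. Returns null if $word is not a single encoded word.
function parse_encoded_word($word) {
    if (!preg_match('/^=\?([^?]+)\?([BQ])\?([^?]*)\?=$/i', $word, $m)) {
        return null;
    }
    $encoding = strtoupper($m[2]);
    if ($encoding === 'B') {
        $bytes = base64_decode($m[3]);
    } else {
        // In Q encoding "_" stands for a space; the rest is quoted-printable.
        $bytes = quoted_printable_decode(str_replace('_', ' ', $m[3]));
    }
    return array('charset' => $m[1], 'encoding' => $encoding, 'bytes' => $bytes);
}

$parts = parse_encoded_word("=?Utf-8?B?YmVsZGVyYXo=?=");
echo $parts['charset'], ' ', $parts['encoding'], ' ', $parts['bytes'], "\n";
// Utf-8 B belderaz
```

The decoded bytes are still in the charset named in the first field, so a further conversion step would be needed if the output has to be in a different encoding.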

The function is included in: PHP Newsreader
http://pnews.sourceforge.net/
