how to tell server from PHP that charset is UTF-8??
How do I get PHP to tell the server that when I echo text to the screen, I need the text to be sent as UTF-8? How does Apache know the right encoding when all the text is being generated by PHP?

If I build a content management system (I have) and I make sure that all input is encoded as UTF-8, how will the server know that the text in the MySQL database is UTF-8?

I'm taking all user input and using this function on the input:

http://us4.php.net/manual/en/function.utf8-encode.php

I'm doing this so I can output to XML without getting errors about "You should not sent plain text".

But how will the server know how to serve these pages? How do I tell it from PHP? I realize I can send an http-equiv tag, but that's rather weak, right?

Is this enough? Any conflicts with Apache?

    $sent = headers_sent();
    if (!$sent) header("Content-type:text/html;charset:UTF-8");
On 4 Sep 2004 09:08:41 -0700, lkrubner@geocities.com (lawrence) wrote:
>How do I get PHP to tell the server that when I echo text to the
>screen, I need for the text to be sent as UTF-8?

Send a content-type header with a charset attribute.

> How does Apache know
>the right encoding when all the text is being generated by PHP?

It doesn't, nor does it need to - that information's just for the end user.

> If I
>build a content management system (I have) and I make sure that all
>input is encoded as UTF-8, how will the
>server know that the text in the MySql database is UTF-8?
>
>I'm taking all user input and using this function on the input:
>
>http://us4.php.net/manual/en/function.utf8-encode.php
>
>I'm doing this so I can output to XML without getting errors about
>"You should not sent plain text".

Don't know what you mean here. XML content doesn't have to be UTF-8 encoded, just properly escaped and the encoding set correctly.

>But how will the server know how to serve these pages? How do I tell
>it from PHP? I realize I can send a http equiv tag, but that's rather
>weak, right?

Yep.

>Is this enough? Any conflicts with Apache?
>
> $sent = headers_sent();
> if (!$sent) header("Content-type:text/html;charset:UTF-8");

Shouldn't the : after charset be an = sign? i.e.

    Content-type: text/html; charset=utf-8

That would be enough, provided it's actually sent (i.e. $sent is false).

-- 
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
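
For readers who want the corrected header in one place, here is a minimal sketch, assuming PHP 7 or later. The function names `content_type_header` and `send_content_type` are made up for illustration and are not part of PHP; only `header()`, `headers_sent()` and `sprintf()` are real built-ins.

```php
<?php
// Hypothetical helper: builds the header line discussed in this thread.
function content_type_header(string $mediaType, string $charset): string
{
    // Note the "=" after charset; "charset:UTF-8" is not a valid parameter.
    return sprintf('Content-Type: %s; charset=%s', $mediaType, $charset);
}

// Hypothetical helper: sends the header only while that is still possible.
function send_content_type(string $mediaType = 'text/html', string $charset = 'UTF-8'): bool
{
    if (headers_sent()) {
        return false; // too late: output has already started
    }
    header(content_type_header($mediaType, $charset));
    return true;
}
```

Calling `send_content_type()` before any `echo` would be the typical use. A `false` return value means some output had already gone out, so the header could not be set.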
Andy Hassall <andy@andyh.co.uk> wrote in message news:<4iqjj0taek5o0ni8ck050tam046bq6tn8o@4ax.com>...
> On 4 Sep 2004 09:08:41 -0700, lkrubner@geocities.com (lawrence) wrote:
>
> >How do I get PHP to tell the server that when I echo text to the
> >screen, I need for the text to be sent as UTF-8?
>
> Send a content-type header with a charset attribute.
>
> > How does Apache know
> >the right encoding when all the text is being generated by PHP?
>
> It doesn't, nor does it need to - that information's just for the end user.

I'm not sure if I follow you here. Yes, the information is for the end user, or rather, the web browser (or other UA) that the end user is using. But something has to send that information out from the webserver. Normally Apache has some idea what it is dealing with, and sends some kind of info, yes? A weaker solution is to send a meta http-equiv tag specifying the charset. But something somewhere has to send that info. If the web server has no way to know the charset because all the characters are being generated by PHP, then PHP should send a charset header, yes?

By the way, in general, when you use echo or print in PHP, what is the charset of the text being generated? Raw ASCII?

> >I'm doing this so I can output to XML without getting errors about
> >"You should not sent plain text".
>
> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
> just properly escaped and the encoding set correctly.

Let's put it this way. Right now users can input whatever the hell they want. Sometimes they write an essay in Microsoft Word and then copy and paste the text to the input form, and input that as a weblog entry. That post then gets added to the RSS feed for that weblog. At first I tried to write my RSS output using Plain Text, but most validators throw an error at that (all but radioland's). So I need to give it a charset. So I decided to give all outgoing XML the charset of UTF-8. Then I immediately started getting errors because lots of users had input stuff that was not UTF-8.

So what I need to do is take all input and cast it to UTF-8. If that happens to change some characters to garbage characters, that is fine - that throws the problem back at the user, which is where I want it. I merely need to let them see that they are being idiots. I'll tell them they need to save any text from Microsoft Word as plain text. Once they start doing that, then they won't get garbage characters and the software will output valid XML and RSS.

> >Is this enough? Any conflicts with Apache?
> >
> > $sent = headers_sent();
> > if (!$sent) header("Content-type:text/html;charset:UTF-8");
>
> Shouldn't the : after charset be an = sign? i.e.
>
> Content-type: text/html; charset=utf-8
>
> That would be enough, provided it's actually sent (i.e. $sent is false).

Thanks for catching the bit about the equal sign.
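
Andy's point that XML only needs to be "properly escaped and the encoding set correctly" can be sketched roughly as follows, assuming PHP 7 or later. The function name `xml_text` is hypothetical; `htmlspecialchars()` and the `ENT_XML1` flag are real PHP built-ins.

```php
<?php
// Hypothetical helper: escapes text for use inside an XML/RSS element,
// assuming the text is already UTF-8.
function xml_text(string $utf8Text): string
{
    // With the charset given as UTF-8 and no ENT_SUBSTITUTE/ENT_IGNORE flag,
    // htmlspecialchars() returns an empty string for input that is not
    // valid UTF-8, which makes badly encoded input easy to spot.
    return htmlspecialchars($utf8Text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}
```

An empty result for non-empty input is then a sign that the text still needs converting, rather than something to write into the feed.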
try header('content-type:text/html; charset=UTF-8');
-- 
Tony Marston
http://www.tonymarston.net
"Tony Marston" <tony@NOSPAM.demon.co.uk> wrote in message news:<ci279j$b7s$1$830fa795@news.demon.co.uk>...
> try header('content-type:text/html; charset=UTF-8');

The only difference I see in what you wrote is that "content" starts with a lower case "c". Are you saying these headers are case sensitive?
"lawrence" <lkrubner@geocities.com> wrote in message news:da7e68e8.0409171649.3486795a@posting.google.com...
> "Tony Marston" <tony@NOSPAM.demon.co.uk> wrote in message
> news:<ci279j$b7s$1$830fa795@news.demon.co.uk>...
>> try header('content-type:text/html; charset=UTF-8');
>
> The only difference I see in what you wrote is that "content" starts
> with a lower case "c". Are you saying these headers are case
> sensitive?

No, but that is what I use and it works.

-- 
Tony Marston
http://www.tonymarston.net
On 12 Sep 2004 11:14:10 -0700, lkrubner@geocities.com (lawrence) wrote:
>Andy Hassall <andy@andyh.co.uk> wrote in message news:<4iqjj0taek5o0ni8ck050tam046bq6tn8o@4ax.com>...
>> On 4 Sep 2004 09:08:41 -0700, lkrubner@geocities.com (lawrence) wrote:
>>
>> >How do I get PHP to tell the server that when I echo text to the
>> >screen, I need for the text to be sent as UTF-8?
>>
>> Send a content-type header with a charset attribute.
>>
>>> How does Apache know
>>>the right encoding when all the text is being generated by PHP?
>>
>> It doesn't, nor does it need to - that information's just for the end user.
>
>I'm not sure if I follow you here. Yes, the information is for the end
>user, or rather, the web browser (or other ua) that the end user is
>using. But something has to send that information out from the
>webserver. Normally Apache has some idea what it is dealing with, and
>sends some kind of info, yes?

It may send a Content-type determined by the MIME type for the extension, or looked up through mime-magic, but it generally doesn't know the character set, and to my knowledge Apache won't send the character set part of the header itself - it just sends 'data' in a character-set agnostic way.

You can set it up so that Apache sends a character set header with content negotiation settings, but you need to provide the server with more information in that case.

>A weaker solution is send a meta
>http-equiv tag specifying the charset. But something somewhere has to
>send that info. If the web server has no way to know the charset
>because all the characters are being generated by PHP, the PHP should
>send a charset header, yes?

Yes. There's an option in php.ini as to which character set to default to - I think the default default is iso8859-1. (Although it really ought to be iso8859-15 due to the Euro.)

>By the way, in general, when you use echo or print in PHP, what is the
>charset of the text being generated? Raw ASCII?

(ASCII only goes up to 127.)

How the output is interpreted depends on what Content-type header has been sent. PHP won't do any conversion of the binary representation of anything output; it's just sent as-is. (It might be image data, for example, if you've sent an image/jpeg content-type header.)

>> >I'm doing this so I can output to XML without getting errors about
>> >"You should not sent plain text".
>>
>> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
>> just properly escaped and the encoding set correctly.
>
>Let's put it this way. Right now users can input whatever the hell
>they want. Sometimes they write an essay in Microsoft Word and then
>copy and paste the text to the input form, and input that as a weblog
>entry. That post then gets added to the RSS feed for that weblog. At
>first I tried to write my RSS output using Plain Text, but most
>validators throw an error at that (all but radioland's). So I need to
>give it a charset. So I decided to give all outgoing XML the charset
>of UTF-8. Then I immediately started getting errors because lots of
>users had input stuff that was not UTF-8. So what I need to do is take
>all input and cast it to UTF-8. If that happens to change some
>characters to garbage characters, that is fine - that throws the
>problem back at the user, which is where I want it. I merely need to
>let them see that they are being idiots. I'll tell them they need to
>save any text from Microsoft Word as plain text. Once they start doing
>that, then they won't get garbage characters and the software will
>output valid XML and RSS.

OK, but you might have a piece of the puzzle missing here - you need to determine what character set the user posted in in the first place, since it's impossible to convert from an encoding of one character set to an encoding of another one without knowing what the first character set encoding was.

I *think* form data is always in the character set of the page containing the original form. I haven't got a reference to back that up, though.

I also seem to recall that some browsers (e.g. IE) will send HTML entity encoded versions of characters pasted into a form whose character set does not support them; e.g. Chinese characters pasted into an iso8859-15 form turn up in their &#xxxx; representation in the data.

Once you know the source encoding, the mbstring extension has a function for converting between encodings.

-- 
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
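
The conversion step Andy describes might look something like the sketch below. It assumes PHP 7 or later with the mbstring extension loaded, and it assumes the caller already knows the encoding the form was submitted in (Windows-1252 is used here only as an example of what text pasted from Word often turns out to be). The function name `to_utf8` is hypothetical; `mb_check_encoding()` and `mb_convert_encoding()` are real mbstring functions.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, given the encoding the
// form data was submitted in.
function to_utf8(string $text, string $sourceEncoding): string
{
    // Heuristic: if the bytes are already valid UTF-8 (plain ASCII always
    // is), leave them alone to avoid double-encoding.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise convert from the encoding the caller says the text is in.
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}
```

Unlike `utf8_encode()`, which always treats its input as ISO-8859-1, this takes the source encoding as an argument. The validity check is only a heuristic, so a wrong `$sourceEncoding` still produces wrong characters.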
lawrence wrote:
> "Tony Marston" <tony@NOSPAM.demon.co.uk> wrote in message
> news:<ci279j$b7s$1$830fa795@news.demon.co.uk>...
>> try header('content-type:text/html; charset=UTF-8');
>
> The only difference I see in what you wrote is that "content" starts
> with a lower case "c". Are you saying these headers are case
> sensitive?

Hi,

No, the difference between your code and Mr. Marston's is that yours uses a colon after the word "charset" and his uses an equals sign. The equals sign is correct.

Chris
Andy Hassall <andy@andyh.co.uk> wrote in message news:<p4eok059t03jj84ssu1n6tkgped5dfijhv@4ax.com>...
> It may send Content-type determined by the MIME type for the extension, or
> looked up through mime-magic, but it generally doesn't know character set, and
> to my knowledge Apache itself won't send the character set part of the header
> itself - it just sends 'data' in a character-set agnostic way.
>
> You can set it up so that Apache sends a character set header with content
> negotiation settings, though, but you need to provide the server with more
> information in that case.
>
> >A weaker solution is send a meta
> >http-equiv tag specifying the charset. But something somewhere has to
> >send that info. If the web server has no way to know the charset
> >because all the characters are being generated by PHP, the PHP should
> >send a charset header, yes?
>
> Yes. There's an option in php.ini as to which character set to default to - I
> think the default default is iso8859-1. (Although really ought to be iso8859-15
> due to the Euro).

Okay, I don't get this at all. What sends the character encoding information? If you have a set of static HTML files sitting on a server, what is responsible for sending the character encoding? If I, as a web designer, am not supposed to use http-equiv meta tags, because they are weak, then the information is not inside the HTML file. So the information needs to be outside of the HTML file. And what is outside of the HTML file? If Apache remains agnostic about character encoding, then at what point does the character encoding get sent? Where is the information stored, and how is it sent out to web browsers?

Every character has an encoding by default, right? If no encoding is given, then there are a series of possible defaults, right? An Apache server may have a default, or PHP may have a default encoding set in the php.ini file, right? If no default is set anywhere then the characters are basically raw text, right? In other words, ASCII?

Or do I have it all wrong?

> >> >I'm doing this so I can output to XML without getting errors about
> >> >"You should not sent plain text".
> >>
> >> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
> >> just properly escaped and the encoding set correctly.

Sorry, I meant RSS. Most RSS validators throw an error if you try to set up an RSS feed using plain text.

> >Let's put it this way. Right now users can input whatever the hell
> >they want. Sometimes they write an essay in Microsoft Word and then
> >copy and paste the text to the input form, and input that as a weblog
> >entry. That post then gets added to the RSS feed for that weblog. At
> >first I tried to write my RSS output using Plain Text, but most
> >validators throw an error at that (all but radioland's). So I need to
> >give it a charset. So I decided to give all outgoing XML the charset
> >of UTF-8. Then I immediately started getting errors because lots of
> >users had input stuff that was not UTF-8. So what I need to do is take
> >all input and cast it to UTF-8. If that happens to change some
> >characters to garbage characters, that is fine - that throws the
> >problem back at the user, which is where I want it. I merely need to
> >let them see that they are being idiots. I'll tell them they need to
> >save any text from Microsoft Word as plain text. Once they start doing
> >that, then they won't get garbage characters and the software will
> >output valid XML and RSS.
>
> OK, but might have a piece of the puzzle missing here - you need to determine
> what character set the user posted in in the first place, since it's impossible
> to convert from an encoding of one character set to an encoding of another one
> without knowing what the first character set encoding was.
>
> I *think* form data is always in the character set of the page containing the
> original form. I haven't got a reference to back that up, though.

Yes, we had quite a conversation about that over on another newsgroup. It was quite informative. You can read it here, if you've any interest:

http://groups.google.com/groups?hl=e...%3D10%26sa%3DN
On 21 Sep 2004 11:30:45 -0700, lkrubner@geocities.com (lawrence) wrote:
>Andy Hassall <andy@andyh.co.uk> wrote in message news:<p4eok059t03jj84ssu1n6tkgped5dfijhv@4ax.com>...
>> It may send Content-type determined by the MIME type for the extension, or
>> looked up through mime-magic, but it generally doesn't know character set, and
>> to my knowledge Apache itself won't send the character set part of the header
>> itself - it just sends 'data' in a character-set agnostic way.
>>
>> You can set it up so that Apache sends a character set header with content
>> negotiation settings, though, but you need to provide the server with more
>> information in that case.
>>
>> >A weaker solution is send a meta
>> >http-equiv tag specifying the charset. But something somewhere has to
>> >send that info. If the web server has no way to know the charset
>> >because all the characters are being generated by PHP, the PHP should
>> >send a charset header, yes?
>>
>> Yes. There's an option in php.ini as to which character set to default to - I
>> think the default default is iso8859-1. (Although really ought to be iso8859-15
>> due to the Euro).
>
>Okay, I don't get this at all. What sends the character encoding
>information? If you have a set of static HTML files sitting on a
>server, what is responsible for sending the character encoding?

Done a bit more digging, and there's this in my httpd.conf:

    #
    # Specify a default charset for all pages sent out. This is
    # always a good idea and opens the door for future internationalisation
    # of your web site, should you ever want it. Specifying it as
    # a default does little harm; as the standard dictates that a page
    # is in iso-8859-1 (latin1) unless specified otherwise i.e. you
    # are merely stating the obvious. There are also some security
    # reasons in browsers, related to javascript and URL parsing
    # which encourage you to always set a default char set.
    #
    AddDefaultCharset ISO-8859-1

OK, so Apache sends out a character set header under the recommended configuration - although it's effectively hardcoded; it doesn't 'detect' the encoding of the file, since that's basically impossible in isolation.

To get Apache to send a different character set header for a specific file, you'd need to use Apache content negotiation - either with a type-map, or I believe it can base it off suffixes of the filename (index.html.iso8859-p15 and so on).

Consider the following response from Apache:

    andyh@server:~/public_html$ touch utf8.html.utf8
    andyh@server:~/public_html$ telnet localhost 80
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    HEAD /~andyh/utf8.html HTTP/1.0

    HTTP/1.1 200 OK
    Date: Tue, 21 Sep 2004 19:19:03 GMT
    Server: Apache/2.0.51 (Unix) PHP/5.0.1 DAV/2 SVN/1.0.6
    Content-Location: utf8.html.utf8
    Vary: negotiate
    TCN: choice
    Last-Modified: Tue, 21 Sep 2004 19:18:47 GMT
    ETag: "3811f-0-7f9b93c0;7f9b93c0"
    Accept-Ranges: bytes
    Connection: close
    Content-Type: text/html; charset=utf-8

    Connection closed by foreign host.

OK - so a filename of utf8.html.utf8 means that a request for utf8.html comes out labelled as utf8. (I've got content negotiation enabled on my server.)

Presumably in the case of multiple encodings for the same URI, the browser's Accept-charset header comes into play for Apache to pick which to serve.

> If I,
>as a web-designer, am not supposed to use http-equiv meta tags,
>because they are weak, then the information is not inside of the HTML
>file. So the information needs to be outside of the HTML file. And
>what is outside of the HTML file? If Apache remains agnostic about
>character encoding, then at what point does character encoding get
>sent? Where is the information stored, and how is it sent out to web
>browsers?

Either a type map, or encoded in the filename. (I can't speak for servers other than Apache.)

>Every character has an encoding by default, right? If no encoding is
>given, then there are a series of possible defaults, right? An Apache
>server may have a default, or PHP may have a default encoding set in
>the php.ini file, right?

Right.

> If no default is set anywhere then the
>characters are basically raw text, right? In other words, ASCII?

Ah, but even ASCII isn't raw text, depending on your definition of raw - it's the ASCII encoding of a small-ish character set. 'Binary' is the usual definition of completely raw data - it's just a stream of bytes with no defined correspondence to characters.

As to what the default in HTTP is - time to dig out the HTTP standards.

RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
<ftp://ftp.isi.edu/in-notes/rfc2616.txt>

"
3.4.1 Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.
"

"
3.7.1 Canonicalization and Text Defaults

[...]

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.
"

OK - so we officially default to ISO-8859-1, at least for text/* content types, which is a superset of ASCII, but definitely a well-defined character set and not just a raw stream of bytes. Makes sense.

>Or do I have it all wrong?

Definitely sounds like you've got the idea.

>> >> >I'm doing this so I can output to XML without getting errors about
>> >> >"You should not sent plain text".
>> >>
>> >> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
>> >> just properly escaped and the encoding set correctly.
>
>Sorry, I meant RSS. Most RSS validators throw an error if you try to
>set up an RSS feed using plain text.

Oh, is this just a case of the wrong Content-type though - text/plain or text/html vs. text/xml or whatever it is?

[snip]

>> I *think* form data is always in the character set of the page containing the
>> original form. I haven't got a reference to back that up, though.
>
>Yes, we had quite a conversation about that over on another newsgroup.
>It was quite informative. You can read it here, if you've any
>interest:
>
>http://groups.google.com/groups?hl=e...%3D10%26sa%3DN

Hm - Netscape 4 as ever is a complete mess then! Does anyone actually use NN4 any more? It's well past time it was blasted out of existence - does it do _anything_ right?

-- 
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
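
The defaulting rule quoted from RFC 2616 section 3.7.1 can be written down as a small function. This is only a sketch of the rule as quoted in this thread, assuming PHP 7.1 or later; the name `effective_charset` is hypothetical, and real browsers apply their own, more involved, detection logic.

```php
<?php
// Hypothetical helper: the charset a receiver would assume for a given
// Content-Type header value under RFC 2616 section 3.7.1.
function effective_charset(string $contentType): ?string
{
    // An explicit, well-formed charset parameter wins.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // No parameter: text/* defaults to ISO-8859-1 when received via HTTP.
    if (stripos(ltrim($contentType), 'text/') === 0) {
        return 'iso-8859-1';
    }
    return null; // the quoted rule defines no default for other media types
}
```

Under this reading, the original `text/html;charset:UTF-8` has no usable charset parameter and would fall back to `iso-8859-1`, which is why the colon-versus-equals typo matters.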