This is a discussion on "how to tell server from PHP that charset is UTF-8??" within the PHP Language forums, part of the PHP Programming Forums category.
Andy Hassall <andy@andyh.co.uk> wrote:
> OK - so we officially default to ISO-8859-1, at least for text/* content
> types, which is a superset of ASCII, but definitely a well-defined character
> set and not just a raw stream of bytes. Makes sense.

Completely true... almost. text/html has Unicode as its characterset according to the W3C[1]; the charset header is nothing more than the encoding used to transport the data. ISO-8859-1 is the best choice if you need only the first 256 characters of Unicode. If you need more characters, one of the UTF-x encodings should be used.

[1] http://www.w3.org/TR/html401/charset.html

--
Daniel Tryba
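Daniel's point about the "first 256 characters" can be checked with a small sketch. This is an illustration only, written in Python rather than PHP; the function name `fits_latin1` and the sample strings are made up for the example and do not come from the thread.

```python
# Sketch: ISO-8859-1 can only encode code points U+0000..U+00FF,
# and each such code point becomes the single byte with the same value.
# fits_latin1 is a hypothetical helper name used for illustration.

def fits_latin1(text: str) -> bool:
    """Return True if every character of text can be encoded as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Sample strings are arbitrary: one inside the Latin-1 range, two outside it.
    for sample in ["café", "price: €5", "日本語"]:
        utf8_length = len(sample.encode("utf-8"))
        print(repr(sample), "fits ISO-8859-1:", fits_latin1(sample),
              "| UTF-8 bytes:", utf8_length)
```

Text that fails the check needs one of the UTF encodings, and the charset parameter sent with it has to say so.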
Daniel Tryba wrote:
> Andy Hassall <andy@andyh.co.uk> wrote:
> > OK - so we officially default to ISO-8859-1, at least for text/* content
> > types, which is a superset of ASCII, but definitely a well-defined character
> > set and not just a raw stream of bytes. Makes sense.
>
> Completely true... almost. text/html has Unicode as its characterset
> according to the W3C[1],

'Character set', with or without a space, breeds confusion.

http://www.w3.org/MarkUp/html-spec/charset-harmful.html

If by 'characterset' you meant HTML4.01's document character set, you're right. But HTML's document character set is unrelated to this discussion. If however you meant character encoding, you're wrong, because any encoding is allowed. Did you mean something else?

RFC2854 sec. 6 lists sources that specify the default when a text/html document is served without explicitly declaring its character encoding. Despite RFC2616 defining text/*'s default character encoding as ISO-8859-1, HTML4.01 conforming user-agents mustn't assume any default value:

'The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.' (HTML4.01 sec. 5.2.2.)

So it'd be absurd to heed the advice given in RFC2616 sec. 19.3, which says that 'not labelling the entity is preferred over labelling the entity with the labels US-ASCII or ISO-8859-1'.

The usual ciwa* recommendation stands, discord notwithstanding: send a charset parameter.

[ ... ]

Roll on the weekend!

--
Jock
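To make the "no default value" rule concrete, here is a minimal sketch of how a client might read the charset parameter and report its absence instead of guessing. It is in Python purely for illustration; `declared_charset` is a hypothetical helper name, and the header values are examples, not taken from any server in this thread.

```python
# Sketch: extract the charset parameter from a Content-Type header value.
# Returns None when the parameter is absent, rather than assuming ISO-8859-1.
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if none was sent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


if __name__ == "__main__":
    # Example header values, chosen for illustration only.
    for value in ["text/html; charset=UTF-8",
                  'text/html; charset="ISO-8859-1"',
                  "text/html"]:
        print(repr(value), "->", declared_charset(value))
```

A `None` result is the case HTML4.01 sec. 5.2.2 is talking about: the client has to fall back on other evidence, which is why sending the parameter is the safer habit.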
Andy Hassall <andy@andyh.co.uk> wrote in message news:<36v0l0d9sm2t2f1e0n9s51f3ajc692boda@4ax.com>...
> OK, so Apache sends out a character set heading under the recommended
> configuration - although it's effectively hardcoded; it doesn't 'detect' the
> encoding of the file since that's basically impossible in isolation.
>
> To get Apache to send out a character set header for a specific file, you'd
> then need to use Apache content negotiation if you wanted to select a different
> character set for a particular file - either with a type-map or I believe it
> can base it off suffixes of the filename (index.html.iso8859-p15 and so on).
>
> Consider the following response from Apache:
>
> andyh@server:~/public_html$ touch utf8.html.utf8
> andyh@server:~/public_html$ telnet localhost 80
> Trying 127.0.0.1...
> Connected to localhost.
> Escape character is '^]'.
> HEAD /~andyh/utf8.html HTTP/1.0
>
> HTTP/1.1 200 OK
> Date: Tue, 21 Sep 2004 19:19:03 GMT
> Server: Apache/2.0.51 (Unix) PHP/5.0.1 DAV/2 SVN/1.0.6
> Content-Location: utf8.html.utf8
> Vary: negotiate
> TCN: choice
> Last-Modified: Tue, 21 Sep 2004 19:18:47 GMT
> ETag: "3811f-0-7f9b93c0;7f9b93c0"
> Accept-Ranges: bytes
> Connection: close
> Content-Type: text/html; charset=utf-8
>
> Connection closed by foreign host.
>
> OK - so a filename of utf8.html.utf8 means that a request for utf8.html comes
> out in utf8 encoding. (I've got content negotiation enabled on my server).
>
> Presumably in the case of multiple encodings for the same URI then the
> browser's Accept-charset header comes into play for Apache to pick which to
> serve.

That's very interesting. Thanks for doing that bit of digging.

I'm sorry to say I've temporarily been handed responsibility for keeping an Apache server going, though I don't know much about Apache. We're hosting about 30 different domains on this machine. Most of those domains have individuals who are handling all the web design for that domain. If I set a default charset for Apache, how do the individual web designers override the decision, if they need to? An .htaccess file? http-equiv meta tags? Just curious.
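The telnet check quoted above can also be scripted, which may help when checking what each of many hosted domains actually sends. The following is only a sketch in Python: it starts a throwaway local server that always labels its responses as UTF-8, then issues a HEAD request and reads back the Content-Type header. The names `Utf8Handler`, `head_content_type` and `demo` are hypothetical, and the stand-in server is not Apache; point `head_content_type` at a real host and path to inspect a live site.

```python
# Sketch: repeat the "HEAD request, look at Content-Type" check from a script.
# The local server below is a stand-in for illustration, not Apache.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Utf8Handler(BaseHTTPRequestHandler):
    """Stand-in server that always declares charset=utf-8."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write("<p>caf\u00e9</p>".encode("utf-8"))

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def head_content_type(host, port, path="/"):
    """Send a HEAD request and return the Content-Type header, or None."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # a HEAD response has no body; this just finishes it
        return response.getheader("Content-Type")
    finally:
        conn.close()


def demo():
    """Start the stand-in server on a free local port and query it once."""
    server = HTTPServer(("127.0.0.1", 0), Utf8Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return head_content_type("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

If the returned value has no charset parameter, the server is leaving the decision to the browser, which is the situation the earlier posts warn against.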