htmlentities & charencoding (PHP Language forum, PHP Programming Forums)
Hi all,

I was hoping to get some clarification on a couple of questions I have:

1) When should htmlspecialchars be used? As a general rule, should it be used for any text that may contain special characters and is going to be rendered in the browser (i.e. text that isn't in tags)? I've got a JavaScript onclick handler whose code includes an ampersand, and the HTML validator complains. I don't know if I should escape the ampersand, or even if it's possible (seeing that the text is inside an HTML attribute).

Why would you ever use htmlentities as opposed to htmlspecialchars? The only reason I can think of is if your page's charset doesn't support the special character you're trying to render (for example, the euro using Latin1), but then why wouldn't you just change the page's charset to UTF-8 (unless your editor can't save in UTF-8, which might indicate it's time to get another editor)? The comment on the PHP manual entry for htmlentities, 'Please, don't use htmlentities to avoid XSS! Htmlspecialchars is enough!', seems to suggest that the uses for htmlentities are limited, since it needn't be used to avoid XSS.

2) A comment in the PHP manual entry for htmlentities states that the function can be used to 'replace any characters in a string that could be 'dangerous' to put in an HTML/XML file with their numeric entities (e.g. &#233; for [e acute])'. Why would it be dangerous!?

3) What are some typical uses of specifying HTTP input/output character encoding? If it is used to convert output, why wouldn't you just change the output page's char encoding? If it's used to convert input from, say, UTF-8 to Latin1, couldn't you just use a function to do this?

That's about it!

Thanks in advance

Taras
Taras_96 wrote:
> 1) When should htmlspecialchars be used? As a general rule should
> it be used for text that may contain special characters that is going
> to be rendered in the browser (ie: text that isn't in tags)? I've got a
> javascript onclick handler whose code includes an ampersand and the
> HTML validator complains. I don't know if I should escape the
> ampersand, or even if it's possible (seeing that the text is inside a
> HTML attribute).
>

Well, basically you're either sending the character itself to the user (the literal copyright symbol, say) or giving the browser an instruction to display a copyright symbol.

I think the "dangerous" comment comes from the fact that MS platforms will sometimes show a blank for characters that display correctly on *nix, and that when an undefined notation is used in a page, it is not known what effect it will have on some platforms or how it will be displayed.

Flamer.
Message-ID: <1152576115.197347.115450@35g2000cwc.googlegroups.com> from Taras_96 contained the following:

>1) When should htmlspecialchars be used? As a general rule should
>it be used for text that may contain special characters that is going
>to be rendered in the browser (ie: text that isn't in tags)? I've got a
>javascript onclick handler whose code includes an ampersand and the
>HTML validator complains.

The people without javascript will complain too, when they can't navigate your site.

Just change the ampersand for &amp;

--
Geoff Berrow (put thecat out to email)
It's only Usenet, no one dies.
My opinions, not the committee's, mine.
Simple RFDs http://www.ckdog.co.uk/rfdmaker/
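A minimal sketch of that suggestion, assuming the handler code is assembled in PHP before being written into the attribute. The handler string and variable names are made up for illustration; htmlspecialchars() with ENT_QUOTES encodes the ampersands and both kinds of quote, and the browser decodes them again before running the script.

```php
<?php
// Hypothetical handler code; the ampersands are part of the JavaScript itself.
$js = "if (a && b) { alert('ok'); }";

// Encode for use inside an HTML attribute value.
// ENT_QUOTES covers single quotes as well as double quotes.
$attr = htmlspecialchars($js, ENT_QUOTES, 'UTF-8');

// The validator now sees &amp;&amp; in the attribute; the browser still
// hands "a && b" to the JavaScript engine after decoding the attribute.
echo '<a href="#" onclick="' . $attr . '">Go</a>', "\n";
```

The same call works for any attribute value, not just event handlers.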
Taras_96 wrote:
> Hi all,
>
> I was hoping to get some clarification on a couple of questions I have:
>
> 1) When should htmlspecialchars be used? As a general rule should
> it be used for text that may contain special characters that is going
> to be rendered in the browser (ie: text that isn't in tags)? I've got a
> javascript onclick handler whose code includes an ampersand and the
> HTML validator complains. I don't know if I should escape the
> ampersand, or even if it's possible (seeing that the text is inside a
> HTML attribute).
>

Well, I haven't looked at the code, but I suspect htmlspecialchars() would be faster, since it converts fewer characters and has fewer options.

The HTML validator on w3.org is decent, but it doesn't handle javascript very well. I just ignore the errors in javascript; for instance, something like:

    j=4&i;

The "&i" is not a valid html entity - but it's valid javascript code. And this javascript wouldn't work:

    j = 4&amp;i;

> Why would you ever use htmlentities as opposed to htmlspecialchars? The
> only reason I can think of is if your page's charset doesn't support
> the special character you're trying to render (for example, the euro
> using Latin1), but then why wouldn't you just change the page's charset
> to UTF-8 (unless your editor can't save in UTF-8, which might
> indicate it's time to get another editor). The comment on the PHP manual
> entry for htmlentities, 'Please, don't use htmlentities to avoid XSS!
> Htmlspecialchars is enough!' seems to suggest that the uses for
> htmlentities are limited, since it needn't be used to avoid XSS.
>

Just changing the page charset doesn't change what PHP uses. You can pass a charset to either function, but if you need more than the five chars handled by htmlspecialchars() you need to use htmlentities().

And the notes are comments - from users, not the PHP developers. I give them some credence, but not as much as the "official" word from the PHP developers. And if you look through them enough, you'll find errors, and other people who get in and correct the errors. Not that much different from what you find here on usenet.

> 2) A comment in the PHP manual entry for htmlentities states that the
> function can be used to 'replace any characters in a string that could
> be 'dangerous' to put in an HTML/XML file with their numeric entities
> (e.g. &#233; for [e acute])'. Why would it be dangerous!?
>

Don't know here, but I suspect browsers may act differently in different languages. But I have enough trouble with my native language, so I really haven't worried about it. But again, that's a user comment.

> 3) What are some typical uses of specifying HTTP input/output character
> encoding? If it is used to convert output, why wouldn't you just change
> the output page's char encoding? If it's used to convert input from say
> UTF-8 to Latin1, couldn't you just use a function to do this?
>

I use it anytime I'm displaying data input by the user, read from a database, etc. You never know when the data might contain a '<', a '"', etc.

Changing the char encoding for the page doesn't convert any characters. All it does is tell the browser how to handle the characters. It's up to you, the programmer, to ensure the character encoding you use matches that of the page.

> That's about it!
>
> Thanks in advance
>
> Taras
>

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
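To make the difference between the two functions concrete, here is a small sketch, assuming PHP 7 or later and UTF-8 data. The sample string and variable names are illustrative only. Both calls are told the charset explicitly, and the Content-Type header declares the same charset to the browser, which is the matching Jerry describes.

```php
<?php
// Sample text with markup characters plus two non-ASCII characters.
// "\u{e9}" is e-acute and "\u{20ac}" is the euro sign, both as UTF-8 bytes.
$text = "Caf\u{e9} <b>5 & 6</b> \u{20ac}";

// htmlspecialchars() only touches the characters that are markup-significant.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() additionally replaces every character that has a named entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

// Declare the encoding the output is actually in (no effect on the data itself).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo $special, "\n";   // e-acute and euro are left as literal UTF-8 characters
echo $entities, "\n";  // Caf&eacute; &lt;b&gt;5 &amp; 6&lt;/b&gt; &euro;
```

With the page served as UTF-8, both lines render identically in the browser; the htmlentities() output is just pure ASCII on the wire.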
On 2006-07-11 21:52:53 +1000, Jerry Stuckle <jstucklex@attglobal.net> said:
> Taras_96 wrote:
>> 1) When should htmlspecialchars be used? As a general rule should
>> it be used for text that may contain special characters that is going
>> to be rendered in the browser (ie: text that isn't in tags)? I've got a
>> javascript onclick handler whose code includes an ampersand and the
>> HTML validator complains. I don't know if I should escape the
>> ampersand, or even if it's possible (seeing that the text is inside a
>> HTML attribute).
>>
>
> Well, I haven't looked at the code, but I suspect htmlspecialchars(),
> since it converts fewer characters and has fewer options, it would be
> faster.
>
> The HTML validator on w3.org is decent, but it doesn't handle
> javascript very well. I just ignore the errors in javascript; for
> instance, something like:
>
> j=4&i;
>
> The "&i" is not a valid html entity - but it's valid javascript code.
> And this javascript wouldn't work:
>
> j = 4&amp;i;

No, it wouldn't, but valid XHTML _requires_ you to precede the embedded JavaScript with the appropriate CDATA marker. The character '&' is reserved by the markup just like '>' and '<'. Not adhering to the outlined standards simply encourages bad markup and makes cross-browser compatibility more difficult.

It's a big stretch to equate cross-browser issues with unencoded ampersands, but it's not that difficult to deal with. Javascript has some functional string methods for encoding HTML entities.
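For the XHTML case Mel describes, a commonly used idiom is to wrap the script in a CDATA section whose markers are hidden from JavaScript behind // comments. The sketch below assumes the page is generated from PHP; cdata_script() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper: wraps inline JavaScript in a commented CDATA section
// so an XML parser treats '&' and '<' inside the script as plain text.
function cdata_script(string $js): string
{
    // "]]>" would close the CDATA section early, so this sketch refuses it.
    if (strpos($js, ']]>') !== false) {
        throw new InvalidArgumentException('script must not contain "]]>"');
    }

    return "<script type=\"text/javascript\">\n//<![CDATA[\n"
        . $js . "\n//]]>\n</script>\n";
}

// The ampersand stays unencoded, exactly as the JavaScript needs it.
$html = cdata_script('j = 4 & i;');
echo $html;
```

In plain HTML 4.01 the wrapper is unnecessary, which is the point taken up in the replies that follow.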
Mel wrote:
> On 2006-07-11 21:52:53 +1000, Jerry Stuckle <jstucklex@attglobal.net> said:
>
>> Well, I haven't looked at the code, but I suspect htmlspecialchars(),
>> since it converts fewer characters and has fewer options, it would be
>> faster.
>>
>> The HTML validator on w3.org is decent, but it doesn't handle
>> javascript very well. I just ignore the errors in javascript; for
>> instance, something like:
>>
>> j=4&i;
>>
>> The "&i" is not a valid html entity - but it's valid javascript code.
>> And this javascript wouldn't work:
>>
>> j = 4&amp;i;
>
> No, it wouldn't, but valid XHTML _requires_ you to precede the embedded
> JavaScript with the appropriate CDATA marker. The character '&' is
> reserved by the markup just like '>' and '<'. Not adhering to the
> outlined standards simply encourages bad markup and makes cross-browser
> compatibility more difficult. It's a big stretch to equate cross-browser
> issues with unencoded ampersands, but it's not that difficult to deal
> with. Javascript has some functional string methods for encoding HTML
> entities.
>

Who said anything about XHTML? This is straight html.

And the point is - this is valid javascript, but the validator on w3.org doesn't recognize it as such. Therefore it spits out errors where there are none.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
On Tue, 11 Jul 2006 17:36:20 -0400, Jerry Stuckle <jstucklex@attglobal.net> wrote:

>Mel wrote:
>> On 2006-07-11 21:52:53 +1000, Jerry Stuckle <jstucklex@attglobal.net> said:
>>
>>> The HTML validator on w3.org is decent, but it doesn't handle
>>> javascript very well. I just ignore the errors in javascript; for
>>> instance, something like:
>>>
>>> j=4&i;
>>>
>>> The "&i" is not a valid html entity - but it's valid javascript code.
>>> And this javascript wouldn't work:
>>>
>>> j = 4&amp;i;
>>
>> No, it wouldn't, but valid XHTML _requires_ you to precede the embedded
>> JavaScript with the appropriate CDATA marker. The character '&' is
>> reserved by the markup just like '>' and '<'. Not adhering to the
>> outlined standards simply encourages bad markup and makes cross-browser
>> compatibility more difficult. It's a big stretch to equate cross-browser
>> issues with unencoded ampersands, but it's not that difficult to deal
>> with. Javascript has some functional string methods for encoding HTML
>> entities.
>
>Who said anything about XHTML? This is straight html.
>
>And the point is - this is valid javascript, but the validator on w3.org
>doesn't recognize it as such. Therefore it spits out errors where there
>are none.

Yes, this seems to be backed up by HTML 4.01 appendix B.3.2, which even has an example of the contents of a <script> element in VBScript using & as a string concatenation operator.

http://www.w3.org/TR/html4/appendix/...pecifying-data

It discusses how to avoid accidentally closing the <script> element, but seems to indicate that & doesn't start a character reference inside <script>, as that's automatically CDATA. So validators producing errors in this case would appear to be wrong.

However, validator.w3.org currently handles the example given without error. I uploaded the following:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Strict //EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">
    <title>Page</title>
    </head>
    <body>

    <script type="text/javascript">
    j=4&i;
    </script>

    </body>
    </html>

It responded:

    This Page Is Valid -//W3C//DTD HTML 4.01 Strict //EN!

(it also validates as Transitional, unsurprisingly)

Has its behaviour changed recently? Did it used to produce errors in this case?

The "HTML Tidy" validator as used in the HTML Validator Firefox extension also accepts & within <script> without complaint, and correctly complains about "</" appearing in the script source.

--
Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
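The "</" problem mentioned above comes up whenever server-side data is written into a <script> element. One way to sidestep it from PHP, offered here as a sketch rather than as anything the posters used, is json_encode(), which by default escapes "/" as "\/" and leaves "&" alone. The message text and variable names are illustrative.

```php
<?php
// Illustrative data that would otherwise close the <script> element early.
$message = 'Tags like </script> & ampersands are fine here';

// json_encode() produces a quoted JavaScript string literal. Its default
// behaviour escapes "/" as "\/", so the sequence "</" never appears.
$js = 'var msg = ' . json_encode($message) . ';';

// The ampersand is left as-is, which is correct inside an HTML 4.01
// <script> element because its content is CDATA.
echo "<script type=\"text/javascript\">\n", $js, "\n</script>\n";
```

Passing JSON_UNESCAPED_SLASHES would turn that protection off, so it is best avoided when the result is going inside a <script> element.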
Andy Hassall wrote:
> (it also validates as Transitional, unsurprisingly) Has its behaviour changed
> recently? Did it used to produce errors in this case?
>
> The "HTML Tidy" validator as used in the HTML Validator Firefox extension also
> accepts & within <script> without complaint, and correctly complains about "</"
> appearing in the script source.
>

Andy,

They might have fixed it. I hope so. I've had problems with it before. I just ignore any errors within <script> elements.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================