This is a discussion on Using DOM textContent Property within the PHP General forums, part of the PHP Programming Forums category.
Hello,
I am writing a filter in PHP that takes some HTML as input, goes through the HTML, and adjusts certain tag attributes as needed. So, for example, if an <a> tag is missing the "title" attribute, this filter adds a title attribute to the <a> tag.

I'm doing this all using PHP 5 and the DOM parsing library, and it's working really well.

The one snafu I'm running into is dealing with users who will just type an e-mail address into an HTML document without actually making it a link. So, they'll just put foo@bar.com rather than <a href="mailto:foo@bar.com">foo@bar.com</a>. I'd like my filter to be able to grab those plain-text e-mail addresses and convert them to actual clickable links.

I tried iterating through all the elements on a page using something like this:

$Elements = $HTML->getElementsByTagName("*");

for ($X = 0; $X < $Elements->length; $X++) {
  ... SNIP ...
}

And then I tried looking at the textContent property of each node, but it seems that higher-level nodes include all the text of their child nodes (which is what the DOM documentation says they should). But there doesn't appear to be any way to know if the textContent you've got is for just one node, or for a whole bunch of nodes. Is there any way to figure that out, so that I can adjust the textContent property of just the lowest-level nodes, rather than mucking up the higher-level ones?

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354
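For readers following the question, here is a minimal sketch of the behavior described above: textContent on an element includes the text of every descendant, while the text nodes that are direct children of the element carry only that element's own text. The helper name ownText and the sample markup are assumptions made up for illustration; they are not from the thread.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Mail <b>foo@bar.com</b> today</p></body></html>');

/**
 * Hypothetical helper: concatenate only the text nodes that are direct
 * children of $element, ignoring text inside nested elements.
 */
function ownText(DOMElement $element)
{
    $text = '';
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text .= $child->nodeValue;
        }
    }
    return $text;
}

$p = $doc->getElementsByTagName('p')->item(0);
$b = $doc->getElementsByTagName('b')->item(0);

echo $p->textContent, PHP_EOL; // text of <p> plus all of its descendants
echo ownText($p), PHP_EOL;     // only the text directly inside <p>
echo ownText($b), PHP_EOL;     // only the text directly inside <b>
```

With this split, an address is only ever seen on the element that directly contains it, which is one way to avoid touching the higher-level nodes.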
|
|||
|
On Tue, Sep 2, 2008 at 3:18 PM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:
> And then I tried looking at the textContent property of each node, but it
> seems that higher-level nodes include all the text of their children nodes
> (which is what the DOM documents say it should). But there doesn't appear
> to be any way to know if the textContent you've got is for just one node,
> or for a whole bunch of nodes. Is there any way to figure that out, so that I
> can adjust the textContent property of just the lowest-level nodes, rather
> than mucking up the higher-level ones?

if a node has children, then it's not a leaf, so i imagine you could continue to traverse until you reach the leaf that actually has the address needing magical conversion.

also, for a performance increase, if you don't find a match at a high level, you could skip that entire sub-section of the tree; no need to go down to a leaf if you know there's no magic needed for the current branch :)

-nathan
|
|||
|
> if a node has children, then its not a leaf, so i imagine
> you could continue to traverse until you reach the leaf
> that actually has the address needing magical conversion.

I tried that. $Element->hasChildNodes() returns true for just about everything except tags like <br> and <img> that have no corresponding </br> or </img>, because the content that appears between <p> and </p>, for example, apparently counts as a child node, even though it's not an HTML tag. So, if you have:

<p>Foo!</p>

when you look at $Element->hasChildNodes() for the <p> tag, you will get "true", and $Element->childNodes->length is equal to "1", even though "Foo!" isn't an HTML tag.

Interestingly though, when you iterate through the tree, you get the <p> tag as one of the elements, but you never get a text-only element that has that <p> as a parentNode. In fact, get_class($Element) always returns DOMElement, even on the text-only nodes, which I would have expected to be DOMText elements...but I guess not.

So I'm wondering why $Element->hasChildNodes() would return true, but iterating through the DOM tree returns no elements that have that $Element as a parentNode.

What's more, looking at $Element->childNodes->length isn't too helpful, because, for example:

<h2><a name="bar"></a>Foo</h2>

returns two child nodes, neither of which has "Foo" for its textContent.

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354
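A likely explanation for the puzzle above is that getElementsByTagName("*") only ever returns element nodes, so text nodes never show up in that list; they are reachable only through childNodes of their parent. The sketch below inspects the direct children of the <h2> from the post. The variable names are illustrative, and the result is what the stock DOM extension is expected to produce, not something reported in the thread.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><h2><a name="bar"></a>Foo</h2></body></html>');

$h2 = $doc->getElementsByTagName('h2')->item(0);

// Record "class: name-or-value" for each direct child of <h2>.
$seen = array();
foreach ($h2->childNodes as $child) {
    $label = ($child instanceof DOMText) ? $child->nodeValue : $child->nodeName;
    $seen[] = get_class($child) . ': ' . $label;
}
print_r($seen);

// Count text nodes in the flat element list: there should be none,
// because getElementsByTagName() matches elements only.
$textNodesInList = 0;
foreach ($doc->getElementsByTagName('*') as $node) {
    if ($node instanceof DOMText) {
        $textNodesInList++;
    }
}
echo $textNodesInList, PHP_EOL;
```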
|
|||
|
Tim Gustafson wrote:
> $Elements = $HTML->getElementsByTagName("*");
>
> for ($X = 0; $X < $Elements->length; $X++) {
>   ... SNIP ...
> }

Why not use XPath?
<http://fr.php.net/manual/en/class.domxpath.php>
<http://fr.php.net/manual/en/domxpath.query.php>

This query fetches all a elements with no title attribute or an empty title attribute:
'//a[not(@title) or @title = ""]' ;

--
Mickaël Wolff aka Lupus Michaelis
http://lupusmic.org
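As a rough illustration of the XPath suggestion, the sketch below runs that query through DOMXPath and fills in the missing titles. Using the link text as the title is an assumption made for the example; the thread does not say what value the filter should use.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body>'
    . '<a href="/one">One</a>'
    . '<a href="/two" title="Two">Two</a>'
    . '<a href="/three" title="">Three</a>'
    . '</body></html>'
);

$xpath = new DOMXPath($doc);

// Select <a> elements whose title attribute is absent or empty.
$links = $xpath->query('//a[not(@title) or @title = ""]');
$matched = $links->length;

foreach ($links as $link) {
    // Assumption for illustration: reuse the link text as the title.
    $link->setAttribute('title', trim($link->textContent));
}

// Collect the resulting titles in document order.
$titles = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $titles[] = $a->getAttribute('title');
}
print_r($titles);
```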
|
|||
|
Tim Gustafson wrote:
> Hello,
>
> I am writing a filter in PHP that takes some HTML as input and goes through
> the HTML and adjusts certain tag attributes as needed. So, for example, if
> <a> tag is missing the "title" attribute, this filter adds a title attribute
> to the <a> tag.
>
> I'm doing this all using PHP 5 and the DOM parsing library, and it's working
> really well.
>
> The one snafu I'm running in to is dealing with users who will just type an
> e-mail address into an HTML document without actually making it a link - so,
> they'll just put foo@bar.com rather than <a
> href="mailto:foo@bar.com">foo@bar.com</a>. I'd like for these incorrectly
> entered e-mail addresses to magically change into real clickable links, so
> I'd like my filter to be able to grab those plain text e-mail addresses and
> convert them to actual clickable links.
>
> I tried iterating through all the elements on a page using something like
> this:
>
> $Elements = $HTML->getElementsByTagName("*");
>
> for ($X = 0; $X < $Elements->length; $X++) {
>   ... SNIP ...
> }

I think you might be better off using a regexp on the text *before* sending it through the DOM parser. Send the user's text through a function that searches for URLs and email addresses, creating proper links as they're found, then use the output from that to move on to your DOM stuff. That way, you need not create new nodes in your nodelist.
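A minimal sketch of that pre-processing idea follows. It assumes a deliberately simplified e-mail pattern and a hypothetical function name, linkifyEmails; neither comes from the thread. To avoid wrapping addresses that are already linked or that sit inside attribute values, it splits the input on tags and only rewrites text that is outside any <a> element.

```php
<?php
/**
 * Hypothetical pre-pass: wrap bare e-mail addresses in mailto links.
 * The pattern is a simplification, not a full address validator.
 */
function linkifyEmails($html)
{
    $emailPattern = '/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i';

    // Split into alternating text and tag tokens, keeping the tags.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $insideAnchor = false;
    $out = '';
    foreach ($tokens as $token) {
        if ($token !== '' && $token[0] === '<') {
            // Tag token: track whether we are inside an <a> element.
            if (preg_match('/^<a\b/i', $token)) {
                $insideAnchor = true;
            } elseif (preg_match('/^<\/a\s*>/i', $token)) {
                $insideAnchor = false;
            }
            $out .= $token;
        } elseif ($insideAnchor) {
            // Text that is already part of a link stays untouched.
            $out .= $token;
        } else {
            $out .= preg_replace($emailPattern, '<a href="mailto:$0">$0</a>', $token);
        }
    }
    return $out;
}

$out = linkifyEmails('Write to foo@bar.com or <a href="mailto:baz@bar.com">baz@bar.com</a>.');
echo $out, PHP_EOL;
```

The trade-off is that this works on raw markup, so malformed tags or stray angle brackets in the input can confuse the tokenizer in ways a DOM-based pass would not.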
|
|||
|
Lupus Michaelis wrote:
> Tim Gustafson wrote:
>
>> $Elements = $HTML->getElementsByTagName("*");
>>
>> for ($X = 0; $X < $Elements->length; $X++) {
>>   ... SNIP ...
>> }
>
> Why don't use the XPath ?
> <http://fr.php.net/manual/en/class.domxpath.php>
> <http://fr.php.net/manual/en/domxpath.query.php>
>
> This query fetch all a elements with no title attribute or empty title
> attribute : '//a[not(@title) or @title = ""]' ;

That example was for finding email addresses and turning them into links, not the other thing about adding missing attributes. XPath would be no help with the former.
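For what it's worth, XPath can also select text nodes, so it may still help narrow down candidates even if it cannot build the links itself. The sketch below is only an illustration under that assumption: the query picks text nodes that contain "@" and are not already inside an <a> element. The sample markup is invented for the example.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body>'
    . '<p>foo@bar.com</p>'
    . '<p><a href="mailto:baz@bar.com">baz@bar.com</a></p>'
    . '<p>no address here</p>'
    . '</body></html>'
);

$xpath = new DOMXPath($doc);

// text() selects text nodes; the predicates keep those containing "@"
// that have no <a> ancestor.
$candidates = $xpath->query('//text()[contains(., "@")][not(ancestor::a)]');

$found = array();
foreach ($candidates as $textNode) {
    $found[] = get_class($textNode) . ': ' . $textNode->nodeValue;
}
print_r($found);
```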
|
|||
|
> I think you might be better off using regexp on the text
> *before* sending it through the DOM parser. Send the
> user's text through a function that searches for URLs
> and email addresses, creating proper links as they're
> found, then use the output from that to move on to your
> DOM stuff. That way, you need not create new nodes in
> your nodelist.

I think that's the way I'm going to have to go, but I was really hoping not to. Thanks for the suggestion!

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354
|
|||
|
php@logi.ca wrote:
> That example was for finding email addresses and turning them into
> links, not the other thing about adding missing attributes. XPATH would
> be no help with the former.

You're right, I misunderstood :-/ Sorry for the noise.

--
Mickaël Wolff aka Lupus Michaelis
http://lupusmic.org
|
|||
|
On Wed, Sep 3, 2008 at 10:03 AM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:
> > I think you might be better off using regexp on the text
> > *before* sending it through the DOM parser. Send the
> > user's text through a function that searches for URLs
> > and email addresses, creating proper links as they're
> > found, then use the output from that to move on to your
> > DOM stuff. That way, you need not create new nodes in
> > your nodelist.
>
> I think that's the way I'm going to have to go, but I was really hoping not
> to. Thanks for the suggestion!

i think i have what you're looking for Tim, take a look at this script output

nathan@devel ~ $ php testDom.php
IN:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2>quickshiftin@gmail.com<a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>

OUT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><a href="mailto:quickshiftin@gmail.com">quickshiftin@gmail.com</a><a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>

and here's the code using the DOM extension. you may have to tweak it to suit your needs, but currently i think it does the trick ;)

<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><h2>quickshiftin@gmail.com<a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>');
echo 'IN:' . PHP_EOL . $doc->saveXML() . PHP_EOL;
findTextNodes($doc->getElementsByTagName('*'), 'convertToLinkIfNecc');
echo 'OUT: ' . PHP_EOL . $doc->saveXML() . PHP_EOL;

/**
 * run through a DOMNodeList, looking for text nodes. apply a callback to
 * all such text nodes that are encountered
 */
function findTextNodes(DOMNodeList $nodesToSearch, $callback) {
    foreach($nodesToSearch as $curNode) {
        if($curNode->hasChildNodes()) {
            foreach($curNode->childNodes as $curChild) {
                if($curChild instanceof DOMText) {
                    #echo "TEXT NODE FOUND: " . $curChild->nodeValue . PHP_EOL;
                    /// todo: allow use of hook here
                    call_user_func($callback, $curNode, $curChild);
                }
            }
        }
    }
}

/**
 * determine if a node should be modified, by checking to see if a child is a text node
 * and the text looks like an email address.
 * call a subordinate function to convert the text node into a mailto anchor DOMElement
 */
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    if( (strtolower($textContainer->nodeName) != 'a') &&
        (filter_var($textNode->nodeValue, FILTER_VALIDATE_EMAIL) !== false) ) {
        convertMailtoToAnchor($textContainer, $textNode);
    }
}

/**
 * modify a DOMElement that has a DOMText node as a child; create a DOMElement
 * that represents an a tag, and set the value and href attribute, so that it
 * acts as a 'mailto' link
 */
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    // attach the new element first; an unattached DOMElement is read-only
    $textContainer->replaceChild($newNode, $textNode);
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}

-nathan
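One limitation of the script above is that it only converts a text node whose entire value is an e-mail address. As a hedged variation, the sketch below handles an address embedded in a longer sentence by splitting the text node into plain text and link pieces. The function name linkifyTextNode and the simplified pattern are assumptions for illustration, not part of nathan's code.

```php
<?php
/**
 * Hypothetical helper: replace one text node with a mix of text nodes and
 * mailto anchors, one anchor per address found in the text.
 */
function linkifyTextNode(DOMText $textNode)
{
    // Simplified address pattern; the capture group keeps matches in the split.
    $pattern = '/([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,})/i';
    $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        return; // no address in this node
    }

    $doc = $textNode->ownerDocument;
    $parent = $textNode->parentNode;
    foreach ($parts as $i => $part) {
        if ($part === '') {
            continue;
        }
        if ($i % 2 === 1) {
            // Odd indexes hold the captured addresses.
            $a = $doc->createElement('a');
            $a->setAttribute('href', 'mailto:' . $part);
            $a->appendChild($doc->createTextNode($part));
            $parent->insertBefore($a, $textNode);
        } else {
            $parent->insertBefore($doc->createTextNode($part), $textNode);
        }
    }
    $parent->removeChild($textNode);
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Contact foo@bar.com for details.</p></body></html>');

// The XPath result is a snapshot, so the tree can be edited while looping.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    linkifyTextNode($textNode);
}

$p = $doc->getElementsByTagName('p')->item(0);
$result = $doc->saveHTML($p);
echo $result, PHP_EOL;
```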
|
|||
|
Nathan,
Thanks for the suggestion, but it's still not working for me. Here's my code:

===========

$HTML = new DOMDocument();
@$HTML->loadHTML($text);
$Elements = $HTML->getElementsByTagName("*");

for ($X = 0; $X < $Elements->length; $X++) {
  $Element = $Elements->item($X);

  if ($Element->tagName == "a") {
    # SNIP - Do something with A tags here
  } else if ($Element instanceof DOMText) {
    echo $Element->nodeValue;
    exit;
  }
}

===========

This loop never executes the instanceof part of the code. If I add:

  } else if ($Element instanceof DOMNode) {
    echo "foo!";
    exit;
  }

Then it echoes "foo!" as expected. It just seems that none of the nodes in the tree are DOMText nodes. In fact, get_class($Element) returns "DOMElement" for every node in the tree.

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354
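A possible reason the instanceof DOMText branch never runs is that the list returned by getElementsByTagName("*") contains elements only, so each $Element really is a DOMElement. The sketch below keeps the same loop but looks for DOMText among each element's childNodes, which is where nathan's script finds them. The sample $text value and the $TextFound array are assumptions added so the example can run on its own.

```php
<?php
// Sample input, invented for illustration.
$text = '<html><body><p>foo@bar.com</p><p><a href="#x">link</a> tail</p></body></html>';

$HTML = new DOMDocument();
@$HTML->loadHTML($text);
$Elements = $HTML->getElementsByTagName("*");

$TextFound = array();
for ($X = 0; $X < $Elements->length; $X++) {
    $Element = $Elements->item($X);

    if ($Element->tagName == "a") {
        # Do something with A tags here; their text is left alone.
        continue;
    }

    // Text nodes are children of the element, not members of $Elements.
    foreach ($Element->childNodes as $Child) {
        if ($Child instanceof DOMText) {
            $TextFound[] = $Element->tagName . ': ' . $Child->nodeValue;
        }
    }
}
print_r($TextFound);
```

Because each text node has exactly one parent, every piece of text is visited once, on the element that directly contains it.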