Using DOM textContent Property (PHP General forums, PHP Programming Forums category)
bouncing back to the list so that others may benefit from our work...

On Fri, Sep 5, 2008 at 3:09 PM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:

> Nathan,
>
> Thanks for the suggestion, but it's still not working for me. Here's my
> code:
>
> ===========
> $HTML = new DOMDocument();
> @$HTML->loadHTML($text);
> $Elements = $HTML->getElementsByTagName("*");
>
> for ($X = 0; $X < $Elements->length; $X++) {
>   $Element = $Elements->item($X);
>
>   if ($Element->tagName == "a") {
>     # SNIP - Do something with A tags here
>   } else if ($Element instanceof DOMText) {
>     echo $Element->nodeValue; exit;
>   }
> }
> ===========
>
> This loop never executes the instanceof part of the code. If I add:
>
>   } else if ($Element instanceof DOMNode) {
>     echo "foo!"; exit;
>   }
>
> Then it echos "foo!" as expected. It just seems that none of the nodes in
> the tree are DOMText nodes. In fact, get_class($Element) returns
> "DOMElement" for every node in the tree.

Tim,

i got your code working with minimal effort by pulling in two of the methods i posted and making some revisions. scope it out (this will produce the same output as my last post, the part after OUT:)

<?php
$text = '<html><body>Test<br><h2>quickshiftin@gmail.com<a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>';
$HTML = new DOMDocument();
$HTML->loadHTML($text);
$Elements = $HTML->getElementsByTagName("*");

for ($X = 0; $X < $Elements->length; $X++) {
    $Element = $Elements->item($X);
    if ($Element->hasChildNodes()) {
        foreach ($Element->childNodes as $curChild) {
            if ($curChild->nodeName == "a") {
                # SNIP - Do something with A tags here
            } else if ($curChild instanceof DOMText) {
                convertToLinkIfNecc($Element, $curChild);
            }
        }
    }
}
echo $HTML->saveXML() . PHP_EOL;

// turn a text node into a mailto link if it is an email address and is not already inside an <a>
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    if ((strtolower($textContainer->nodeName) != 'a') &&
        (filter_var($textNode->nodeValue, FILTER_VALIDATE_EMAIL) !== false)) {
        convertMailtoToAnchor($textContainer, $textNode);
    }
}

// replace the text node with an <a href="mailto:..."> element carrying the same text
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    $textContainer->replaceChild($newNode, $textNode);
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}
?>

so, the problem is that getElementsByTagName("*") only returns element nodes (at every depth of the tree); text nodes are never part of that list. this is why you need to call hasChildNodes() on each element and, if that is true, iterate across its childNodes property (it's a property, not a method) to reach the DOMText children. because the list already holds every element in the document, this reaches text nodes at any depth. if you walk the tree yourself from a single root instead, the code has to do the same thing at each level, calling hasChildNodes() and iterating over childNodes. i'm sure you can cook up something that will recurse down to the leaves :)

anyway, i'm going to try and hook up a RecursiveDOMDocumentIterator that implements RecursiveIterator so that it has the convenient foreach support. also, i'll probably try to hook up a Filter variant of this class so that situations like this are trivial.

stay tuned :D

-nathan
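The recursion down to the leaves mentioned above could look something like the following. This is only a sketch: collectTextNodes() is a hypothetical helper name that does not appear in the thread, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): collect every DOMText node
// below $node, at any depth, into a plain array.
function collectTextNodes(DOMNode $node, array &$found = array())
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $found[] = $child;                 // a leaf we are interested in
        } elseif ($child->hasChildNodes()) {
            collectTextNodes($child, $found);  // recurse one level down
        }
    }
    return $found;
}

// Made-up sample document with text nested several levels deep.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p>deep <b>text</b></p></div></body></html>');

foreach (collectTextNodes($doc->documentElement) as $textNode) {
    echo $textNode->parentNode->nodeName, ': ', $textNode->nodeValue, PHP_EOL;
}
```

Collecting the nodes into an array first, and only then modifying them, also avoids changing the tree while a live childNodes list is being iterated.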
Hi Nathan,

if you're already speaking of iterating children, i'd like to ask you another question:

Basically i was trying to do the same thing as Tim when i ran into some difficulties iterating over DOMElement->childNodes with foreach and manipulating strings inside the nodes, or even replacing a DOMElement/DOMNode/DOMText with another node. Instead, i am currently iterating like this:

$child = $element->firstChild;
while ($child != null) {
    $next_sibling = $child->nextSibling;

    // Do something with child (manipulate, replace, ...)

    // Continue iteration
    $child = $next_sibling;
}

Is this correct, or is there any better way?

Thank you in advance!
Mario
On Tue, Sep 9, 2008 at 12:37 AM, Mario Trojan <mtrojan@transline.de> wrote:

> Hi Nathan,
>
> if you're already speaking of iterating children, i'd like to ask you
> another question:
>
> Basically i was trying to do the same thing as Tim, when i experienced some
> difficulties iterating over DOMElement->childNodes with foreach and
> manipulating strings inside the nodes or even replacing
> DOMElement/DOMNode/DOMText with another node. Instead, i am currently
> iterating like this:
>
> $child = $element->firstChild;
> while ($child != null) {
>     $next_sibling = $child->nextSibling;
>
>     // Do something with child (manipulate, replace, ...)
>
>     // Continue iteration
>     $child = $next_sibling;
> }
>
> Is this correct, or is there any better way?

i found this the other day on the DOMNodeList page on php.net. foreach over childNodes essentially does what you are doing under the hood: it walks horizontally across the nodes at one level and does not recurse into their children. the difference is that childNodes is a live list, so replacing or removing nodes inside a foreach can make it skip nodes or stop early. driving the iteration yourself and saving nextSibling before you touch the current node, as you have shown, is a sound way around that. either way, i think the answer to your question is that you must go through the parent node to perform manipulations on the dom during iteration, see below (hope it helps :D),

-nathan

a dot buffa at sns dot it, 29-May-2008 04:28
<http://us2.php.net/manual/en/class.domnodelist.php#83513>

I agree with drichter at muvicom dot de. For instance, in order to delete each child node of a particular parent node:

<?php
while ($parentNode->hasChildNodes()) {
    $domNodeList = $parentNode->childNodes;
    $parentNode->removeChild($domNodeList->item(0));
}
?>

In other words, you have to update the DomNodeList on every iteration.

In my opinion, the DomNodeList class is useless.
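As an alternative to re-reading the list on every iteration, the live DOMNodeList can be copied into a plain array before anything is changed. The sketch below assumes a made-up XML fragment and a hypothetical replacement element name (item); only iterator_to_array(), createElement() and replaceChild() are standard calls.

```php
<?php
// Made-up sample data for illustration.
$doc = new DOMDocument();
$doc->loadXML('<ul><li>one</li><li>two</li><li>three</li></ul>');
$list = $doc->documentElement;

// Copy the live DOMNodeList into a plain array first; the array does not
// change when nodes are replaced in the tree.
$snapshot = iterator_to_array($list->childNodes, false);

foreach ($snapshot as $child) {
    // Replace each <li> with an <item> (hypothetical name) holding the same text.
    $replacement = $doc->createElement('item', $child->textContent);
    $list->replaceChild($replacement, $child);
}

echo $doc->saveXML($list), PHP_EOL;
```

The manual nextSibling loop from the question and the snapshot approach solve the same problem; the snapshot just keeps the convenience of foreach.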
Nathan Nobbe wrote:
>
> In my opinion, the DomNodeList class is useless.
>

agreed; ever tried making a replacement node class that extends it? then you see how useless it is! [yet a vital part of the dom structure]

OT here, but I thought it might be useful for reference: I do loads of xml/dom api work and find this little iterator very useful. I've trimmed it down, but below is how *I* iterate through the dom grabbing the important values:

private function iterateDom( $nodeList )
{
  $tempNode = array();
  foreach( $nodeList as $values ) {
    if( $values->nodeType == XML_ELEMENT_NODE ) {
      $nodeName = $values->nodeName;
      if( $values->attributes ) {
        # walk the attribute map until item($i) returns null
        for( $i=0;$values->attributes->item($i);$i++ ) {
          $attributeName = $values->attributes->item($i)->nodeName;
          $attributeValue = $values->attributes->item($i)->nodeValue;
        }
      }
      # recurse into the children of this element
      $values->children = $this->iterateDom( $values->childNodes );
      $tempNode[$nodeName] = $values;
    } elseif( in_array($values->nodeType, array(XML_TEXT_NODE, XML_CDATA_SECTION_NODE)) ) {
      $nodeType = $values->nodeType;
      $nodeData = $values->data;
    } elseif( $values->nodeType === XML_PI_NODE ) {
      $DOMProcessingInstruction = array('target' => $values->target, 'data' => $values->data);
    }
    # otherwise we ignore it, as all that's left is DOMComment
  }
  # the trimmed snippet did not show a return; assumed from the recursive call above
  return $tempNode;
}

might be useful for somebody
Nathan,

Thanks for your help on this.

I actually need to do this a different way, I think. The problem is that I'm not just replacing a text node with a link element. For example, consider this paragraph:

<p>For information, please contact tjg@soe.ucsc.edu.</p>

In this case, I want "tjg@soe.ucsc.edu" to be a link, but not the rest of the paragraph. That means that the text inside the <p> element has to be split into three separate nodes: one DOMText for "For information, please contact ", one DOMElement (the <a>) for tjg@soe.ucsc.edu, and one DOMText for ".".

This seems doable with the DOM model, but complicated. I'm thinking regular expressions might be the way to go again. :\

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354
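The three-way split described above can also be done inside the DOM with DOMText::splitText(). This is a minimal sketch, not anyone's posted solution: it assumes a single, already-known address and ASCII text in front of it (strpos() returns a byte offset, while splitText() counts characters).

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<p>For information, please contact tjg@soe.ucsc.edu.</p>');
$p = $doc->documentElement;
$text = $p->firstChild;                 // the single DOMText child of <p>

$email = 'tjg@soe.ucsc.edu';
$start = strpos($text->nodeValue, $email);

// splitText() cuts the node at an offset and returns the new right-hand
// node, which the DOM inserts as the next sibling.
$emailNode = $text->splitText($start);          // "tjg@soe.ucsc.edu."
$tail = $emailNode->splitText(strlen($email));  // "."

// Wrap the middle text node in an anchor.
$anchor = $doc->createElement('a');
$anchor->setAttribute('href', 'mailto:' . $email);
$p->replaceChild($anchor, $emailNode);
$anchor->appendChild($emailNode);

echo $doc->saveXML($p), PHP_EOL;
```

Finding the address in the first place still needs a string search or a regular expression; splitText() only takes care of cutting the node once the offsets are known.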
On Wed, Sep 10, 2008 at 10:35 AM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:

> Nathan,
>
> Thanks for your help on this.
>
> I actually need to do this a different way I think though. The problem is
> that I'm not just replacing a text node with a link element. For example,
> consider this paragraph:
>
> <p>For information, please contact tjg@soe.ucsc.edu.</p>
>
> In this case, I want "tjg@soe.ucsc.edu" to be a link, but not the rest of
> the paragraph. That means that the text inside the <p> element has to be
> split into three separate nodes: one DOMText for "For information, please
> contact ", one DOMElement for tjg@soe.ucsc.edu, and one DOMText for ".".
>
> This seems doable with the DOM model, but complicated. I'm thinking
> regular expressions might be the way to go again. :\

so use some regex :D that's the only way i know of to determine if DOMText nodes contain email address(es) as substrings while retaining one's sanity...

i got it working, again by modifying the code from my original post and dropping in an additional clause which uses regex to determine if there is an email address embedded in a DOMText node. it checks to see if the whole thing is an email address first, because i think that's a little optimization, but it could be omitted. here's the output of the script now (notice i changed the input text to incorporate the new issue):

nathan@devel ~/domIterator/initialTests $ php testDom.php
IN:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b>quickshiftin@gmail.com</b></h2><p>text that we dont want to turn into a link.. quickshiftin@gmail.com</p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>

OUT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b><a href="mailto:quickshiftin@gmail.com">quickshiftin@gmail.com</a></b></h2><p>text that we dont want to turn into a link.. <a href="mailto:quickshiftin@gmail.com">quickshiftin@gmail.com</a></p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>

and here is the code; sorry for the lengthy post fellas, i just want to post all of it rather than attempting to illustrate only the segments i've changed:

<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><h2><b>quickshiftin@gmail.com</b></h2><p>text that we dont want to turn into a link.. quickshiftin@gmail.com</p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>');
echo 'IN:' . PHP_EOL . $doc->saveXML() . PHP_EOL;
findTextNodes($doc->getElementsByTagName('*'), 'convertToLinkIfNecc');
echo 'OUT: ' . PHP_EOL . $doc->saveXML() . PHP_EOL;

/**
 * run through a DOMNodeList, looking for text nodes. apply a callback to
 * all such text nodes that are encountered
 */
function findTextNodes(DOMNodeList $nodesToSearch, $callback) {
    /// both lists are live, so copy them to arrays before the callback changes the tree
    foreach(iterator_to_array($nodesToSearch, false) as $curNode) {
        if($curNode->hasChildNodes())
            foreach(iterator_to_array($curNode->childNodes, false) as $curChild)
                if($curChild instanceof DOMText)
                    call_user_func($callback, $curNode, $curChild);
    }
}

/**
 * determine if a node should be modified, by checking to see if a child is a text node
 * and the text looks like an email address.
 * call a subordinate function to convert the text node into a mailto anchor DOMElement
 */
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    if(strtolower($textContainer->nodeName) === 'a') /// per original request dont bother w/ a tags
        return;

    if(filter_var($textNode->wholeText, FILTER_VALIDATE_EMAIL) !== false) {
        convertMailtoToAnchor($textContainer, $textNode);
    } else {
        /// lets see if theres an email buried in this text node
        /// regex taken from: http://www.regular-expressions.info/email.html
        /// preg_match_all so that every address is found, not just the first one
        preg_match_all('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i', $textNode->wholeText, $matches);
        if(count($matches[0]) > 0)
            rebuildTextNodeWithEmailAddrs($textContainer, $textNode, $matches[0]);
    }
}

/**
 * given a DOMText instance w/ one or more email addresses, construct
 * a new set of nodes that contain the original text along w/ anchors for
 * all the bare email addresses
 */
function rebuildTextNodeWithEmailAddrs(DOMElement $textContainer, DOMText $textNode, array $emailAddrs) {
    $eltTokens = array();   /// text and email tokens, in document order
    $nodeOrder = array();   /// 't' (text) or 'e' (email addr) for each token
    $origText = $textNode->wholeText;
    foreach($emailAddrs as $curAddr) {
        $startPos = strpos($origText, $curAddr);            // start pos of cur addr
        $txtBuff = (string) substr($origText, 0, $startPos); // text in front of it
        if($txtBuff !== '') {
            $eltTokens[] = $txtBuff;
            $nodeOrder[] = 't';
        }
        $eltTokens[] = $curAddr;
        $nodeOrder[] = 'e';
        $origText = (string) substr($origText, $startPos + strlen($curAddr));
    }
    if($origText !== '') {  /// keep whatever follows the last address
        $eltTokens[] = $origText;
        $nodeOrder[] = 't';
    }
    /// insert the replacements in front of the orig DOMText (so they keep its position), then delete it
    foreach($eltTokens as $tokenIndex => $curToken) {
        $newText = $textContainer->insertBefore(new DOMText($curToken), $textNode);
        if($nodeOrder[$tokenIndex] == 'e')
            convertMailtoToAnchor($textContainer, $newText);
    }
    $textContainer->removeChild($textNode);
}

/**
 * modify a DOMElement that has a DOMText node as a child; create a DOMElement
 * that represents an a tag, set its value and href attribute so that it
 * acts as a 'mailto' link, and put it in place of the text node
 */
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    $textContainer->replaceChild($newNode, $textNode);
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}

essentially, what we do when encountering a DOMText that contains embedded email addresses is tokenize its contents, storing everything that's not an email address and then the email addresses; so we have an array that looks like

{ some text (skipped if empty), emailAddr1@care.com, more non-email text (skipped if empty), anotherEmail@care.com, ... }

then we insert the new children in front of the original DOMText child node, which are either DOMText instances or our souped-up DOMElement anchor tags for the email addresses, and finally remove the original node.

-nathan
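The tokenizing step described above can also be handed to preg_split(). The sketch below is an alternative, not the code from the post: it reuses the same pattern wrapped in a capture group, and the sample sentence and the example.com / example.org addresses are made up.

```php
<?php
// Same pattern as in the post, wrapped in a capture group so that
// preg_split() keeps the matched addresses in its result.
$pattern = '/\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i';
$text = 'Write to a@example.com or b@example.org for details.';

// Even indexes hold plain text (possibly empty), odd indexes hold addresses.
$tokens = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Rebuild the tokens as children of a fresh <p> element.
$doc = new DOMDocument();
$p = $doc->appendChild($doc->createElement('p'));
foreach ($tokens as $i => $token) {
    if ($i % 2 === 1) {
        $a = $doc->createElement('a');
        $a->appendChild($doc->createTextNode($token));
        $a->setAttribute('href', 'mailto:' . $token);
        $p->appendChild($a);
    } elseif ($token !== '') {
        $p->appendChild($doc->createTextNode($token));
    }
}

echo $doc->saveXML($p), PHP_EOL;
```

Because the delimiters are captured, no separate bookkeeping array is needed to remember which token is text and which is an address; the index parity carries that information.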