Using DOM textContent Property

  #11 (permalink)  
Old 09-06-2008
Nathan Nobbe
 
Posts: n/a
Default [PHP] Using DOM textContent Property

bouncing back to the list so that others may benefit from our work...

On Fri, Sep 5, 2008 at 3:09 PM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:

> Nathan,
>
> Thanks for the suggestion, but it's still not working for me. Here's my
> code:
>
> ===========
> $HTML = new DOMDocument();
> @$HTML->loadHTML($text);
> $Elements = $HTML->getElementsByTagName("*");
>
> for ($X = 0; $X < $Elements->length; $X++) {
> $Element = $Elements->item($X);
>
> if ($Element->tagName == "a") {
> # SNIP - Do something with A tags here
> } else if ($Element instanceof DOMText) {
> echo $Element->nodeValue; exit;
> }
> }
> ===========
>
> This loop never executes the instanceof part of the code. If I add:
>
> } else if ($Element instanceof DOMNode) {
> echo "foo!"; exit;
> }
>
> Then it echos "foo!" as expected. It just seems that none of the nodes in
> the tree are DOMText nodes. In fact, get_class($Element) returns
> "DOMElement" for every node in the tree.



Tim,

i got your code working with minimal effort by pulling in two of the methods
i posted and making some revisions. scope it out,
(this will produce the same output as my last post (the part after OUT:))

<?php
$text = '<html><body>Test<br><h2>quickshiftin@gmail.com<a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>';
$HTML = new DOMDocument();
$HTML->loadHTML($text);
$Elements = $HTML->getElementsByTagName("*");

for ($X = 0; $X < $Elements->length; $X++) {
    $Element = $Elements->item($X);
    if ($Element->hasChildNodes()) {
        foreach ($Element->childNodes as $curChild) {
            if ($curChild->nodeName == "a") {
                # SNIP - Do something with A tags here
            } else if ($curChild instanceof DOMText) {
                convertToLinkIfNecc($Element, $curChild);
            }
        }
    }
}
echo $HTML->saveXML() . PHP_EOL;


function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    if ((strtolower($textContainer->nodeName) != 'a') &&
        (filter_var($textNode->nodeValue, FILTER_VALIDATE_EMAIL) !== false)
    ) {
        convertMailtoToAnchor($textContainer, $textNode);
    }
}

function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    $textContainer->replaceChild($newNode, $textNode);
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}
?>

so, the problem is that getElementsByTagName("*") only ever returns element
nodes; text nodes never show up in that list, which is why get_class()
reported DOMElement for everything. this is why you need to call
hasChildNodes(), and if that is true, iterate across the childNodes property
(it's a property, not a method) to get at the DOMText children.
the list does include elements at every depth, so checking each element's
direct children like this reaches text nodes anywhere in the document. if you
start from a single node rather than from that flat list, though, you'd have
to do the same thing at each level, calling hasChildNodes() and iterating
over childNodes. i'm sure you can cook up something that will recurse
down to the leaves :)
anyway, i'm going to try and hook up a RecursiveDOMDocumentIterator that
implements RecursiveIterator so that it has the convenient foreach support.
also, i'll probably try to hook up a Filter variant of this class so that
situations like this are trivial.

stay tuned :D

-nathan
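
A minimal sketch of the "recurse down to the leaves" idea, for the case where you start from one node instead of the flat getElementsByTagName("*") list. The helper name collectTextNodes() and the sample markup are made up for illustration; they are not from the posts above.

```php
<?php
// Hypothetical helper: gather every DOMText node below $node, at any depth.
function collectTextNodes(DOMNode $node, array &$found = array()) {
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $found[] = $child;                 // a leaf we care about
        } elseif ($child->hasChildNodes()) {
            collectTextNodes($child, $found);  // descend one level further
        }
    }
    return $found;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p>deep <b>text</b></p></div></body></html>');
$body = $doc->getElementsByTagName('body')->item(0);

foreach (collectTextNodes($body) as $textNode) {
    echo $textNode->parentNode->nodeName . ': ' . $textNode->nodeValue . PHP_EOL;
}
```

Because the text nodes are collected into a plain array first, any later replacing or removing of nodes happens outside the traversal.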

  #12 (permalink)  
Old 09-09-2008
Mario Trojan
 
Posts: n/a
Default Re: [PHP] Using DOM textContent Property

Hi Nathan,

if you're already speaking of iterating children, i'd like to ask you
another question:

Basically i was trying to do the same thing as Tim, when i experienced
some difficulties iterating over DOMElement->childNodes with foreach and
manipulating strings inside the nodes or even replacing
DOMElement/DOMNode/DOMText with another node. Instead, i am currently
iterating like this:

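$child = $element->firstChild;
while ($child !== null) {
    // remember the next sibling first, because the current child may be
    // replaced or removed below
    $next_sibling = $child->nextSibling;

    // Do something with child (manipulate, replace, ...)

    // Continue iteration
    $child = $next_sibling;
}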

Is this correct, or is there any better way?

Thank you in advance!
Mario


  #13 (permalink)  
Old 09-09-2008
Nathan Nobbe
 
Posts: n/a
Default Re: [PHP] Using DOM textContent Property

On Tue, Sep 9, 2008 at 12:37 AM, Mario Trojan <mtrojan@transline.de> wrote:

> Hi Nathan,
>
> if you're already speaking of iterating children, i'd like to ask you
> another question:
>
> Basically i was trying to do the same thing as Tim, when i experienced some
> difficulties iterating over DOMElement->childNodes with foreach and
> manipulating strings inside the nodes or even replacing
> DOMElement/DOMNode/DOMText with another node. Instead, i am currently
> iterating like this:
>
> $child = $element->firstChild;
> while ($child != null) {
> $next_sibling = $child->nextSibling;
>
> // Do something with child (manipulate, replace, ...)
>
> // Continue iteration
> $child = $next_sibling;
> }
>
> Is this correct, or is there any better way?



i found this the other day on the DOMNodeList page on php.net,

essentially foreach over childNodes walks the same sibling list you are
walking by hand; neither version recurses into the children, both only
iterate over 1 sub-level of the tree (horizontally across elements at the
same level). the catch is that childNodes is a live list, so replacing or
removing nodes while foreach is running over it can make it skip nodes.
that's when it makes sense to drive the iteration yourself as you have shown,
grabbing nextSibling before you touch the current node. either way you need a
reference to the parent to perform manipulations to the dom during
iteration, see below (hope it helps :D),

-nathan

*a dot buffa at sns dot it*
29-May-2008 04:28
<http://us2.php.net/manual/en/class.domnodelist.php#83513>

I agree with drichter at muvicom dot de.

For instance, in order to delete each child node of a particular parent node,

<?php

while ($parentNode->hasChildNodes()) {
    $domNodeList = $parentNode->childNodes;
    $parentNode->removeChild($domNodeList->item(0));
}

?>

In other words, you have to update the DomNodeList on every iteration.

In my opinion, the DomNodeList class is useless.
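
One common way around the live-list behaviour is to copy the list into a plain array before changing anything. The sketch below assumes a small made-up <ul> document; iterator_to_array() is a standard PHP function and DOMNodeList is traversable, so the copy is a fixed snapshot of the children as they were before the loop.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>');
$list = $doc->getElementsByTagName('ul')->item(0);

// childNodes is live, so take a snapshot and loop over that instead.
$snapshot = iterator_to_array($list->childNodes);

foreach ($snapshot as $child) {
    // Removing (or replacing) here cannot disturb the loop,
    // because the loop runs over the array copy.
    $list->removeChild($child);
}

var_dump($list->hasChildNodes());
```

This does the same job as the while loop in the quoted comment and as caching nextSibling by hand; which one reads best is a matter of taste.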

  #14 (permalink)  
Old 09-09-2008
Nathan Rixham
 
Posts: n/a
Default Re: [PHP] Using DOM textContent Property

Nathan Nobbe wrote:
>
> In my opinion, the DomNodeList class is useless.
>


agreed; ever tried making a replacement node class that extends it? then
you see how useless it is! [yet a vital part of the dom structure]

ot here; but I thought maybe useful for reference; I do loads of xml/dom
api work and find that this little iterator is very very useful; I've
trimmed it down but you'll find below how *I* iterate through the dom
grabbing the important values..

# method of a larger class; trimmed down, so the collected values are
# only assigned here, not returned
private function iterateDom( $nodeList )
{
    foreach( $nodeList as $values ) {
        if( $values->nodeType == XML_ELEMENT_NODE ) {
            $nodeName = $values->nodeName;
            if( $values->attributes ) {
                for( $i=0; $values->attributes->item($i); $i++ ) {
                    $attributeName = $values->attributes->item($i)->nodeName;
                    $attributeValue = $values->attributes->item($i)->nodeValue;
                }
            }
            $values->children = $this->iterateDom( $values->childNodes );
            $tempNode[$nodeName] = $values;
        } elseif( in_array($values->nodeType, array(XML_TEXT_NODE, XML_CDATA_SECTION_NODE)) ) {
            $nodeType = $values->nodeType;
            $nodeData = $values->data;
        } elseif( $values->nodeType === XML_PI_NODE ) {
            $DOMProcessingInstruction = array('target' => $values->target, 'data' => $values->data);
        }
        # otherwise we ignore the node, as all that's left is DOMComment
    }
}

might be useful for somebody
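
For reference, a free-standing sketch in the same spirit that actually returns what it collects. It is not the trimmed class method above: the function name domToArray() and the array layout are assumptions chosen for illustration.

```php
<?php
// Hypothetical variant: turn an element and its subtree into a nested array.
function domToArray(DOMElement $node) {
    $result = array(
        'name'       => $node->nodeName,
        'attributes' => array(),
        'children'   => array(),
    );
    foreach ($node->attributes as $attr) {
        $result['attributes'][$attr->nodeName] = $attr->nodeValue;
    }
    foreach ($node->childNodes as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            $result['children'][] = domToArray($child);   // recurse into elements
        } elseif (in_array($child->nodeType, array(XML_TEXT_NODE, XML_CDATA_SECTION_NODE))) {
            $result['children'][] = $child->data;          // keep text as a plain string
        }
        // processing instructions and comments are skipped in this sketch
    }
    return $result;
}

$doc = new DOMDocument();
$doc->loadXML('<root id="1"><item>one</item><item>two</item></root>');
print_r(domToArray($doc->documentElement));
```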
  #15 (permalink)  
Old 09-10-2008
Tim Gustafson
 
Posts: n/a
Default RE: [PHP] Using DOM textContent Property

Nathan,

Thanks for your help on this.

I actually need to do this a different way, I think. The problem is
that I'm not just replacing a text node with a link node. For example,
consider this paragraph:

<p>For information, please contact tjg@soe.ucsc.edu.</p>

In this case, I want "tjg@soe.ucsc.edu" to be a link, but not the rest of
the paragraph. That means that the text inside the <p> element has to be
split into three separate nodes - one DOMText for "For information, please
contact ", one DOMElement node for tjg@soe.ucsc.edu, and one DOMText node
for ".".

This seems doable with the DOM model, but complicated. I'm thinking regular
expressions might be the way to go again. :\

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354
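
The three-way split described above can be sketched with DOMText::splitText(), which cuts a text node in two at a character offset and returns the new right-hand node. This is only an illustration: the address is hard-coded rather than found with a regex, and strpos()/strlen() count bytes, so text containing multibyte characters would need mb_strpos()/mb_strlen() instead.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>For information, please contact tjg@soe.ucsc.edu.</p></body></html>');

$p     = $doc->getElementsByTagName('p')->item(0);
$text  = $p->firstChild;          // the single DOMText child of <p>
$email = 'tjg@soe.ucsc.edu';      // assumption: already known; normally matched by a regex

$start     = strpos($text->nodeValue, $email);
$emailNode = $text->splitText($start);               // $text keeps the part before the address
$tail      = $emailNode->splitText(strlen($email));  // $emailNode now holds only the address

$anchor = $doc->createElement('a');
$anchor->setAttribute('href', 'mailto:' . $email);
$p->replaceChild($anchor, $emailNode);   // put <a> where the address text was
$anchor->appendChild($emailNode);        // and move the address text inside it

echo $doc->saveHTML($p) . PHP_EOL;
```

After this the <p> has three children: the leading text, the new anchor, and the trailing ".".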

  #16 (permalink)  
Old 09-10-2008
Nathan Nobbe
 
Posts: n/a
Default Re: [PHP] Using DOM textContent Property

On Wed, Sep 10, 2008 at 10:35 AM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:

> Nathan,
>
> Thanks for your help on this.
>
> I actually need to do this a different way, I think. The problem is
> that I'm not just replacing a text node with a link node. For example,
> consider this paragraph:
>
> <p>For information, please contact tjg@soe.ucsc.edu.</p>
>
> In this case, I want "tjg@soe.ucsc.edu" to be a link, but not the rest of
> the paragraph. That means that the text inside the <p> element has to be
> split into three separate nodes - one DOMText for "For information, please
> contact ", one DOMElement node for tjg@soe.ucsc.edu, and one DOMText node
> for ".".
>
> This seems doable with the DOM model, but complicated. I'm thinking regular
> expressions might be the way to go again. :\



so use some regex :D that's the only way i know of to determine whether
DOMText nodes contain email addresses as substrings while retaining one's
sanity... i got it working, again by modifying the code from my original post
and dropping in an additional clause which uses a regex to determine whether
there is an email address embedded in a DOMText node. it checks whether the
whole thing is an email address first, because i think that's a little
optimization, but that check could be omitted. here's the output of the
script now (notice i changed the input text to incorporate the new issue):

nathan@devel ~/domIterator/initialTests $ php testDom.php
IN:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b>quickshiftin@gmail.com</b></h2><p>text that we dont want to turn into a link.. quickshiftin@gmail.com</p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>

OUT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><b><a href="mailto:quickshiftin@gmail.com">quickshiftin@gmail.com</a></b></h2><p>text that we dont want to turn into a link.. <a href="mailto:quickshiftin@gmail.com">quickshiftin@gmail.com</a></p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>

and here is the code; sorry for the lengthy post fellas, i just want to post
all of it rather than just attempting to illustrate the segments i've
changed,

<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><h2><b>quickshiftin@gmail.com</b></h2><p>text that we dont want to turn into a link.. quickshiftin@gmail.com</p><a name="bar">stuff inside the link</a>Foo<p>care</p><p>yoyser</p></body></html>');
echo 'IN:' . PHP_EOL . $doc->saveXML() . PHP_EOL;
findTextNodes($doc->getElementsByTagName('*'), 'convertToLinkIfNecc');
echo 'OUT: ' . PHP_EOL . $doc->saveXML() . PHP_EOL;

/**
 * run through a DOMNodeList, looking for text nodes. apply a callback to
 * all such text nodes that are encountered
 */
function findTextNodes(DOMNodeList $nodesToSearch, $callback) {
    /// both lists are live and the callback replaces and removes nodes,
    /// so loop over array copies rather than the lists themselves
    foreach (iterator_to_array($nodesToSearch) as $curNode) {
        if (!$curNode->hasChildNodes())
            continue;
        foreach (iterator_to_array($curNode->childNodes) as $curChild)
            if ($curChild instanceof DOMText)
                call_user_func($callback, $curNode, $curChild);
    }
}

/**
 * determine if a node should be modified, by checking to see if a child is a
 * text node and the text looks like an email address.
 * call a subordinate function to convert the text node into a mailto anchor
 * DOMElement
 */
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    if (strtolower($textContainer->nodeName) === 'a') /// per original request dont bother w/ a tags
        return;
    if (filter_var($textNode->wholeText, FILTER_VALIDATE_EMAIL) !== false) {
        convertMailtoToAnchor($textContainer, $textNode);
    } else { /// lets see if theres an email buried in this text node
        /// regex taken from: http://www.regular-expressions.info/email.html
        /// preg_match_all so that every address in the node is found, not just the first
        preg_match_all('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i',
                       $textNode->wholeText, $matches);
        if (count($matches[0]) > 0)
            rebuildTextNodeWithEmailAddrs($textContainer, $textNode, $matches[0]);
    }
}

/**
 * given a DOMText instance w/ multiple email addresses, construct
 * a new set of nodes that contain the original text along w/ anchors for
 * all the bare email addresses
 */
function rebuildTextNodeWithEmailAddrs(DOMElement $textContainer, DOMText $textNode, array $emailAddrs) {
    $eltTokens = array();
    $nodeOrder = array();
    /// construct array of elements
    $origText = $textNode->wholeText;
    foreach ($emailAddrs as $curAddr) {
        $startPos = strpos($origText, $curAddr);     // start pos of cur email
        if ($startPos === false)
            continue;
        $txtBuff = substr($origText, 0, $startPos);  // buffer so we can check if its empty
        if ((string)$txtBuff !== '') {
            $eltTokens[] = $txtBuff;
            $nodeOrder[] = 't'; // indicate this token is a textNode
        }
        $eltTokens[] = $curAddr;
        $nodeOrder[] = 'e'; // indicate this token is an email addr
        $origText = substr($origText, $startPos + strlen($curAddr));
    }
    if ((string)$origText !== '') { // keep whatever text follows the last address
        $eltTokens[] = $origText;
        $nodeOrder[] = 't';
    }
    /// now that we have the tokens delete the orig DOMText and drop in the
    /// replacements where it used to sit, so sibling order is preserved
    $nextSibling = $textNode->nextSibling;
    $textContainer->removeChild($textNode);
    foreach ($eltTokens as $tokenIndex => $curToken) {
        if ($nodeOrder[$tokenIndex] == 't')
            $textContainer->insertBefore(new DOMText($curToken), $nextSibling);
        else
            convertMailtoToAnchor($textContainer, new DOMText($curToken), false, $nextSibling);
    }
}

/**
 * modify a DOMElement that has a DOMText node as a child; create a DOMElement
 * that represents an a tag, and set the value and href attribute, so that it
 * acts as a 'mailto' link
 * @param $shouldReplaceChild boolean if true, replace $textNode by the new node;
 *        otherwise insert the new node into $textContainer
 * @param $before DOMNode|null sibling to insert in front of; null appends
 */
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode,
                               $shouldReplaceChild=true, $before=null) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    if ($shouldReplaceChild)
        $textContainer->replaceChild($newNode, $textNode);
    else
        $textContainer->insertBefore($newNode, $before);
    /// set the href only after the new element is attached to the document
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}

essentially, what we do when encountering a DOMText that contains embedded
email addresses is tokenize it, storing the stretches of plain text and the
email addresses in the order they appear; so we have an array that looks like
{ some text that could be empty , emailAddr1@care.com , more non-email text
that could be empty , anotherEmail@care.com, ... }
then we remove the original DOMText child node and start adding new
children, which are either DOMText instances or our souped-up DOMElement
anchor tags for the email addresses.

-nathan
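
As an alternative way to find the candidate text nodes, DOMXPath can select them directly. This is a sketch with made-up sample markup, not part of the script above; the XPath expression picks text nodes that contain an "@" and are not already inside an <a> element, and each hit would still need the regex check before being rebuilt.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>write to me@example.com</p>'
    . '<a href="mailto:x@example.com">x@example.com</a><p>no address here</p></body></html>');

$xpath = new DOMXPath($doc);

// text nodes at any depth that mention "@" and have no <a> ancestor
$candidates = $xpath->query('//text()[contains(., "@") and not(ancestor::a)]');

foreach ($candidates as $textNode) {
    echo $textNode->parentNode->nodeName . ': ' . $textNode->nodeValue . PHP_EOL;
}
```

Each $textNode and its parentNode could then be handed to a callback such as convertToLinkIfNecc().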
