Using DOM textContent Property

#1
09-02-2008
Tim Gustafson

Hello,

I am writing a filter in PHP that takes some HTML as input, goes through
the HTML, and adjusts certain tag attributes as needed. So, for example, if
an <a> tag is missing the "title" attribute, this filter adds a title
attribute to the <a> tag.

I'm doing this all using PHP 5 and the DOM parsing library, and it's working
really well.

The one snafu I'm running into is dealing with users who will just type an
e-mail address into an HTML document without actually making it a link - so,
they'll just put foo@bar.com rather than <a
href="mailto:foo@bar.com">foo@bar.com</a>. I'd like for these incorrectly
entered e-mail addresses to magically change into real clickable links, so
I'd like my filter to be able to grab those plain text e-mail addresses and
convert them to actual clickable links.

I tried iterating through all the elements on a page using something like
this:

$Elements = $HTML->getElementsByTagName("*");

for ($X = 0; $X < $Elements->length; $X++) {
... SNIP ...
}

And then I tried looking at the textContent property of each node, but it
seems that higher-level nodes include all the text of their child nodes
(which is what the DOM documentation says it should). But there doesn't
appear to be any way to know if the textContent you've got is for just one
node, or for a whole bunch of nodes. Is there any way to figure that out, so
that I can adjust the textContent property of just the lowest-level nodes,
rather than mucking up the higher-level ones?
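
To illustrate the behaviour described above, here is a minimal sketch that compares an element's textContent with the text held only by its direct text-node children. The helper name ownText() and the sample markup are assumptions made for this example; they do not come from the original filter.

```php
<?php
// Hypothetical helper: returns only the text held directly by $node,
// ignoring text that belongs to descendant elements.
function ownText(DOMNode $node) {
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div>outer <p>inner foo@bar.com</p></div></body></html>');
$div = $doc->getElementsByTagName('div')->item(0);

echo $div->textContent, PHP_EOL; // includes the text of the nested <p>
echo ownText($div), PHP_EOL;     // only the text directly inside <div>
```

Comparing the two values is one way to tell whether the text you are looking at belongs to the node itself or to its descendants.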

Tim Gustafson
SOE Webmaster
UC Santa Cruz
tjg@soe.ucsc.edu
831-459-5354


#2
09-02-2008
Nathan Nobbe
Re: [PHP] Using DOM textContent Property

On Tue, Sep 2, 2008 at 3:18 PM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:

> And then I tried looking at the textContent property of each node, but it
> seems that higher-level nodes include all the text of their child nodes
> (which is what the DOM documentation says it should). But there doesn't
> appear to be any way to know if the textContent you've got is for just one
> node, or for a whole bunch of nodes. Is there any way to figure that out, so
> that I can adjust the textContent property of just the lowest-level nodes,
> rather than mucking up the higher-level ones?


if a node has children, then it's not a leaf, so i imagine you could continue
to traverse until you reach the leaf that actually has the address needing
magical conversion.

also, for a performance increase, if you don't find a match at a high level,
you could skip that entire sub-section of the tree; no need to go down to a
leaf if you know there's no magic needed for the current branch :)
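
The traversal and pruning idea above could be sketched roughly as follows. The function name findCandidateTextNodes(), the plain "@" test, and the sample markup are assumptions chosen for illustration; a real filter would likely need a stricter check.

```php
<?php
// Hypothetical helper: collect text nodes that contain an "@", skipping any
// subtree whose combined textContent has no "@" at all.
function findCandidateTextNodes(DOMNode $node, array &$found) {
    if (strpos($node->textContent, '@') === false) {
        return; // nothing to convert anywhere below this node
    }
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            if (strpos($child->nodeValue, '@') !== false) {
                $found[] = $child;
            }
        } elseif ($child instanceof DOMElement && strtolower($child->nodeName) !== 'a') {
            findCandidateTextNodes($child, $found); // descend, but not into existing links
        }
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>no address here</p><div><p>write to foo@bar.com</p></div></body></html>');
$found = array();
findCandidateTextNodes($doc->documentElement, $found);
foreach ($found as $textNode) {
    echo $textNode->nodeValue, PHP_EOL;
}
```

The early return relies on textContent aggregating all descendant text, which is the same property that caused the confusion in the first place.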

-nathan

#3
09-03-2008
Tim Gustafson
RE: [PHP] Using DOM textContent Property

> if a node has children, then its not a leaf, so i imagine
> you could continue to traverse until you reach the leaf
> that actually has the address needing magical conversion.


I tried that. $Element->hasChildNodes() returns true for just about
everything except tags like <br> and <img> that have no corresponding </br>
or </img>, because the content that appears between <p> and </p>, for
example, apparently counts as a child node even though it's not an HTML
tag. So, if you have:

<p>Foo!</p>

when you look at $Element->hasChildNodes() for the <p> tag, you will get
"true", and $Element->childNodes->length is equal to "1", even though "Foo!"
isn't an HTML tag. Interestingly though, when you iterate through the tree,
you get the <p> tag as one of the elements, but you never get a text-only
element that has that <p> as a parentNode. In fact, get_class($Element)
always returns DOMElement, even on the text-only nodes, which I would have
expected to be DOMText elements...but I guess not. So I'm wondering why
$Element->hasChildNodes() would return true, but iterating through the DOM
tree returns no elements that have that $Element as a parentNode.

What's more, looking at $Element->childNodes->length isn't too helpful,
because, for example:

<h2><a name="bar"></a>Foo</h2>

returns two child nodes, neither of which has "Foo" for its textContent.
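
For what it's worth, getElementsByTagName("*") matches elements only, so text nodes never appear in that list; they are reachable through each element's childNodes. The short sketch below lists the children of the <h2> from the example above. The surrounding <html><body> wrapper is an assumption added to make the snippet self-contained.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><h2><a name="bar"></a>Foo</h2></body></html>');
$h2 = $doc->getElementsByTagName('h2')->item(0);

$summary = array();
foreach ($h2->childNodes as $child) {
    // Record each child's class, node name and text.
    $summary[] = get_class($child) . ' ' . $child->nodeName . ' "' . $child->textContent . '"';
}
print_r($summary);
```

Run this way, the second child shows up as a text node holding "Foo", which suggests the text is there but was not visible through the element-only list.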

Tim Gustafson

#4
09-03-2008
Lupus Michaelis
Re: Using DOM textContent Property

Tim Gustafson wrote:

> $Elements = $HTML->getElementsByTagName("*");
>
> for ($X = 0; $X < $Elements->length; $X++) {
> ... SNIP ...
> }


Why not use XPath?
<http://fr.php.net/manual/en/class.domxpath.php>
<http://fr.php.net/manual/en/domxpath.query.php>

This query fetches all <a> elements with no title attribute or an empty
title attribute: '//a[not(@title) or @title = ""]'
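
As a rough sketch of how that query might be used, the snippet below runs it through DOMXPath and fills in the missing titles. Reusing the link text as the title is an assumption made for this example, as is the sample markup.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="/a">One</a><a href="/b" title="">Two</a><a href="/c" title="Three">Three</a></body></html>');

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[not(@title) or @title = ""]') as $link) {
    // Assumption for illustration: reuse the link text as the title.
    $link->setAttribute('title', trim($link->textContent));
}
echo $doc->saveHTML();
```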


--
Mickaël Wolff aka Lupus Michaelis
http://lupusmic.org
#5
09-03-2008
php@logi.ca
Re: [PHP] Using DOM textContent Property

Tim Gustafson wrote:
> The one snafu I'm running in to is dealing with users who will just type an
> e-mail address into an HTML document without actually making it a link - so,
> they'll just put foo@bar.com rather than <a
> href="mailto:foo@bar.com">foo@bar.com</a>. I'd like for these incorrectly
> entered e-mail addresses to magically change into real clickable links, so
> I'd like my filter to be able to grab those plain text e-mail addresses and
> convert them to actual clickable links.


I think you might be better off using regexp on the text *before*
sending it through the DOM parser. Send the user's text through a
function that searches for URLs and email addresses, creating proper
links as they're found, then use the output from that to move on to your
DOM stuff. That way, you need not create new nodes in your nodelist.
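
A minimal sketch of such a pre-pass follows. The function name linkifyEmails() and the pattern are assumptions for illustration only: the pattern is deliberately simple, is not a full address validator, and tries to skip addresses that already sit inside an attribute or directly before a closing </a>.

```php
<?php
// Hypothetical helper: wrap bare e-mail addresses in mailto links before
// the markup is handed to the DOM parser.
function linkifyEmails($html) {
    // The lookbehind skips addresses preceded by ':', '"', "'", '/' or more
    // address characters (for example inside href="mailto:...").
    // The lookahead skips text that is followed by a closing </a>.
    $pattern = '~(?<![\w.+\-@:"\'/])([\w.+\-]+@[\w\-]+(?:\.[\w\-]+)+)(?![^<]*</a>)~';
    return preg_replace($pattern, '<a href="mailto:$1">$1</a>', $html);
}

$in = '<p>Write to foo@bar.com or <a href="mailto:x@y.org">x@y.org</a>.</p>';
echo linkifyEmails($in), PHP_EOL;
```

Running a regexp over raw markup is fragile (comments, scripts, and quoted addresses in prose are all edge cases), so this is probably best treated as a starting point.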

#6
09-03-2008
php@logi.ca
Re: [PHP] Re: Using DOM textContent Property

Lupus Michaelis wrote:
> Tim Gustafson wrote:
>
>> $Elements = $HTML->getElementsByTagName("*");
>>
>> for ($X = 0; $X < $Elements->length; $X++) {
>> ... SNIP ...
>> }
>
> Why not use XPath?
> <http://fr.php.net/manual/en/class.domxpath.php>
> <http://fr.php.net/manual/en/domxpath.query.php>
>
> This query fetches all <a> elements with no title attribute or an empty
> title attribute: '//a[not(@title) or @title = ""]'


Tim's example was for finding email addresses and turning them into
links, not the other task of adding missing attributes. XPath would
be no help with the former.
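
One caveat worth noting: while XPath cannot validate an address, it can select text nodes directly, which may help narrow down the candidates. The sketch below is an illustration under that assumption; the sample markup and the simple contains() test are made up for this example.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Mail foo@bar.com today</p><p><a href="mailto:x@y.org">x@y.org</a></p></body></html>');

$xpath = new DOMXPath($doc);
// Text nodes that contain "@" and are not already inside an <a> element.
$textNodes = $xpath->query('//text()[contains(., "@") and not(ancestor::a)]');

foreach ($textNodes as $textNode) {
    echo get_class($textNode), ': ', $textNode->nodeValue, PHP_EOL;
}
```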
#7
09-03-2008
Tim Gustafson
RE: [PHP] Using DOM textContent Property

> I think you might be better off using regexp on the text
> *before* sending it through the DOM parser. Send the
> user's text through a function that searches for URLs
> and email addresses, creating proper links as they're
> found, then use the output from that to move on to your
> DOM stuff. That way, you need not create new nodes in
> your nodelist.


I think that's the way I'm going to have to go, but I was really hoping not
to. Thanks for the suggestion!

Tim Gustafson

#8
09-03-2008
Lupus Michaelis
Re: [PHP] Re: Using DOM textContent Property

php@logi.ca wrote:

> That example was for finding email addresses and turning them into
> links, not the other thing about adding missing attributes. XPATH would
> be no help with the former.


You're right, I misunderstood :-/ sorry for the noise.

#9
09-03-2008
Nathan Nobbe
Re: [PHP] Using DOM textContent Property

On Wed, Sep 3, 2008 at 10:03 AM, Tim Gustafson <tjg@soe.ucsc.edu> wrote:

> > I think you might be better off using regexp on the text
> > *before* sending it through the DOM parser. Send the
> > user's text through a function that searches for URLs
> > and email addresses, creating proper links as they're
> > found, then use the output from that to move on to your
> > DOM stuff. That way, you need not create new nodes in
> > your nodelist.

>
> I think that's the way I'm going to have to go, but I was really hoping not
> to. Thanks for the suggestion!



i think i have what you're looking for Tim, take a look at this script output:

nathan@devel ~ $ php testDom.php
IN:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2>quickshiftin@gmail.com<a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>

OUT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Test<br/><h2><a href="mailto:quickshiftin@gmail.com">quickshiftin@gmail.com</a><a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>

and here's the code using the DOM extension.
you may have to tweak it to suit your needs, but currently i think it does
the trick ;)

<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><h2>quickshiftin@gmail.com<a name="bar">stuff inside the link</a>Foo</h2><p>care</p><p>yoyser</p></body></html>');
echo 'IN:' . PHP_EOL . $doc->saveXML() . PHP_EOL;
findTextNodes($doc->getElementsByTagName('*'), 'convertToLinkIfNecc');
echo 'OUT: ' . PHP_EOL . $doc->saveXML() . PHP_EOL;

/**
 * run through a DOMNodeList, looking for text nodes. apply a callback to
 * all such text nodes that are encountered
 */
function findTextNodes(DOMNodeList $nodesToSearch, $callback) {
    foreach ($nodesToSearch as $curNode) {
        // copy the live child list first, since the callback may replace children
        foreach (iterator_to_array($curNode->childNodes) as $curChild) {
            if ($curChild instanceof DOMText) {
                #echo "TEXT NODE FOUND: " . $curChild->nodeValue . PHP_EOL;
                /// todo: allow use of hook here
                call_user_func($callback, $curNode, $curChild);
            }
        }
    }
}

/**
 * determine if a node should be modified, by checking to see if a child is a
 * text node and the text looks like an email address.
 * note: filter_var() only matches when the whole text node is an address.
 * call a subordinate function to convert the text node into a mailto anchor
 * DOMElement
 */
function convertToLinkIfNecc(DOMElement $textContainer, DOMText $textNode) {
    if ((strtolower($textContainer->nodeName) != 'a') &&
        (filter_var($textNode->nodeValue, FILTER_VALIDATE_EMAIL) !== false)) {
        convertMailtoToAnchor($textContainer, $textNode);
    }
}

/**
 * modify a DOMElement that has a DOMText node as a child; create a DOMElement
 * that represents an <a> tag, and set the value and href attribute, so that
 * it acts as a 'mailto' link
 */
function convertMailtoToAnchor(DOMElement $textContainer, DOMText $textNode) {
    $newNode = new DOMElement('a', $textNode->nodeValue);
    $textContainer->replaceChild($newNode, $textNode);
    $newNode->setAttribute('href', "mailto:{$textNode->nodeValue}");
}
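
The script above converts a text node only when its entire content is an address. As a possible extension, here is a sketch that handles an address embedded in a longer sentence by splitting the text node. The helper name linkifyTextNode() and the simple pattern are assumptions for illustration, not part of the script above.

```php
<?php
// Hypothetical helper: replace every address inside one text node with a
// mailto link, keeping the surrounding text as separate text nodes.
function linkifyTextNode(DOMText $textNode) {
    $doc = $textNode->ownerDocument;
    $parent = $textNode->parentNode;
    // Simple illustrative pattern (an assumption), not a full address validator.
    $parts = preg_split('/([\w.+\-]+@[\w\-]+(?:\.[\w\-]+)+)/', $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 3) {
        return; // no address found
    }
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) { // odd indexes hold the captured addresses
            $link = $doc->createElement('a');
            $link->setAttribute('href', 'mailto:' . $part);
            $link->appendChild($doc->createTextNode($part));
            $parent->insertBefore($link, $textNode);
        } elseif ($part !== '') {
            $parent->insertBefore($doc->createTextNode($part), $textNode);
        }
    }
    $parent->removeChild($textNode); // the original text node is now redundant
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Mail foo@bar.com today</p></body></html>');
$p = $doc->getElementsByTagName('p')->item(0);
linkifyTextNode($p->firstChild);
echo $doc->saveHTML($p), PHP_EOL;
```

Building the link with createElement() and createTextNode() keeps the new nodes attached to the same document as the text they replace.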


-nathan

#10
09-05-2008
Tim Gustafson
RE: [PHP] Using DOM textContent Property

Nathan,

Thanks for the suggestion, but it's still not working for me. Here's my
code:

===========
$HTML = new DOMDocument();
@$HTML->loadHTML($text);
$Elements = $HTML->getElementsByTagName("*");

for ($X = 0; $X < $Elements->length; $X++) {
    $Element = $Elements->item($X);

    if ($Element->tagName == "a") {
        # SNIP - Do something with A tags here
    } else if ($Element instanceof DOMText) {
        echo $Element->nodeValue; exit;
    }
}
===========

This loop never executes the instanceof part of the code. If I add:

    } else if ($Element instanceof DOMNode) {
        echo "foo!"; exit;
    }

Then it echoes "foo!" as expected. It just seems that none of the nodes in
the tree are DOMText nodes. In fact, get_class($Element) returns
"DOMElement" for every node in the tree.

Tim Gustafson
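
A likely explanation for the behaviour described in this post is that getElementsByTagName("*") returns elements only, so the instanceof DOMText branch can never match an item of that list. The sketch below adapts the loop to inspect each element's childNodes instead. The sample value of $text and the $Found array are assumptions added for illustration.

```php
<?php
$text = '<p>Foo!</p><h2><a name="bar"></a>Contact foo@bar.com</h2>'; // sample input (assumed)

$HTML = new DOMDocument();
@$HTML->loadHTML($text);
$Elements = $HTML->getElementsByTagName("*");

$Found = array();
for ($X = 0; $X < $Elements->length; $X++) {
    $Element = $Elements->item($X);

    // The list holds elements only; text nodes are reached as their children.
    foreach ($Element->childNodes as $Child) {
        if ($Child instanceof DOMText) {
            $Found[] = $Element->tagName . ': ' . $Child->nodeValue;
        }
    }
}
print_r($Found);
```

Each entry pairs a text node with the element that directly contains it, which is the information needed to replace just that piece of text.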
