using PHP to parse through HTML

#1
02-19-2005
laredotornado@gmail.com
using PHP to parse through HTML

Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things? Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.

Thanks for any help, - Dave

#2
02-19-2005
Andy Hassall
Re: using PHP to parse through HTML

On 19 Feb 2005 11:49:24 -0800, laredotornado@gmail.com wrote:

>Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
>attributes of anchor tags and SRC attributes of IMG tags. Does anyone
>know of any libraries/freeware to help parse through HTML to find these
>things. Right now, I'm doing a lot of "strstr" calls, but there is
>probably a better way to do what I need.


I haven't used it myself, but I've seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.

--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

#3
02-20-2005
Dave Patton
Re: using PHP to parse through HTML

laredotornado@gmail.com wrote in
news:1108842564.846225.81750@c13g2000cwb.googlegroups.com:

> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
> know of any libraries/freeware to help parse through HTML to find these
> things. Right now, I'm doing a lot of "strstr" calls, but there is
> probably a better way to do what I need.


Take a look at preg_split()
http://www.php.net/manual/en/function.preg-split.php
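
A minimal sketch of how preg_split() could be applied here, assuming the markup is simple enough that no attribute value contains a ">" character. The function name split_tags() and the sample string are made up for illustration. The capture group together with PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result instead of discarding them.

```php
<?php
// Hypothetical helper: split markup into alternating text and tag chunks.
function split_tags($html)
{
    return preg_split(
        '/(<[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$html  = 'See <a href="page.html">this page</a>.';
$parts = split_tags($html);

// Keep only the chunks that are tags (they start with "<").
$tags = array();
foreach ($parts as $part) {
    if (substr($part, 0, 1) == '<') {
        $tags[] = $part;
    }
}

print_r($tags);
```

Each tag chunk still has to be picked apart to get at HREF or SRC, so this is only a first step.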

--
Dave Patton
Canadian Coordinator, Degree Confluence Project
http://www.confluence.org/
My website: http://members.shaw.ca/davepatton/

#4
02-20-2005
laredotornado@zipmail.com
Re: using PHP to parse through HTML

Too bad none of the examples work. I untarred/uncompressed the file,
copied the folder to a public html directory, and then every time I try
to launch an example, I get errors like:

Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36

Fatal error: main(): Failed opening required
'XML/HTMLSax/XML_HTMLSax_States.php'
(include_path='.:/usr/local/lib/php') in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36


Andy Hassall wrote:
> On 19 Feb 2005 11:49:24 -0800, laredotornado@gmail.com wrote:
>
> >Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
> >attributes of anchor tags and SRC attributes of IMG tags. Does anyone
> >know of any libraries/freeware to help parse through HTML to find these
> >things. Right now, I'm doing a lot of "strstr" calls, but there is
> >probably a better way to do what I need.
>
> Haven't used it myself, but seen mentions of:
>
> http://pear.php.net/package/XML_HTMLSax
>
> ... which looks possibly suitable from the description on the page.
>
> --
> Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
> <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

#5
02-20-2005
steve
Re: using PHP to parse through HTML

"laredotornado" wrote:
> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
> know of any libraries/freeware to help parse through HTML to find these
> things. Right now, I'm doing a lot of "strstr" calls, but there is
> probably a better way to do what I need.
>
> Thanks for any help, - Dave


strstr is the LAST thing you want to do in this case! I don't know
of any libraries, but you can use preg_match to grab the tags that you
need.

If you are into PHP, learning preg_match and regular expressions in
general is almost a must. It will substantially increase the power
of your code.
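
To make that concrete, here is a rough sketch using preg_match_all (the variant of preg_match that returns every match). The helper name extract_attr() and the sample markup are invented for illustration, and the tag and attribute names are assumed to be plain words. The pattern tolerates upper or lower case, spaces around the "=", and double-quoted, single-quoted or unquoted values. It is not a full HTML parser; for example, an attribute value that itself contains ">" will confuse it.

```php
<?php
// Hypothetical helper: collect one attribute's values from every
// occurrence of one tag, e.g. HREF of <a> or SRC of <img>.
function extract_attr($html, $tag, $attr)
{
    $pattern = '/<' . $tag . '\b[^>]*?\s' . $attr
             . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';

    $values = array();
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Only one of the three alternatives matches, and it is
            // always the last group present in the match array.
            $values[] = $m[count($m) - 1];
        }
    }
    return $values;
}

$html = '<A HREF="one.html">1</A> <a class=x href = \'two.html\'>2</a> '
      . '<img alt="" src=pic.gif>';

$links  = extract_attr($html, 'a', 'href');   // one.html, two.html
$images = extract_attr($html, 'img', 'src');  // pic.gif

print_r($links);
print_r($images);
```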

steve

--
Posted using the http://www.dbforumz.com interface, at author's request
Articles individually checked for conformance to usenet standards
Topic URL: http://www.dbforumz.com/PHP-parse-HT...ict199658.html
Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbforumz.com/eform.php?p=677948

#6
02-20-2005
Simon
Re: using PHP to parse through HTML

>
> strstr is the LAST thing you want to do in this case! I don't know
> of libraries, but you can use preg_match to grab the tags that you
> need.
>
> If you are into php, learning preg_match and regular expressions in
> general is almost a must.. it will substantially increase the power
> of your code.
>
> steve
>
> --



Sorry, can you elaborate on your first statement?
Are you saying that "strstr" is slower than "preg_match"? What about
"strpos"?

The reason I ask is, if it was faster to look for a character in a string
using "preg_match", then why wouldn't strpos/strstr use it themselves?

I need to look for 2 characters in some data (case sensitive); what would
be the fastest way of finding the first occurrence?

$first = strpos( $data, $charA );
$sec = strpos( $data, $charB );
// check for ===false;
return ($first<$sec)?$first:$sec;

// would there be a faster way to achieve the above using "preg_match"?

Simon


#7
02-20-2005
Andy Hassall
Re: using PHP to parse through HTML

On 19 Feb 2005 20:22:22 -0800, laredotornado@zipmail.com wrote:

>Andy Hassall wrote:
>> On 19 Feb 2005 11:49:24 -0800, laredotornado@gmail.com wrote:
>>
>> >Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
>> >attributes of anchor tags and SRC attributes of IMG tags. Does anyone
>> >know of any libraries/freeware to help parse through HTML to find these
>> >things. Right now, I'm doing a lot of "strstr" calls, but there is
>> >probably a better way to do what I need.
>>
>> Haven't used it myself, but seen mentions of:
>>
>> http://pear.php.net/package/XML_HTMLSax
>>
>> ... which looks possibly suitable from the description on the page.
>
>Too bad none of the examples work. I untarred/uncompressed the file,
>copied the folder to a public html directory


That's not how you're supposed to install PEAR modules; here's an example of how:

root@server:~# pear install http://pear.php.net/get/XML_HTMLSax-2.1.2.tgz
downloading XML_HTMLSax-2.1.2.tgz ...
Starting to download XML_HTMLSax-2.1.2.tgz (16,099 bytes)
.......done: 16,099 bytes
install ok: XML_HTMLSax 2.1.2

You could probably get away with unpacking to a public_html directory, but
you'd need to fiddle with your include_path, or else you get errors like:

>Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
>stream: No such file or directory in
>/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36


The examples work OK for me after installing through pear as above.
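
If you do unpack by hand, a sketch of the include_path change might look like the following. The directory is taken from the error message quoted above and is an assumption about where the XML/ folder ended up; adjust it to wherever the package was actually unpacked. ini_set() is used here; set_include_path() does the same job on PHP 4.3.0 and later.

```php
<?php
// Assumption: this directory contains the unpacked XML/ folder,
// i.e. XML/XML_HTMLSax.php and XML/HTMLSax/ sit underneath it.
$pearDir = '/usr/local/apache/htdocs/temp';

// Put that directory in front of the existing include_path.
$oldPath = ini_get('include_path');
ini_set('include_path', $pearDir . PATH_SEPARATOR . $oldPath);

echo ini_get('include_path'), "\n";

// With the path in place, the library's own relative includes such as
// 'XML/HTMLSax/XML_HTMLSax_States.php' can be found, and the script
// would then load the package with:
// require_once 'XML/XML_HTMLSax.php';
```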

--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

#8
02-20-2005
petrovitch
Re: using PHP to parse through HTML

I posted an example at: http://hotscripts.com/Detailed/44390.html

#9
02-21-2005
steve
Re: Re: using PHP to parse through HTML

"Simon" wrote:
>>
>> strstr is the LAST thing you want to do in this case! I don't know
>> of libraries, but you can use preg_match to grab the tags that you
>> need.
>>
>> If you are into php, learning preg_match and regular expressions in
>> general is almost a must.. it will substantially increase the power
>> of your code.
>>
>> steve
>>
>> --
>
>Sorry, can you elaborate on your first statement?
>Are you saying that "strstr" is slower than "preg_match"? What about
>"strpos"?
>
>The reason I ask is, if it was faster to look for a character in a string
>using "preg_match", then why wouldn't strpos/strstr use it themselves?
>
>I need to look for 2 characters in some data (case sensitive); what would
>be the fastest way of finding the first occurrence?
>
>$first = strpos( $data, $charA );
>$sec = strpos( $data, $charB );
>// check for ===false;
>return ($first<$sec)?$first:$sec;
>
>// would there be a faster way to achieve the above using "preg_match"?
>
>Simon


Simon, in 99% of cases speed does not matter, i.e. you can achieve
good speed either way; it is not something I have ever had to worry
about in my code. The point is that with preg_match and regex, you
can achieve in one statement what would take 10 statements without
regex. If you ever parse free text in any shape or form, regex is the
way to go. Your example above is simple, and if that is all you need,
fine, but as soon as the text has spurious spaces, other characters
that may or may not be present, and a whole bunch of other conditions
outside your control, you need a much more powerful engine, and that
is regex.
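
For the two-character search itself, here is a sketch of both styles side by side. The function names and the sample string are made up for illustration, and no speed claim is intended. The regex version builds a character class from the two characters and asks preg_match for the match offset (PREG_OFFSET_CAPTURE needs PHP 4.3.0 or later); the plain version uses strcspn(), which counts how many leading characters are neither of the two.

```php
<?php
// Hypothetical helper, regex style: position of the first occurrence
// of either character, or false if neither occurs.
function first_of_two($data, $charA, $charB)
{
    $pattern = '/[' . preg_quote($charA . $charB, '/') . ']/';
    if (preg_match($pattern, $data, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1]; // offset of the whole match
    }
    return false;
}

// Hypothetical helper, plain string style: same result via strcspn().
function first_of_two_plain($data, $charA, $charB)
{
    $pos = strcspn($data, $charA . $charB);
    return ($pos < strlen($data)) ? $pos : false;
}

$data = 'key=value;next';
var_dump(first_of_two($data, ';', '='));       // int(3)
var_dump(first_of_two_plain($data, ';', '=')); // int(3)
var_dump(first_of_two('abc', ';', '='));       // bool(false)
```

Both return false when neither character is present, which is the case the "check for ===false" comment in the original snippet leaves open.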
