Parsing a website - strategy

Posted in the PHP Language forum, part of the PHP Programming Forums category.


#1  05-06-2006
aka_eu

Hi,

I recently got a project to collect information from different websites
and store it in a database. I was wondering what the best technique is
for implementing something like that.

How should I open the pages from the other websites: with fopen,
through a socket, or with curl?

After that, what is the fastest way to parse a whole page for
information, and of course to parse it several times to get different
pieces of information from the same page?

Regards

#2  05-06-2006
tihu
Re: Parsing a website - strategy

aka_eu wrote:
> Hi,
>
> I recently got a project to collect information from different websites
> and store it in a database. I was wondering what the best technique is
> for implementing something like that.
>
> How should I open the pages from the other websites: with fopen,
> through a socket, or with curl?


Either way works; it depends on which website you are accessing and what
you need to do. If your answer to any of the following questions is yes,
then use curl:
Will your script need to auto-submit any forms to these websites? Do
any of the sites use cookies? If a page is inaccessible, do you need to
know why?

file_get_contents is the easiest way, but it tells you very little when
the webpage is inaccessible, and on its own (without a stream context)
it only performs simple GET requests.
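
To make the trade-off concrete, here is a minimal sketch of the
file_get_contents route. The function name fetch_simple and the
10-second timeout are assumptions for illustration, not something from
this thread, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the thread): fetch a page with
// file_get_contents and return null on any failure.
function fetch_simple(string $url): ?string
{
    // A stream context lets us set a timeout; 10 seconds is an arbitrary choice.
    $context = stream_context_create(['http' => ['timeout' => 10]]);

    // @ suppresses the warning. On failure we only get false back,
    // with no detail about what went wrong.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

A call such as fetch_simple('http://example.com/') either returns the
page body or null, which is exactly the "not informative" behaviour
described above: you cannot tell a DNS failure from a timeout or a 404.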

Curl has comprehensive error reporting, and you can post forms by
setting CURLOPT_POST and CURLOPT_POSTFIELDS with curl_setopt. It can
also deal with cookie-based websites, pretend to be a browser or bot,
and has plenty of other useful features.
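
The curl options mentioned above could be combined roughly like this.
It is a sketch only: the function name fetch_page, the user agent
string, the timeouts and the shape of the returned array are all
assumptions for illustration, and it needs the curl extension and
PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the thread): fetch a URL with curl,
// optionally POSTing form fields, and report why a request failed.
function fetch_page(string $url, array $postFields = [], ?string $cookieFile = null): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // seconds, arbitrary choice
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // seconds, arbitrary choice
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleBot/1.0)');

    if ($postFields !== []) {
        // Submit a form as application/x-www-form-urlencoded.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }

    if ($cookieFile !== null) {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // cookies are saved here
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // and read back from here
    }

    $body = curl_exec($ch);

    // The handle is released when $ch goes out of scope at the end of the function.
    return [
        'body'   => $body === false ? null : $body,
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => curl_error($ch),  // empty string when the transfer succeeded
    ];
}
```

With something like this, a failed request leaves 'body' as null and
puts curl's own message in 'error', and an HTTP error page still shows
up with its status code in 'status'.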

You could do all this yourself using sockets, but it's already been done
with curl and it would be sooo tedious.

>
> After that, what is the fastest way to parse a whole page for
> information, and of course to parse it several times to get different
> pieces of information from the same page?


Best use DOM.

I've seen some people use regular expressions to do it, but the regexes
soon end up being a nightmare to maintain when the website inevitably
changes. If you're only looking for a few pieces of information from a
few sites, though, preg_match could work.
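
For the "few pieces of information" case, a preg_match approach might
look like the following. The helper name extract_title and the choice
of the page title as the target are assumptions made for the example.

```php
<?php
// Hypothetical helper: pull the <title> text out of an HTML string with a regex.
function extract_title(string $html): ?string
{
    // s: let . match newlines; i: match the tag name case-insensitively.
    if (preg_match('~<title[^>]*>(.*?)</title>~si', $html, $m) === 1) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}
```

This is fine for one well-known tag, but every extra field needs
another pattern tied to the site's current markup, which is where the
maintenance pain mentioned above comes from.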

With DOM you parse the page into a DOM tree using
DOMDocument->loadHTML(), then use the DOM methods and XPath to get what
you want. Especially XPath...
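
As a rough illustration of the loadHTML plus XPath approach, the sketch
below collects every link on a page. The function name extract_links
and the XPath expression are assumptions chosen for the example.

```php
<?php
// Hypothetical helper: return href => link text for every <a href> in the page.
function extract_links(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}
```

To get several different pieces of information from the same page, you
would parse once and run several query() calls against the same
DOMXPath object rather than re-parsing the HTML each time.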

I don't know if it's the fastest to execute at runtime, but if anyone
knows a more flexible, useful way of data mining, I need to know.

The DOM method getElementById doesn't work unless the page has a proper
doctype (so it fails on most webpages).
http://blog.bitflux.ch/wiki/GetElementById_Pitfalls explains the
problem and the solutions; there's a straightforward example of using
XPath as well.
http://www.zvon.org/xxl/XPathTutoria.../examples.html is a good
XPath tutorial; the site is ugly, but there are plenty of good examples
to learn from and an interactive lab.
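
One way around the getElementById problem is to match on the id
attribute with XPath instead. This is a sketch under that assumption;
the helper name find_by_id is hypothetical and is not taken from the
linked wiki page.

```php
<?php
// Hypothetical helper: find an element by its id attribute via XPath,
// which does not depend on the document declaring id as an ID type.
function find_by_id(DOMDocument $doc, string $id): ?DOMElement
{
    $xpath = new DOMXPath($doc);

    // Simple quoting: this version assumes $id contains no double quote.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false) {
        return null;  // the expression was invalid
    }

    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node : null;
}
```

Given a document loaded with loadHTML, find_by_id($doc, 'price') would
return the first element whose id attribute is "price", or null if
there is none.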

Seeya

Tim
