Parsing a website - strategy (PHP Language forum, PHP Programming Forums)
Hi,

recently I got a project to get info from different websites and to put the info into a DB. Now, I was wondering what the best technique is to implement something like that.

How should I open the pages from other websites: with fopen, through a socket, or with curl?

After that, what is the fastest way to parse a whole page for info, and of course to parse it several times to get different info from the same page?

Regards
aka_eu wrote:
> Hi,
>
> recently I got a project to get info from different websites and to put
> the info into a DB.
> Now, I was wondering what is the best technique to implement something
> like that.
>
> How I should open the pages from other websites. With fopen, through a
> socket or with a curl.

Either way works; it depends on which website you are accessing and what you need to do. If your answer to any of these questions is yes, then use curl:

Will your script need to auto-submit any forms to these websites?
Do any of the sites use cookies?
If a page is inaccessible, do you need to know why?

file_get_contents is the easiest way, but it is not informative if the webpage was inaccessible, and without a stream context it only performs simple GET requests. Curl has comprehensive error reporting, you can post forms using curl_setopt with CURLOPT_POST and CURLOPT_POSTFIELDS, it can deal with cookie-based websites and pretend it's a browser/bot, and it has plenty of other useful stuff. You could do all this yourself using sockets, but it's already been done with curl and it's sooo tedious.

> After that what is the faster way to parse a whole page for info.. and
> of course to parse it several times to get different info from the same
> page.

Best use DOM. I've seen some people use regular expressions to do it, but the regexes soon end up being a nightmare to maintain or change when the website inevitably changes. If you're only looking for a few pieces of information from a few sites, though, preg_match could work. With DOM you parse the page into a DOM tree using DOMDocument->loadHTML(), then use the DOM methods and XPath to get what you want. Especially XPath... I don't know if it's the fastest to execute at runtime, but if anyone knows a more flexible, useful way of data mining, I need to know.
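
As a rough illustration of the curl route described above, here is a minimal sketch. The function name fetch_page, the user agent string, and the timeout values are assumptions made up for this example, not something from the thread; adjust them to the sites you are actually fetching.

```php
<?php
// Hypothetical helper (not from the thread): fetch a page with curl and
// report why the request failed, if it did.
function fetch_page(string $url, array $postFields = []): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // assumed timeouts, in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleBot/0.1'); // assumed user agent string
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');        // empty name enables in-memory cookie handling

    if ($postFields !== []) {
        // Submit a form: CURLOPT_POST switches to POST, CURLOPT_POSTFIELDS carries the data.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }

    $body = curl_exec($ch);

    return [
        'body'   => $body === false ? null : $body,          // null when the transfer failed
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),   // HTTP status code, 0 if none was received
        'error'  => curl_error($ch),                         // empty string when there was no error
    ];
}
```

A caller would check `$result['body'] === null` and log `$result['error']` to find out why a page was inaccessible, which is the information file_get_contents does not give you as directly.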
The DOM method getElementById doesn't work unless the page has a proper doctype (a problem on most webpages). http://blog.bitflux.ch/wiki/GetElementById_Pitfalls explains the problem and the solutions, and there's a straightforward example of using XPath as well. http://www.zvon.org/xxl/XPathTutoria.../examples.html is a good XPath tutorial; the site is ugly, but there are plenty of good examples to learn from and an interactive lab.

Seeya
Tim
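
To make the DOM and XPath suggestion concrete, here is a small sketch that parses a page once and then runs several queries against the same tree, using an XPath id lookup in place of getElementById. The sample markup, the function name extract_page_info, and the element names in the queries are all invented for illustration; a real site needs its own expressions.

```php
<?php
// Hypothetical helper: parse the HTML once, then pull out several pieces of info.
function extract_page_info(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Look up an element by its id attribute with XPath instead of getElementById().
    $priceNode = $xpath->query('//*[@id="price"]')->item(0);
    $price = $priceNode !== null ? trim($priceNode->textContent) : null;

    // A second query against the same parsed tree: collect link targets and labels.
    $links = [];
    foreach ($xpath->query('//ul[@class="links"]//a') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }

    return ['price' => $price, 'links' => $links];
}

// Made-up markup standing in for a fetched page.
$sampleHtml = '<html><body>'
    . '<div id="price">19.99</div>'
    . '<ul class="links"><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
    . '</body></html>';

$info = extract_page_info($sampleHtml);
```

Because the page is parsed only once, adding another piece of data to extract usually means adding one more query rather than another pass over the raw HTML.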