web harvesting

This is a discussion on web harvesting within the alt.comp.lang.php forums, part of the PHP Programming Forums category.


#1  06-23-2006  McHenry
web harvesting

I have a simple task: query a number of pages, read data from them, and save
it into a database.
Each page has repeating data, similar to a listing of stock quotes where each
page lists 100 stocks.

a) I can query the web and store the page in a variable
b) I can update the database with the data

I cannot work out the best way to process the variable holding the web page
to extract the required data. At present it is simply one large string in a
variable.

Any pointers would be greatly appreciated...


#2  06-23-2006  Arjen
Re: web harvesting

McHenry wrote:
> I have a simple task to query a number of pages and read data then save it
> into a database.
> Each page has repeating data similar to a listing of stock quotes where each
> pages lists 100 stocks etc.
>
> a) I can query the web and store the page in a variable
> b) I can update the database with the data
>
> I cannot work out the best way to process the variable of the web page to
> extract the required data, presently it is simply one large string in a
> variable.
>
> Any pointers would be greatly appreciated...


Nothing wrong with a large string. Use preg_match or similar to filter out
the data.

Can you give an example of what page you retrieve and what data you want out
of it?

arjen

#3  06-24-2006  McHenry
Re: web harvesting


The data is somewhat variable; however, the following structure is repeated
for each record on the HTML page.

<div class="Overview">

<div class="header">

***SNIP***

</div>

<div class="content">

***SNIP***

</div>

<div class="break"></div>

</div>



As this structure is repeated over and over for each record I understand I
should use preg_match_all to extract all matches and place them in an array.
I would like to:

a) match the entire pattern and have it stored in array[0][0]

b) match the header component as a parenthesised subpattern and have it
stored in array[1][0]

c) match the content component as a parenthesised subpattern and have it
stored in array[2][0]
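
A minimal sketch of the array layout described in a) to c), assuming a simplified record with no nested divs. The sample HTML, the variable names and the pattern itself are made up for illustration and are not the real page:

```php
<?php
// Hypothetical sample: two simplified records, no nested divs.
$html = <<<HTML
<div class="Overview">
<div class="header">ACME</div>
<div class="content">12.50</div>
<div class="break"></div>
</div>
<div class="Overview">
<div class="header">INIT</div>
<div class="content">7.25</div>
<div class="break"></div>
</div>
HTML;

// Group 1 = header text, group 2 = content text; "s" lets . match newlines.
$pattern = '%<div class="Overview">.*?<div class="header">(.*?)</div>'
         . '.*?<div class="content">(.*?)</div>.*?<div class="break"></div>\s*</div>%s';

$count = preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);

// PREG_PATTERN_ORDER: $matches[group][record]
echo $count, "\n";          // number of records found
echo $matches[1][0], "\n";  // header of the first record
echo $matches[2][0], "\n";  // content of the first record
echo $matches[1][1], "\n";  // header of the second record
```

With PREG_PATTERN_ORDER the first index is the group (0 is the whole match) and the second index is the record, so $matches[1][0] and $matches[2][0] belong to the same record.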

Thanks once again...


#4  06-24-2006  McHenry
Re: web harvesting


I have formulated the following regex (my first regex ever), and it seems to
work when I test it using http://www.regexlib.com/RETester.aspx. However,
when I try to use it in my PHP code it fails:

<div class=\"Overview\">((?s).*)(<div class=\"header\">((?s).*)</div>)((?s).*)(<div class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">


When I try to run the code I receive the following error:
PHP Warning: Unknown modifier '(' in /var/www/html/research/processweb.php on line 98

$pattern="<div class=\"Overview\">((?s).*)(<div class=\"header\">((?s).*)</div>)((?s).*)(<div class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">";
if (preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER)) {
    echo $matches[0][0]."\n";
    echo $matches[1][0]."\n";
}


#5  06-24-2006  Rik
Re: web harvesting

McHenry wrote:
> I have formulated the follow regex... (first regex ever) and it seems
> to work when I test it using http://www.regexlib.com/RETester.aspx
> however when i try to implement it into my php code it fails:
>
> <div class=\"Overview\">((?s).*)(<div
> class=\"header\">((?s).*)</div>)((?s).*)(<div
> class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">
>
>
> When I try to run the code I receive the following error:
> PHP Warning: Unknown modifier '(' in
> /var/www/html/research/processweb.php on line 98


The first character is taken as delimiter, so your regex stops after
\"Overview\">, and then treats everything as a modifier.
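
To illustrate the delimiter point: the preg_* functions expect the expression wrapped in delimiters, and when the string starts with < the next > closes the pattern. A small sketch with a made-up input string (the @ is only there to keep the warning out of the output):

```php
<?php
$html = '<div class="Overview">hello</div>';

// No explicit delimiters: "<" opens the pattern and the first ">" closes it,
// so everything after that ">" is read as modifiers and compilation fails.
$bad = @preg_match('<div class="Overview">(.*)</div>', $html, $m);
var_dump($bad); // false: the pattern failed to compile

// With % as the delimiter, the whole expression is the pattern.
$ok = preg_match('%<div class="Overview">(.*?)</div>%s', $html, $m);
echo $ok, ' ', $m[1], "\n";
```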
I assume your '***SNIP***'s are the actual content you'd like to obtain?

The Society for Understandable Regular Expressions brings you:
$pattern = '%<div[^>]*?class="overview"[^>]*?> #start of overview
.*? #allow random content between starting overview and header
<div[^>]*?class="header"[^>]*?> #start of header
(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header
</div> #end of header
.*? #once again allow random content
<div[^>]*?class="content"[^>]*?> #start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content
</div> #end of content
.*? #I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div> #check for break
.*? # some random content
</div> #end of overview
%six';
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

Some items explained:
% is chosen as the delimiter of the regex here. Usually / is chosen, but as
this is HTML, every / would have to be escaped. Choosing another delimiter
saves work.
[^>]*? allows a div to have other attributes besides the class name, so it
will still be picked up.
(?:<div[^>]*?>.*?</div>.*?)* allows divs to be nested in the header/content
div, so the whole div is still matched, not just up to the point where the
first child div closes. (?: means it is a non-capturing group: we won't see
it back in $matches, and we don't need to, as it is already contained in the
named match.
Modifiers:
s = . matches \n
i = case-insensitive
x = we can use line breaks & comments in our regex to keep it clear
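
A minimal sketch of how the named groups come back with PREG_SET_ORDER. It uses a shortened version of the pattern (the nested-div and break handling are left out) and made-up sample data, so treat it as an illustration rather than a drop-in replacement:

```php
<?php
// Hypothetical sample data: mixed-case tags and an extra attribute on purpose.
$content = <<<HTML
<DIV class="overview">
  <div id="h1" class="header">ACME Corp</div>
  <div class="content">Price: 12.50</div>
</DIV>
<div class="overview">
  <div class="header">Initech</div>
  <div class="content">Price: 7.25</div>
</div>
HTML;

$pattern = '%<div[^>]*?class="overview"[^>]*?>  # start of overview
    .*?                                         # anything up to the header
    <div[^>]*?class="header"[^>]*?>             # start of header
    (?P<header>.*?)                             # named match: header
    </div>                                      # end of header
    .*?                                         # anything up to the content
    <div[^>]*?class="content"[^>]*?>            # start of content
    (?P<content>.*?)                            # named match: content
    </div>                                      # end of content
    %six';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

// PREG_SET_ORDER: one entry per record, each holding the named groups.
foreach ($matches as $record) {
    echo $record['header'], ' => ', $record['content'], "\n";
}
```

The i modifier is what lets the upper-case DIV match, and [^>]*? is what skips over the extra id attribute in the first header.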

Grtz,
--
Rik Wasmus


#6  06-24-2006  McHenry
Re: web harvesting

WOW Rik... it's a little different from my attempt :)

Thank you very much as this would have taken me a few... YEARS !

Not to question but I am trying to understand what you have provided and I
am unable to get the pattern to work here for learning purposes:
http://www.regexlib.com/RETester.aspx

Should I not rely on this tool or am I missing something ?

Thanks once again...


#7  06-24-2006  Rik
Re: web harvesting

McHenry wrote:
> Not to question but I am trying to understand what you have provided
> and I am unable to get the pattern to work here for learning purposes:
> http://www.regexlib.com/RETester.aspx
>

.NET regex is slightly different from PHP's Perl-compatible regex. Remove
the comments, delimiters, modifiers, and ?P<name> and usually it's OK.

My favourite tool for deciphering other people's regexes is Regex Workbench,
which also isn't fully compatible, but mostly gets the job done. It
interprets this pattern as follows:

<div
Any character not in ">"
* (zero or more times) (non-greedy)
class="overview"
Any character not in ">"
* (zero or more times) (non-greedy)
>

. (any character)
* (zero or more times) (non-greedy)
<div
Any character not in ">"
* (zero or more times) (non-greedy)
class="header"
Any character not in ">"
* (zero or more times) (non-greedy)
>

Capture
. (any character)
* (zero or more times) (non-greedy)
Non-capturing Group
<div
Any character not in ">"
* (zero or more times) (non-greedy)
>

. (any character)
* (zero or more times) (non-greedy)
</div>
. (any character)
* (zero or more times) (non-greedy)
End Capture
* (zero or more times)
End Capture
</div>
. (any character)
* (zero or more times) (non-greedy)
<div
Any character not in ">"
* (zero or more times) (non-greedy)
class="content"
Any character not in ">"
* (zero or more times) (non-greedy)
>

Capture
. (any character)
* (zero or more times) (non-greedy)
Non-capturing Group
<div
Any character not in ">"
* (zero or more times) (non-greedy)
>

. (any character)
* (zero or more times) (non-greedy)
</div>
. (any character)
* (zero or more times) (non-greedy)
End Capture
* (zero or more times)
End Capture
</div>
. (any character)
* (zero or more times) (non-greedy)
<div
Any character not in ">"
* (zero or more times) (non-greedy)
class="break"
Any character not in ">"
* (zero or more times) (non-greedy)
></div>

. (any character)
* (zero or more times) (non-greedy)
</div>

Grtz,
--
Rik Wasmus


#8  06-24-2006  gerg
Re: web harvesting

$string = "This is my big ass string of stocks where each stock is separated by a space";

// Create an array out of each element of the string, and use the space to
// tell where each new element of the array begins and ends.
$stocks = explode(" ", $string);

foreach ($stocks as $stock) {
    // "stock" is the name of the field in the table.
    // Escape the value before it goes into the SQL string.
    $value = mysql_real_escape_string($stock);
    $sql = "INSERT INTO mytable (stock) VALUES ('$value')";
    $result = mysql_query($sql);
}

Something like that should work. Basically you're just breaking up the
string into one array. If the stocks aren't separated by a space, try to
explode by \n if they are in a list format.
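
If the data really is a plain list with one item per line, a rough sketch of the newline variant could look like this. The input string is hypothetical, and preg_split is used here instead of explode only so that both \r\n and \n line endings are handled:

```php
<?php
// Hypothetical input: one stock symbol per line, with mixed line endings.
$string = "ACME\r\nINIT\nGLOB\n";

// \R matches any newline sequence; PREG_SPLIT_NO_EMPTY drops the empty trailing piece.
$stocks = preg_split('/\R/', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($stocks);
```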

Good luck!

-g-

#9  06-25-2006  McHenry
Re: web harvesting


Rik,

This works great; however, when I try to view the contents of the array I am
only presented with a single element:

Array
(
[0] => Array
(
[0] => <div class="overview">
)

)



Here is the code I am using:

//Extract the content from the page
$pattern='%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$pattern=$pattern.'.*? #allow random content between starting overview and header ';
$pattern=$pattern.'<div[^>]*?class="header"[^>]*?> #start of header ';
$pattern=$pattern.'(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header ';
$pattern=$pattern.'</div> #end of header ';
$pattern=$pattern.'.*? #once again allow random content ';
$pattern=$pattern.'<div[^>]*?class="content"[^>]*?> #start of content ';
$pattern=$pattern.'(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content ';
$pattern=$pattern.'</div> #end of content ';
$pattern=$pattern.'.*? #I am not sure whether you need the code from this point on ';
$pattern=$pattern.'<div[^>]*?class="break"[^>]*?></div> #check for break ';
$pattern=$pattern.'.*? #some random content ';
$pattern=$pattern.'</div> #end of overview ';
$pattern=$pattern.'%six';

if (preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER)) {
    print_r($matches);
}



#10  06-25-2006  McHenry
Re: web harvesting


Maybe it should have been obvious, but I missed it. Anyway, I removed the
comments from inside the pattern string and it now works.
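
The likely reason: with the x modifier a # comment runs to the next newline. The pieces in my previous post were joined without any newlines, so the first # turned the rest of the pattern into one long comment and only the opening tag was left to match. A small sketch with a made-up, shortened pattern and input:

```php
<?php
$html = '<div class="header">ACME</div>';

// Pieces joined without newlines: the first # comments out everything after it.
$joined = '%<div[^>]*> # start of header ';
$joined .= '(?P<header>.*?) # header text ';
$joined .= '</div> # end of header ';
$joined .= '%sx';
preg_match($joined, $html, $a);
echo $a[0], "\n"; // only the opening tag is matched

// Same pieces, each ending in a newline: every comment stops at its own line end.
$lines = "%<div[^>]*> # start of header\n";
$lines .= "(?P<header>.*?) # header text\n";
$lines .= "</div> # end of header\n";
$lines .= '%sx';
preg_match($lines, $html, $b);
echo $b['header'], "\n";
```

So the comments can stay, as long as each piece ends with a real newline or the pattern is written as one multi-line string.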

I love the concept of the named match, which makes it very easy to reference
in an array. Very powerful.

Within the header I have a field I would like to capture between
<h1>field_here</h1>. I suspected I could achieve this by replacing:
(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)

with

(?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)

however nothing changed when I printed the array value of 'header'?
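
A note on why nothing changes, as far as I can tell: the (?:...) group is non-capturing and optional, so swapping the tag name inside it adds nothing to $matches, and 'header' still holds the whole header block. One possible approach (not something suggested in the thread) is a second, smaller match on the captured header. A sketch, assuming the field sits in an <h1> element; the sample string and the group name 'title' are made up:

```php
<?php
// Stand-in for one captured header block, e.g. $record['header'].
$header = "\n<h1 class=\"name\">ACME Corp</h1>\n<span>ASX</span>\n";

// Second pass: pull the <h1> text out of the header block.
if (preg_match('%<h1[^>]*>(?P<title>.*?)</h1>%is', $header, $m)) {
    echo trim($m['title']), "\n";
}
```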





