This is a discussion on web harvesting within the alt.comp.lang.php forums, part of the PHP Programming Forums category.
I have a simple task: query a number of pages, read the data, and save it into a database.
Each page has repeating data, similar to a listing of stock quotes where each page lists 100 stocks, etc.

a) I can query the web and store the page in a variable
b) I can update the database with the data

I cannot work out the best way to process the variable holding the web page to extract the required data; at present it is simply one large string in a variable.

Any pointers would be greatly appreciated...
|
McHenry wrote:
> I cannot work out the best way to process the variable of the web page to
> extract the required data, presently it is simply one large string in a
> variable.

There is nothing wrong with a large string. Use preg_match or similar to filter out the data.

Can you give an example of a page you retrieve and the data you want out of it?

arjen
|
"Arjen" <dont@mail.me> wrote in message news:e7gqjh$3dl$2@brutus.eur.nl...
> Can you give an example of what page u retrieve and what data u want out
> of it ?

The data is somewhat variable; however, the following structure is repeated for each record on the HTML page:

<div class="Overview">
  <div class="header">
    ***SNIP***
  </div>
  <div class="content">
    ***SNIP***
  </div>
  <div class="break"></div>
</div>

As this structure is repeated over and over for each record, I understand I should use preg_match_all to extract all matches and place them in an array. I would like to:

a) match the entire pattern and have it stored in array[0][0]
b) match the header component as a parenthesised subpattern and have it stored in array[1][0]
c) match the content component as a parenthesised subpattern and have it stored in array[2][0]

Thanks once again...
|
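The array layout asked for above is the one preg_match_all produces with PREG_PATTERN_ORDER. The following is only a minimal sketch: the two-record page in $content and its values are made up, and the pattern is deliberately strict (no nested divs, no extra attributes), so it is not the pattern that real pages would need.

```php
<?php
// Hypothetical sample page: two records in the structure described in the post.
$content = <<<HTML
<div class="Overview">
  <div class="header">ACME Corp</div>
  <div class="content">12.34</div>
  <div class="break"></div>
</div>
<div class="Overview">
  <div class="header">Globex</div>
  <div class="content">56.78</div>
  <div class="break"></div>
</div>
HTML;

// "%" is the delimiter; the trailing "s" lets "." match newlines.
$pattern = '%<div class="Overview">\s*'
         . '<div class="header">(.*?)</div>\s*'
         . '<div class="content">(.*?)</div>\s*'
         . '<div class="break"></div>\s*'
         . '</div>%s';

$count = preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full matches, $matches[1] the headers and
// $matches[2] the contents; the second index is the record number.
for ($i = 0; $i < $count; $i++) {
    echo $matches[1][$i], ' => ', $matches[2][$i], "\n";
}
```

With this ordering, array[0][0], array[1][0] and array[2][0] are the full text, header and content of the first record, and index [..][1] addresses the second record.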
"McHenry" <mchenry@mchenry.com> wrote in message news:449caa3c$0$6668$5a62ac22@per-qv1-newsreader-01.iinet.net.au...
> As this structure is repeated over and over for each record I understand I
> should use preg_match_all to extract all matches and place them in an
> array.

I have formulated the following regex (my first regex ever). It seems to work when I test it using http://www.regexlib.com/RETester.aspx, but when I try to implement it in my PHP code it fails:

<div class=\"Overview\">((?s).*)(<div class=\"header\">((?s).*)</div>)((?s).*)(<div class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">

When I try to run the code I receive the following error:

PHP Warning: Unknown modifier '(' in /var/www/html/research/processweb.php on line 98

$pattern="<div class=\"Overview\">((?s).*)(<div class=\"header\">((?s).*)</div>)((?s).*)(<div class=\"content\">((?s).*)</div>)((?s).*)<div class=\"break\">";
if (preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER)) {
    echo $matches[0][0]."\n";
    echo $matches[1][0]."\n";
}
|
McHenry wrote:
> When I try to run the code I receive the following error:
> PHP Warning: Unknown modifier '(' in
> /var/www/html/research/processweb.php on line 98

The first character is taken as the delimiter, so your regex stops after \"Overview\">, and everything after that is treated as a modifier.
I assume your '***SNIP***'s are the actual content you'd like to obtain?

The Society for Understandable Regular Expressions brings you:

$pattern = '%<div[^>]*?class="overview"[^>]*?> #start of overview
    .*? #allow random content between starting overview and header
    <div[^>]*?class="header"[^>]*?> #start of header
    (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header
    </div> #end of header
    .*? #once again allow random content
    <div[^>]*?class="content"[^>]*?> #start of content
    (?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content
    </div> #end of content
    .*? #I am not sure whether you need the code from this point on
    <div[^>]*?class="break"[^>]*?></div> #check for break
    .*? #some random content
    </div> #end of overview
    %six';
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

Some items explained:
% is chosen as the delimiter of the regex here. Usually / is chosen, but as this is HTML it would constantly have to be escaped. Choosing another delimiter saves work.
[^>]*? allows a div to have other attributes besides the class name, so it will still be picked up.
(?:<div[^>]*?>.*?</div>.*?)* allows divs to be nested in the header/content div, so the whole div is still matched, not just up to where the first child div closes.
(?: here means it's a non-capturing group: we won't see it in $matches, and we don't need to, as its text is already contained in the named match.

Modifiers:
s = . matches \n
i = case-insensitive
x = we can use line breaks & comments in our regex to keep it clear

Grtz,
--
Rik Wasmus
|
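The delimiter point can be checked in isolation. In this small sketch the one-record $content string and the shortened patterns are assumptions made for illustration; only the failure mode is the one from the thread. PHP accepts "<" and ">" as a bracket-style delimiter pair, which is why the undelimited pattern ends at the first ">".

```php
<?php
// Hypothetical one-record sample, not taken from the thread.
$content = '<div class="Overview"><div class="header">ACME</div></div>';

// No delimiters: PCRE reads the leading "<" as the opening delimiter and the
// first ">" as its closing partner, then tries to parse "(" as a modifier.
$undelimited = '<div class="Overview">(.*)</div>';
$failed = @preg_match($undelimited, $content);   // "@" hides the warning

// Same expression wrapped in "%" delimiters, with "s" so "." spans newlines.
$delimited = '%<div class="Overview">(.*)</div>%s';
$ok = preg_match($delimited, $content, $m);

var_dump($failed);   // false: the pattern could not be compiled
var_dump($ok);       // int(1)
echo $m[1], "\n";
```

Any character that does not occur in the pattern works as a delimiter; "%" is convenient for HTML because "/" appears in every closing tag.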
"Rik" <luiheidsgoeroe@hotmail.com> wrote in message news:c538e$449d1d18$8259c69c$3417@news2.tudelft.nl ...
> The Society for Understandable Regular Expressions brings you:

WOW Rik... it's a little different from my attempt :)

Thank you very much, as this would have taken me a few... YEARS!

Not to question, but I am trying to understand what you have provided, and I am unable to get the pattern to work here for learning purposes: http://www.regexlib.com/RETester.aspx

Should I not rely on this tool, or am I missing something?

Thanks once again...
|
McHenry wrote:
> Not to question but I am trying to understand what you have provided
> and I am unable to get the pattern to work here for learning purposes:
> http://www.regexlib.com/RETester.aspx

.NET regex is slightly different from PHP's Perl-compatible regex. Remove the comments, delimiters, modifiers, and ?P<name> and usually it's OK.

My favourite tool for deciphering other people's regexes is Regex Workbench, which also isn't fully compatible, but mostly gets the job done. It interprets this pattern as follows:

<div
Any character not in ">" * (zero or more times) (non-greedy)
class="overview"
Any character not in ">" * (zero or more times) (non-greedy)
>
. (any character) * (zero or more times) (non-greedy)
<div
Any character not in ">" * (zero or more times) (non-greedy)
class="header"
Any character not in ">" * (zero or more times) (non-greedy)
>
Capture
  . (any character) * (zero or more times) (non-greedy)
  Non-capturing Group
    <div
    Any character not in ">" * (zero or more times) (non-greedy)
    >
    . (any character) * (zero or more times) (non-greedy)
    </div>
    . (any character) * (zero or more times) (non-greedy)
  End Capture
  * (zero or more times)
End Capture
</div>
. (any character) * (zero or more times) (non-greedy)
<div
Any character not in ">" * (zero or more times) (non-greedy)
class="content"
Any character not in ">" * (zero or more times) (non-greedy)
>
Capture
  . (any character) * (zero or more times) (non-greedy)
  Non-capturing Group
    <div
    Any character not in ">" * (zero or more times) (non-greedy)
    >
    . (any character) * (zero or more times) (non-greedy)
    </div>
    . (any character) * (zero or more times) (non-greedy)
  End Capture
  * (zero or more times)
End Capture
</div>
. (any character) * (zero or more times) (non-greedy)
<div
Any character not in ">" * (zero or more times) (non-greedy)
class="break"
Any character not in ">" * (zero or more times) (non-greedy)
></div>
. (any character) * (zero or more times) (non-greedy)
</div>

Grtz,
--
Rik Wasmus
|
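Dropping ?P<name> for another tester changes how the captures are addressed, not what is captured: in PHP a named group is also stored under its numeric index. A small sketch, assuming a made-up two-record string and a cut-down pattern rather than the full one from the thread:

```php
<?php
// Hypothetical sample with two records.
$content = '<div class="header">ACME</div><div class="content">12.34</div>'
         . '<div class="header">Globex</div><div class="content">56.78</div>';

// "x" ignores the layout whitespace, "s" lets "." match newlines,
// "i" makes the match case-insensitive.
$pattern = '%<div[^>]*?class="header"[^>]*?>(?P<header>.*?)</div>
            \s*
            <div[^>]*?class="content"[^>]*?>(?P<content>.*?)</div>%six';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

foreach ($matches as $record) {
    // Each named group is available by name and by position.
    echo $record['header'], ' / ', $record[1], ' => ', $record['content'], "\n";
}
```

With PREG_SET_ORDER each element of $matches is one complete record, which tends to be the more convenient shape when every record becomes one database row.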
Arjen wrote:
> Nothing wrong with a large string. Use preg_match or so to filer out the
> data.

$string = "This is my big ass string of stocks where each stock is separated by a space";
// Create an array out of each element of the string, using the space to tell
// where each new element of the array begins and ends.
$stocks = explode(" ", $string);
foreach ($stocks as $stock) {
    // "stock" is the name of the field in the table
    $sql = "INSERT INTO mytable (stock) VALUES (\"$stock\")";
    $result = mysql_query($sql);
}

Something like that should work. Basically you're just breaking up the string into one array. If the stocks aren't separated by a space, try exploding on \n if they are in a list format.

Good luck!
-g-
|
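The mysql_query function used above belongs to the legacy mysql extension, which was removed in PHP 7.0. The same split-and-insert idea might look like this with PDO and a prepared statement. This is a sketch under assumptions: an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for the real one, the symbols in $string are invented, and only the table and field names come from the post.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real one;
// "mytable" and "stock" are the names used in the post.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE mytable (stock TEXT)');

$string = "AAA BBB  CCC\nDDD";   // hypothetical symbols, mixed separators

// Split on any run of whitespace and drop empty pieces.
$stocks = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

// Prepare once, execute per value; the driver handles the quoting.
$insert = $pdo->prepare('INSERT INTO mytable (stock) VALUES (:stock)');
foreach ($stocks as $stock) {
    $insert->execute([':stock' => $stock]);
}

$rows = $pdo->query('SELECT stock FROM mytable ORDER BY rowid')
            ->fetchAll(PDO::FETCH_COLUMN);
print_r($rows);
```

Splitting with preg_split covers both the space-separated and the one-per-line case mentioned in the post, and the bound parameter avoids building SQL from scraped text.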
"Rik" <luiheidsgoeroe@hotmail.com> wrote in message news:c538e$449d1d18$8259c69c$3417@news2.tudelft.nl ...
> preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

Rik,

This works great; however, when I try to view the contents of the array I am presented with only a single element:

Array
(
    [0] => Array
        (
            [0] => <div class="overview">
        )

)

Here is the code I am using:

//Extract the content from the page
$pattern='%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$pattern=$pattern.'.*? #allow random content between starting overview and header ';
$pattern=$pattern.'<div[^>]*?class="header"[^>]*?> #start of header ';
$pattern=$pattern.'(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the header ';
$pattern=$pattern.'</div> #end of header ';
$pattern=$pattern.'.*? #once again allow random content ';
$pattern=$pattern.'<div[^>]*?class="content"[^>]*?> #start of content ';
$pattern=$pattern.'(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content ';
$pattern=$pattern.'</div> #end of content ';
$pattern=$pattern.'.*? #I am not sure whether you need the code from this point on ';
$pattern=$pattern.'<div[^>]*?class="break"[^>]*?></div> #check for break ';
$pattern=$pattern.'.*? #some random content ';
$pattern=$pattern.'</div> #end of overview ';
$pattern=$pattern.'%six';

if (preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER)) {
    print_r($matches);
}
|
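A likely explanation for the single-element result: under the x modifier a "#" comment runs to the next newline, and a pattern concatenated from single-line pieces contains no newlines, so everything after the first "#" is treated as comment. The sketch below reproduces this with a shortened pattern and a made-up $content string; both are assumptions for illustration.

```php
<?php
// Hypothetical one-record sample.
$content = '<div class="overview"><div class="header">ACME</div></div>';

// Pieces joined without newlines: the first "#" starts a comment that
// swallows the rest of the pattern.
$oneLine  = '%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$oneLine .= '<div[^>]*?class="header"[^>]*?>(?P<header>.*?)</div> #header ';
$oneLine .= '%six';
preg_match($oneLine, $content, $a);

// Same pieces, each comment terminated by a newline.
$multiLine  = '%<div[^>]*?class="overview"[^>]*?> #start of overview' . "\n";
$multiLine .= '<div[^>]*?class="header"[^>]*?>(?P<header>.*?)</div> #header' . "\n";
$multiLine .= '%six';
preg_match($multiLine, $content, $b);

print_r($a);              // only the opening overview tag is matched
echo $b['header'], "\n";
```

Either ending each piece with "\n" or removing the comments restores the full pattern.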
"McHenry" <mchenry@mchenry.com> wrote in message news:449de5f2$0$6645$5a62ac22@per-qv1-newsreader-01.iinet.net.au...
> This works great however when I try to view the contents of the array I am
> only presented with a single element:

Maybe it should have been obvious, but I missed it: I removed the comments from inside the pattern string and it now works.

I love the concept of the named match, which makes it very easy to reference in an array; very powerful.

Within the header I have a field I would like to capture, between <h1>field_here</h1>. I suspected I could achieve this by replacing:

(?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)

with

(?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)

however, nothing changed when I printed the array value of 'header'.
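One reading of why nothing changed: the (?:...)* part is non-capturing and optional, so swapping div for h2 inside it adds no new capture, and the leading .*? of the named group already matches the heading text either way. A possible approach is a second, smaller match on the captured header. The sketch assumes the field sits in an <h1> as described (the replacement above uses h2), and the group name title and the sample header text are hypothetical.

```php
<?php
// Hypothetical header text, as it might come back in the 'header' capture.
$header = '<h1>ACME Corp</h1><span class="code">ACM</span>';

// Second pass over the captured header: one named group for the <h1> body.
if (preg_match('%<h1[^>]*>(?P<title>.*?)</h1>%si', $header, $m)) {
    echo $m['title'], "\n";
}
```

The same named group could instead be written directly into the larger pattern, inside the header div, at the cost of making that pattern stricter about what the header must contain.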