This is a discussion on web harvesting within the alt.comp.lang.php forums, part of the PHP Programming Forums category.
McHenry wrote:
>> This works great however when I try to view the contents of the
>> array I am only presented with a single element:
>> Here is the code I am using:
>>
>> $pattern='%<div[^>]*?class="overview"[^>]*?> #start of overview ';
>> $pattern=$pattern.'.*?

The comment runs from the # to the next newline. Because you concatenate everything instead of just putting newlines inside the quotes, the expression breaks. Why do you concatenate, by the way?

> Maybe it should have been obvious but I missed it; anyway, I removed the
> comments from inside the pattern string and it now works.
>
> I love the concept of the named match, which makes it very easy to
> reference in an array. Very powerful.
>
> Within the header I have a field I would like to capture between
> <h1>field_here</h1>. I suspected I could achieve this by replacing:
> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)
>
> with
>
> (?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)
>
> however nothing changed when I printed the array value of 'header'?

That's correct behaviour, (:? means a NON capturing pattern.

If you only want the <h1> field from the header div:

<div[^>]*?class="header"[^>]*>
.*?(:?<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?(:?<div[^>]*>.*?</div>.*?)*?
</div>

If you want the whole header div and the h1 field again separately:

<div[^>]*?class="header"[^>]*>
(?P<header>.*?(:?<div[^>]*>.*?</div>.*?)*?
<h1>(?P<h1>.*?)</h1>
.*?(:?<div[^>]*>.*?</div>.*?)*?)
</div>

Grtz,
--
Rik Wasmus
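A minimal sketch of the comment issue described above, assuming PCRE's x (extended) modifier as used in the %six flags. The sample HTML and the group name body are made up for illustration and are not taken from McHenry's page. When the pieces are concatenated on one line, everything after the first # is read as a comment, so only the opening tag remains in the pattern:

```php
<?php
// Hypothetical sample input, not from the original page.
$html = '<div class="overview"><p>text</p></div>';

// Concatenated without newlines: the first # comments out the rest of the pattern.
$broken = '%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$broken = $broken . '(?P<body>.*?) #body ';
$broken = $broken . '</div>%six';

// Newlines inside the quotes: each comment ends at its own line break.
$working = '%<div[^>]*?class="overview"[^>]*?>  #start of overview
(?P<body>.*?)                                    #named match for the body
</div>                                           #end of overview
%six';

preg_match($broken, $html, $m1);
preg_match($working, $html, $m2);

print_r($m1);  // only element 0, the opening tag
print_r($m2);  // element 0 plus 'body' (and its numeric twin, 1)
```

With this input the first call should return a single element, which is consistent with the single-element array reported in the question, while the second exposes $m2['body'].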

"Rik" <luiheidsgoeroe@hotmail.com> wrote in message news:aac61$449e3a5b$8259c69c$14679@news2.tudelft.n l...
> McHenry wrote:
>>> This works great however when I try to view the contents of the
>>> array I am only presented with a single element:
>>> Here is the code I am using:
>>>
>>> $pattern='%<div[^>]*?class="overview"[^>]*?> #start of overview ';
>>> $pattern=$pattern.'.*?
>
> The comment runs from the # to the next newline. Because you concatenate
> everything instead of just putting newlines inside the quotes, the
> expression breaks. Why do you concatenate, by the way?

I thought this was the way I had to do it... (new to PHP, new to Linux, new to many things)
Now I understand. I thought the comments were part of the regex and couldn't understand how it worked... :)

>> Maybe it should have been obvious but I missed it; anyway, I removed the
>> comments from inside the pattern string and it now works.
>>
>> I love the concept of the named match, which makes it very easy to
>> reference in an array. Very powerful.
>>
>> Within the header I have a field I would like to capture between
>> <h1>field_here</h1>. I suspected I could achieve this by replacing:
>> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)
>>
>> with
>>
>> (?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)
>>
>> however nothing changed when I printed the array value of 'header'?
>
> That's correct behaviour, (:? means a NON capturing pattern.

Your original solution used (?: and not (:?. Is there a difference, or is this a typo?

> If you only want the <h1> field from the header div:
>
> <div[^>]*?class="header"[^>]*>
> .*?(:?<div[^>]*>.*?</div>.*?)*?
> <h1>(?P<header>.*?)</h1>
> .*?(:?<div[^>]*>.*?</div>.*?)*?
> </div>

Why do you use a ? after a * here? I would have thought the usage of these would be mutually exclusive. For example, my understanding of <div[^>]*?class="header"[^>]*> is:

match the pattern <div
match any character other than >
match 0 or more of the previous expression
match 0 or 1 of the previous expression
match the pattern class="header"
match any character other than >
match 0 or more of the previous expression
match the pattern >

I appreciate your assistance...

> If you want the whole header div and the h1 field again separately:
>
> <div[^>]*?class="header"[^>]*>
> (?P<header>.*?(:?<div[^>]*>.*?</div>.*?)*?
> <h1>(?P<h1>.*?)</h1>
> .*?(:?<div[^>]*>.*?</div>.*?)*?)
> </div>
>
> Grtz,
> --
> Rik Wasmus

McHenry wrote:
>> The comment runs from the # to the next newline. Because you concatenate
>> everything instead of just putting newlines inside the quotes, the
>> expression breaks. Why do you concatenate, by the way?
>
> I thought this was the way I had to do it... (new to PHP, new to
> Linux, new to many things)
> Now I understand. I thought the comments were part of the regex and
> couldn't understand how it worked... :)

Hehe, yeah, then it gets tricky :-).

>> That's correct behaviour, (:? means a NON capturing pattern.
>
> Your original solution used (?: and not (:?. Is there a difference, or is
> this a typo?

Typo, it should be (?:. Written as (:? it opens an ordinary capturing group that starts by matching a ":" zero or one time :-)

>> If you only want the <h1> field from the header div:
>>
>> <div[^>]*?class="header"[^>]*>
>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>> <h1>(?P<header>.*?)</h1>
>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>> </div>
>
> Why do you use a ? after a * here? I would have thought the usage of these
> would be mutually exclusive. For example, my understanding of
> *?
> match 0 or more of the previous expression
> match 0 or 1 of the previous expression

Nope, a ? directly after a * makes the * non-greedy. It will give you back the shortest match possible, instead of the longest.

To illustrate, say we want to capture the contents of the following divs:

$string = '<div>something</div><div>something else</div>';

preg_match_all('%<div>(.*)</div>%',$string,$match1);
preg_match_all('%<div>(.*?)</div>%',$string,$match2);

print_r($match1);
print_r($match2);

Will give:

Array
(
    [0] => Array
        (
            [0] => <div>something</div><div>something else</div>
        )

    [1] => Array
        (
            [0] => something</div><div>something else
        )

)
Array
(
    [0] => Array
        (
            [0] => <div>something</div>
            [1] => <div>something else</div>
        )

    [1] => Array
        (
            [0] => something
            [1] => something else
        )

)

--
Rik Wasmus
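To make the (?: versus (:? distinction concrete, here is a small sketch. The subject strings are invented for illustration; only the two group forms come from the thread. The non-capturing form adds nothing to the result array, while the mistyped form behaves as a normal numbered group that happens to accept a leading colon:

```php
<?php
$subject = 'ababc';

// (?:...) groups without capturing, so only (c) gets a number.
preg_match('%(?:ab)+(c)%', $subject, $nonCapturing);

// (:?...) is a capturing group whose first token is an optional ':'.
preg_match('%(:?ab)+(c)%', $subject, $typo);

// The optional colon really is part of the group: it matches ':ab' too.
preg_match('%(:?ab)%', ':ab', $withColon);

print_r($nonCapturing); // [0] => ababc, [1] => c
print_r($typo);         // [0] => ababc, [1] => ab, [2] => c
print_r($withColon);    // [0] => :ab,   [1] => :ab
```

On ordinary HTML without colons, both forms match the same text; the visible difference is the extra numbered entries in the result array.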

"Rik" <luiheidsgoeroe@hotmail.com> wrote in message news:8b14a$449ebfe2$8259c69c$19227@news2.tudelft.n l...
> McHenry wrote:
>>> The comment runs from the # to the next newline. Because you concatenate
>>> everything instead of just putting newlines inside the quotes, the
>>> expression breaks. Why do you concatenate, by the way?
>>
>> I thought this was the way I had to do it... (new to PHP, new to
>> Linux, new to many things)
>> Now I understand. I thought the comments were part of the regex and
>> couldn't understand how it worked... :)
>
> Hehe, yeah, then it gets tricky :-).
>
>>> That's correct behaviour, (:? means a NON capturing pattern.
>>
>> Your original solution used (?: and not (:?. Is there a difference, or is
>> this a typo?
>
> Typo, it should be (?:. Written as (:? it opens an ordinary capturing
> group that starts by matching a ":" zero or one time :-)
>
>>> If you only want the <h1> field from the header div:
>>>
>>> <div[^>]*?class="header"[^>]*>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> <h1>(?P<header>.*?)</h1>
>>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>>> </div>
>>
>> Why do you use a ? after a * here? I would have thought the usage of these
>> would be mutually exclusive. For example, my understanding of
>> *?
>> match 0 or more of the previous expression
>> match 0 or 1 of the previous expression
>
> Nope, a ? directly after a * makes the * non-greedy. It will give you back
> the shortest match possible, instead of the longest.
>
> To illustrate, say we want to capture the contents of the following divs:
> $string = '<div>something</div><div>something else</div>';
>
> preg_match_all('%<div>(.*)</div>%',$string,$match1);
> preg_match_all('%<div>(.*?)</div>%',$string,$match2);
>
> print_r($match1);
> print_r($match2);
>
> Will give:
> Array
> (
>     [0] => Array
>         (
>             [0] => <div>something</div><div>something else</div>
>         )
>
>     [1] => Array
>         (
>             [0] => something</div><div>something else
>         )
>
> )
> Array
> (
>     [0] => Array
>         (
>             [0] => <div>something</div>
>             [1] => <div>something else</div>
>         )
>
>     [1] => Array
>         (
>             [0] => something
>             [1] => something else
>         )
>
> )
>
> --
> Rik Wasmus

Rik,

When I implement either of the two options above, the regex stops working:

$pattern='%<div[^>]*?class="overview"[^>]*?> #start of overview
.*? #allow random content between starting overview and header
<div[^>]*?class="header"[^>]*>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
</div>
.*? #once again allow random content
<div[^>]*?class="content"[^>]*?> #start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*) #get a named match from the content
</div> #end of content
.*? #I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div> #check for break
.*? #some random content
</div> #end of overview
%six';

I am trying to comprehend these expressions so I can solve them myself and not trouble you; however, these are either very complex regexes or I am a very slow learner... most likely the second :)

My breakdown and understanding of the regex above is:

<div[^>]*?class="overview"[^>]*?> #Match the start of the overview
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of the next match is found (non greedy)
match the string: class="overview"
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of the next match is found (non greedy)
match the string: >

.*? #Match any content between the overview and header
========================================
match any character
match 0 or more of the prev expressions only until the first occurrence of the next match is found (non greedy)

<div[^>]*?class="header"[^>]*> #Match the header
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expressions only until the first occurrence of the next match is found (non greedy)
match the string: class="header"
match any character other than >
match 0 or more of the prev expressions until the last occurrence of the next match is found (greedy)
match the string: >

(?:<div[^>]*>.*?</div>.*?)*? #Does this eliminate nested divs within the header div ?
========================================
Non capturing pattern
match the string: <div
match any character other than >
match 0 or more of the prev expressions until the last occurrence of the next match is found (greedy)
match the string: >
match any character
match 0 or more of the prev expressions only until the first occurrence of the next match is found (non greedy)
match the string: </div>
match any character
match 0 or more of the prev expressions only until the first occurrence of the next match is found (non greedy)
match 0 or more of the prev expression in brackets only until the first occurrence of the next match is found (non greedy)

<h1>(?P<header>.*?)</h1> #Match the contents between the h1 tags
========================================
match the string: <h1>
capture all chars only until the first occurrence of the next match is found (non greedy) and name the subpattern
match the string: </h1>

Thanks for all your help so far and I think I'm getting there...
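
For reference, a small sketch of the header part on its own, using the (?: form and both named groups (header and h1) from earlier in the thread. The sample markup is invented, since the real page is not shown here, so this only illustrates how the named matches come back and does not diagnose why the full pattern stops matching:

```php
<?php
// Hypothetical sample markup; the real page is not shown in the thread.
$html = '<div class="header"><div class="logo">x</div>'
      . '<h1>Title</h1><p>sub</p></div>';

$pattern = '%<div[^>]*?class="header"[^>]*>   #start of header
(?P<header>                                   #whole header body
  .*?(?:<div[^>]*>.*?</div>.*?)*?             #content and divs before the h1
  <h1>(?P<h1>.*?)</h1>                        #the h1 text on its own
  .*?(?:<div[^>]*>.*?</div>.*?)*?             #content and divs after the h1
)
</div>                                        #end of header
%six';

if (preg_match($pattern, $html, $m)) {
    echo $m['h1'], "\n";      // Title
    echo $m['header'], "\n";  // everything between the header tags
}
```

Testing one block like this in isolation, then adding the overview and content parts back one at a time, is one way to find which piece of a long pattern stops it from matching.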