
web harvesting

#11 | 06-25-2006 | Rik | Re: web harvesting

McHenry wrote:
>> This works great however when I try to view the contents of the
>> array I am only presented with a single element:


>> Here is the code I am using:
>>
>> $pattern='%<div[^>]*?class="overview"[^>]*?> #start
>> of overview ';
>> $pattern=$pattern.'.*?


A comment runs from the # to the next newline. Because you concatenate
everything instead of just putting newlines inside the quotes, the comment
never ends and the expression breaks. Why do you concatenate, by the way?
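
A minimal sketch of the effect described above (not from the original post). The sample markup and the group name "body" are made up for illustration. With the x modifier, a comment that is never closed by a newline swallows the rest of the pattern, so only element 0 comes back:

```php
<?php
// Hypothetical sample input, invented for this illustration.
$html = '<div class="overview">hello</div>';

// Concatenated pieces: there is no newline after the comment, so under the
// x modifier everything after the # is treated as comment text.
$broken = '%<div[^>]*?class="overview"[^>]*?> #start of overview ';
$broken = $broken . '(?P<body>.*?)</div>%six';

// One string with real newlines: each comment ends at its own line end.
$fixed = '%<div[^>]*?class="overview"[^>]*?>  #start of overview
(?P<body>.*?)                                  #named match for the contents
</div>                                         #end of overview
%six';

preg_match($broken, $html, $m1);
preg_match($fixed, $html, $m2);

echo count($m1), "\n";    // only the whole match, the named group is gone
echo $m2['body'], "\n";   // the captured contents
```

Note that a comment inside the pattern must not contain the delimiter character (% here), because PHP looks for the closing delimiter before the regex engine ever sees the comments.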

> Maybe it should have been obvious, but I missed it. Anyway, I removed the
> comments from inside the pattern string and it now works.
>
> I love the concept of the named match, which makes it very easy to
> reference in an array. Very powerful.
>
> Within the header I have a field I would like to capture between
> <h1>field_here</h1>. I suspected I could achieve this by replacing:
> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)
>
> with
>
> (?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)
>
> however nothing changed when I printed the array value of 'header'?


That's correct behaviour, (:? means a NON capturing pattern.

If you only want the <h1> field from the header-div:

<div[^>]*?class="header"[^>]*>
.*?(:?<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?(:?<div[^>]*>.*?</div>.*?)*?
</div>


If you want the whole header-div, and the h1-field again as a separate match:
<div[^>]*?class="header"[^>]*>
(?P<header>.*?(:?<div[^>]*>.*?</div>.*?)*?
<h1>(?P<h1>.*?)</h1>
.*?(:?<div[^>]*>.*?</div>.*?)*?)
</div>
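
A minimal sketch of the second variant (not from the original post), written with the non-capturing form (?: and run against made-up markup. The sample HTML is an assumption; the group names "header" and "h1" are the ones used in the pattern above:

```php
<?php
// Hypothetical sample markup, invented for illustration.
$html = '<div class="header"><div>logo</div><h1>My title</h1><div>nav</div></div>';

$pattern = '%<div[^>]*?class="header"[^>]*>
    (?P<header>                          # whole header contents
        .*?(?:<div[^>]*>.*?</div>.*?)*?  # optional nested divs before the h1
        <h1>(?P<h1>.*?)</h1>             # the h1 text, as its own named match
        .*?(?:<div[^>]*>.*?</div>.*?)*?  # optional nested divs after the h1
    )
    </div>
%six';

if (preg_match($pattern, $html, $m)) {
    // Named groups appear under their name and under their numeric index.
    echo $m['h1'], "\n";
    echo $m['header'], "\n";
}
```

A named group can sit inside another named group; both end up in the result array, so the outer one gives the full header and the inner one just the title.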

Grtz,
--
Rik Wasmus


#12 | 06-25-2006 | McHenry | Re: web harvesting


"Rik" <luiheidsgoeroe@hotmail.com> wrote in message
news:aac61$449e3a5b$8259c69c$14679@news2.tudelft.nl...
> McHenry wrote:
>>> This works great however when I try to view the contents of the
>>> array I am only presented with a single element:

>
>>> Here is the code I am using:
>>>
>>> $pattern='%<div[^>]*?class="overview"[^>]*?> #start
>>> of overview ';
>>> $pattern=$pattern.'.*?

>
> A comment runs from the # to the next newline. Because you concatenate
> everything instead of just putting newlines inside the quotes, the comment
> never ends and the expression breaks. Why do you concatenate, by the way?


I thought this was the way I had to do it... (new to PHP, new to Linux, new
to many things).
Now I understand. I thought the comments were part of the regex and couldn't
understand how it worked... :)

>
>> Maybe it should have been obvious, but I missed it. Anyway, I removed the
>> comments from inside the pattern string and it now works.
>>
>> I love the concept of the named match, which makes it very easy to
>> reference in an array. Very powerful.
>>
>> Within the header I have a field I would like to capture between
>> <h1>field_here</h1>. I suspected I could achieve this by replacing:
>> (?P<header>.*?(?:<div[^>]*?>.*?</div>.*?)*)
>>
>> with
>>
>> (?P<header>.*?(?:<h2[^>]*?>.*?</h2>.*?)*)
>>
>> however nothing changed when I printed the array value of 'header'?

>
> That's correct behaviour, (:? means a NON capturing pattern.


Your original solution used (?: and not (:?. Is there a difference, or is
this a typo?

>
> If you only want the <h1> field from the header-div:
>
> <div[^>]*?class="header"[^>]*>
> .*?(:?<div[^>]*>.*?</div>.*?)*?
> <h1>(?P<header>.*?)</h1>
> .*?(:?<div[^>]*>.*?</div>.*?)*?
> </div>


Why do you use a ? after a *? I would have thought the two would be
mutually exclusive. For example, my understanding of
<div[^>]*?class="header"[^>]*> is:

match the pattern <div
match any character other than >
match 0 or more of the previous expression
match 0 or 1 of the previous expression
match the pattern class="header"
match any character other than >
match 0 or more of the previous expression
match the pattern >

I appreciate your assistance...

>
>
> If you want the whole header-div, and the h1-field again as a separate match:
> <div[^>]*?class="header"[^>]*>
> (?P<header>.*?(:?<div[^>]*>.*?</div>.*?)*?
> <h1>(?P<h1>.*?)</h1>
> .*?(:?<div[^>]*>.*?</div>.*?)*?)
> </div>
>
> Grtz,
> --
> Rik Wasmus
>
>



#13 | 06-25-2006 | Rik | Re: web harvesting

McHenry wrote:
>> A comment runs from the # to the next newline. Because you concatenate
>> everything instead of just putting newlines inside the quotes, the comment
>> never ends and the expression breaks. Why do you concatenate, by the way?

>
> I thought this was the way I had to do it... (new to PHP, new to
> Linux, new to many things).
> Now I understand. I thought the comments were part of the regex and
> couldn't understand how it worked... :)


Hehe, yeah, then it gets tricky :-).

>> That's correct behaviour, (:? means a NON capturing pattern.

>
> Your original solution used (?: and not (:?. Is there a difference, or is
> this a typo?


A typo, it should be (?:. Written as (:?, it opens a normal capturing group
that starts with a ":" matched zero or one time :-)
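
A small sketch of the difference (not from the original post), using a made-up two-letter subject:

```php
<?php
$subject = 'ab';

// (?:a) groups without capturing, so only (b) produces a sub-match.
preg_match('%(?:a)(b)%', $subject, $nonCapturing);
print_r($nonCapturing);   // elements: 'ab', 'b'

// (:?a) is an ordinary capturing group whose first item is an optional colon.
preg_match('%(:?a)(b)%', $subject, $typo);
print_r($typo);           // elements: 'ab', 'a', 'b'

// The same typo pattern also accepts a literal colon in front of the a.
preg_match('%(:?a)(b)%', ':ab', $withColon);
print_r($withColon);      // elements: ':ab', ':a', 'b'
```

So the typo usually still matches, but it shifts the numbering of every later group and lets a stray colon through.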

>> If you only want the <h1> field from the header-div:
>>
>> <div[^>]*?class="header"[^>]*>
>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>> <h1>(?P<header>.*?)</h1>
>> .*?(:?<div[^>]*>.*?</div>.*?)*?
>> </div>

>
> Why do you use a ? after a *? I would have thought the two would be
> mutually exclusive. For example, my understanding of
> *?



> match 0 or more of the previous expression
> match 0 or 1 of the previous expression


Nope, a ? after a * makes it non-greedy. It will give you back the shortest
match possible, instead of the longest.

To illustrate, say we want to capture the contents of the following divs:
$string = '<div>something</div><div>something else</div>';

preg_match_all('%<div>(.*)</div>%',$string,$match1);
preg_match_all('%<div>(.*?)</div>%',$string,$match2);

print_r($match1);
print_r($match2);

Will give:
Array
(
    [0] => Array
        (
            [0] => <div>something</div><div>something else</div>
        )

    [1] => Array
        (
            [0] => something</div><div>something else
        )

)
Array
(
    [0] => Array
        (
            [0] => <div>something</div>
            [1] => <div>something else</div>
        )

    [1] => Array
        (
            [0] => something
            [1] => something else
        )

)
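
The same applies to the [^>]*? in the div pattern: *? is one lazy quantifier, not a * followed by a separate "0 or 1". A rough sketch (not from the original post), with a made-up tag that repeats the class attribute only to make the difference visible, and with capture parentheses added around the quantified part:

```php
<?php
// Hypothetical tag; the doubled class attribute is there only for the demo.
$tag = '<div id="x" class="header" class="header" title="t">';

// Lazy: [^>]*? takes as few characters as possible before class="header".
preg_match('%<div([^>]*?)class="header"%', $tag, $lazy);

// Greedy: [^>]* takes as many as possible, then backs off to the last one.
preg_match('%<div([^>]*)class="header"%', $tag, $greedy);

var_dump($lazy[1]);    // stops before the first class="header"
var_dump($greedy[1]);  // runs up to the second class="header"
```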


--
Rik Wasmus


#14 | 06-26-2006 | McHenry | Re: web harvesting


Rik,

When I implement either of the two options above, the regex stops working:

$pattern='%<div[^>]*?class="overview"[^>]*?>
#start of overview
.*?
#allow random content between starting overview and header

<div[^>]*?class="header"[^>]*>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
<h1>(?P<header>.*?)</h1>
.*?
(?:<div[^>]*>.*?</div>.*?)*?
</div>

.*?
#once again allow random content
<div[^>]*?class="content"[^>]*?>
#start of content
(?P<content>.*?(?:<div[^>]*?>.*?</div>.*?)*)
#get a named match from the content
</div>
#end of content
.*?
#I am not sure whether you need the code from this point on
<div[^>]*?class="break"[^>]*?></div>
#check for break
.*?
#some random content
</div>
#end of overview
%six';
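
One way to narrow down where a long pattern like this stops matching is to test it in growing pieces and look at what preg_match() returns: 1 for a match, 0 for no match, false for an error. A rough sketch, assuming PHP 5.2 or later for preg_last_error(); the helper name debug_match and the sample markup are hypothetical and not part of this thread:

```php
<?php
// Hypothetical helper: describe the outcome of a single preg_match() call.
function debug_match($pattern, $subject)
{
    $result = preg_match($pattern, $subject, $m);
    if ($result === false) {
        // For example, the backtrack limit was hit or the pattern is invalid.
        return 'error code ' . preg_last_error();
    }
    if ($result === 0) {
        return 'no match';
    }
    return 'matched ' . strlen($m[0]) . ' bytes';
}

// Hypothetical sample page, invented for illustration.
$html = '<div class="overview"><div class="header"><h1>Title</h1></div></div>';

// Grow the pattern one stage at a time; the first stage that stops
// matching points at the part that needs attention.
$steps = array(
    '%<div[^>]*?class="overview"[^>]*?>%six',
    '%<div[^>]*?class="overview"[^>]*?>.*?<div[^>]*?class="header"[^>]*>%six',
    '%<div[^>]*?class="overview"[^>]*?>.*?<div[^>]*?class="header"[^>]*>.*?<h1>(?P<header>.*?)</h1>%six',
);

foreach ($steps as $i => $step) {
    echo 'step ', $i + 1, ': ', debug_match($step, $html), "\n";
}
```

Nested lazy quantifiers such as (?:...)*? inside .*? can also make the engine backtrack heavily on large pages, in which case preg_match() returns false rather than 0, which is why the helper keeps those two cases apart.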


I am trying to comprehend these expressions so I can solve them myself and
not trouble you. However, either these are very complex regexes or I am a
very slow learner... most likely the second :)

My breakdown and understanding of the regex above is:

<div[^>]*?class="overview"[^>]*?> #Match the start of the overview
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expression, only until the first occurrence of
the next match is found (non greedy)
match the string: class="overview"
match any character other than >
match 0 or more of the prev expression, only until the first occurrence of
the next match is found (non greedy)
match the string: >

.*? #Match any content between the overview and header
========================================
match any character
match 0 or more of the prev expression, only until the first occurrence of
the next match is found (non greedy)

<div[^>]*?class="header"[^>]*> #Match the header
========================================
match the string: <div
match any character other than >
match 0 or more of the prev expression, only until the first occurrence of
the next match is found (non greedy)
match the string: class="header"
match any character other than >
match 0 or more of the prev expression, until the last occurrence of the
next match is found (greedy)
match the string: >

(?:<div[^>]*>.*?</div>.*?)*? #Does this eliminate nested divs within the
header div ?
========================================
Non capturing pattern
match the string: <div
match any character other than >
match 0 or more of the prev expression, until the last occurrence of the
next match is found (greedy)
match the string: >
match any character
match 0 or more of the prev expression, only until the first occurrence of
the next match is found (non greedy)
match the string: </div>
match any character
match 0 or more of the prev expression, only until the first occurrence of
the next match is found (non greedy)
match 0 or more of the prev expression in brackets, only until the first
occurrence of the next match is found (non greedy)

<h1>(?P<header>.*?)</h1> #Match the contents between the h1 tags
========================================
match the string: <h1>
capture all chars, only until the first occurrence of the next match is
found (non greedy), and name the subpattern
match the string: </h1>


Thanks for all your help so far and I think I'm getting there...

