regular expression to extract text



  #1 (permalink)  
Old 11-25-2007
suzanne.boyle@gmail.com
 

Hi

I have an HTML file with headings followed by one or more paragraphs
like this

<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>

I'd like to extract the text of the headings and the related
paragraphs and insert them into a database. So far I've managed to
get the heading text but can't figure out how to get the associated
paragraphs. I've been using regular expressions; here is the
expression I have so far: <h2[.]*>(.+?)</h2>(.+?). This gets the text
of the headings but not the paragraphs, and now I'm basically stumped.

Any help would be appreciated.
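
Two details of the pattern in this post seem to explain the result: `[.]*` matches only literal dots (so it does not skip attributes), and a lazy `(.+?)` at the very end of a pattern stops after a single character. The sketch below illustrates both; the `~` delimiters and the `s` modifier are assumptions, since the post does not say how the pattern was run.

```php
<?php
// Assumption: the pattern is run with the "s" modifier so "." also matches newlines.
$html = "<h2>blah blah 1</h2>\n<p>more blah blah blah</p>\n";

preg_match('~<h2[.]*>(.+?)</h2>(.+?)~s', $html, $m);
var_dump($m[1]); // the heading text
var_dump($m[2]); // a single character: whatever directly follows </h2>

// [.]* only matches literal dots, so a heading tag with attributes is not matched.
$withAttr = '<h2 class="x">blah</h2>';
$dotClass = preg_match('~<h2[.]*>~', $withAttr);   // character class of literal dots
$negClass = preg_match('~<h2[^>]*>~', $withAttr);  // "anything up to the closing >"
var_dump($dotClass, $negClass);
```

Replacing `[.]*` with `[^>]*` is one common way to tolerate attributes; the trailing group needs something after it (or a greedy quantifier) before it can capture more than one character.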
  #2 (permalink)  
Old 11-25-2007
shimmyshack
 

On Nov 25, 9:48 pm, suzanne.bo...@gmail.com wrote:
> Hi
>
> I have an html file with headings followed by one or more paragraphs
> like this
>
> <h2>blah blah 1</h2>
> <p>more blah blah blah</p>
>
> <h2>blah blah 2</h2>
> <p>more blah blah blah</p>
> <p>even more blah blah blah</p>
>
> I'd like to extract the text of the headings and the related
> paragraphs and insert them into a database. So far I've managed to
> get the heading text but cant figure out how to get the associated
> paragraphs. I've been using regular expressions, here is the
> expression I have so far <h2[.]*>(.+?)</h2>(.+?). This gets the text
> of the headings but not the paragraphs and now I'm basically stumped.
>
> Any help would be appreciated.


You could do this another way, although regular expressions are a great way too.
Have you thought about using XML to do this?
Since you are obviously starting with something that is basically
XML, why not just load the string as XML (topping and tailing it if
needed) and then extract what you need using XPath?
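
A minimal sketch of that idea, assuming the fragment is well-formed and that PHP's DOM extension is available. The `<root>` wrapper, the variable names, and the shape of `$sections` are illustrative choices, not something given in the thread.

```php
<?php
// Sample markup from the original post; real input is assumed to be well-formed XML.
$fragment = "<h2>blah blah 1</h2>\n<p>more blah blah blah</p>\n\n"
          . "<h2>blah blah 2</h2>\n<p>more blah blah blah</p>\n"
          . "<p>even more blah blah blah</p>\n";

// "Top and tail" the fragment with a hypothetical <root> element so it has a single root.
$doc = new DOMDocument();
$doc->loadXML('<root>' . $fragment . '</root>');
$xpath = new DOMXPath($doc);

$sections = [];
foreach ($xpath->query('/root/h2') as $h2) {
    $paragraphs = [];
    // Walk the siblings after this heading and stop at the next <h2>.
    for ($node = $h2->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip whitespace text nodes between elements
        }
        if ($node->nodeName === 'h2') {
            break;
        }
        if ($node->nodeName === 'p') {
            $paragraphs[] = $node->textContent;
        }
    }
    $sections[] = ['heading' => $h2->textContent, 'paragraphs' => $paragraphs];
}
print_r($sections);
```

Each entry of `$sections` pairs one heading with all of its paragraphs, which maps directly onto a database insert per heading.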
  #3 (permalink)  
Old 11-26-2007
Kailash Nadh
 

Slightly unorthodox, but this works.

<?php

// $html is assumed to hold the HTML source as a string.
preg_match_all("/((<h2>(.+?)<\/h2>(.+?)<p>(.+?)<\/p>))/is", $html, $matches);
print_r($matches);

// $matches[3] holds the headings and $matches[5] holds the text of the
// first paragraph after each heading.
?>
  #4 (permalink)  
Old 11-26-2007
suzanne.boyle@gmail.com
 

The problem with using XML is that the HTML is coming from Word, so it
contains a lot of unnecessary crap and isn't valid XML. And since I
don't have much experience parsing XML in PHP, I thought it would be
easier to use regular expressions to extract the sections I want.

And I'm almost there now. The expression Kailash wrote almost works,
but it only gives the first paragraph after each heading. I just need
to work out how to extract the rest of the paragraphs.
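
One possible way to pick up the remaining paragraphs is to match in two passes: first grab each heading together with everything up to the next heading, then pull every paragraph out of that chunk. This is only a sketch on the sample markup from the first post; the variable names and the result layout are made up for illustration, and heavily nested or broken markup may still defeat it.

```php
<?php
$html = "<h2>blah blah 1</h2>\n<p>more blah blah blah</p>\n\n"
      . "<h2>blah blah 2</h2>\n<p>more blah blah blah</p>\n"
      . "<p>even more blah blah blah</p>\n";

// Pass 1: each <h2> plus everything up to the next <h2> or the end of the input.
// [^>]* lets the tags carry attributes; \b keeps <p from matching <pre>.
preg_match_all('~<h2\b[^>]*>(.*?)</h2>(.*?)(?=<h2\b|\z)~is', $html, $blocks, PREG_SET_ORDER);

$result = [];
foreach ($blocks as $block) {
    // Pass 2: every <p>...</p> inside this heading's chunk.
    preg_match_all('~<p\b[^>]*>(.*?)</p>~is', $block[2], $paras);

    $paragraphs = [];
    foreach ($paras[1] as $p) {
        $paragraphs[] = trim(strip_tags($p)); // drop any inline markup
    }
    $result[] = [
        'heading'    => trim(strip_tags($block[1])),
        'paragraphs' => $paragraphs,
    ];
}
print_r($result);
```

Splitting the work this way avoids having to express "one or more paragraphs" as repeated capture groups, which PCRE would collapse to the last repetition only.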
  #5 (permalink)  
Old 11-26-2007
Toby A Inkster
 

suzanne.boyle wrote:

> The problem with using xml is that the html is coming from Word so it
> contains a lot of unnecessary crap and isn't valid xml. And since I
> don't have much experience parsing xml in php I thought it would be
> easier to use regular expressions to extract the sections I want.


You could do worse than trying XML_HTMLSax3. I've previously posted an
example of using it to parse HTML:

http://tobyinkster.co.uk/blog/2007/0...table-parsing/

Note that it does not require documents to be well-formed XML.
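
The sketch below does not use XML_HTMLSax3. It swaps in PHP's bundled DOM extension, whose `DOMDocument::loadHTML()` is likewise lenient about markup that is not well-formed XML. The sample markup, the variable names, and the assumption that headings and paragraphs are siblings are all illustrative.

```php
<?php
// Deliberately sloppy markup: unquoted attributes, an unclosed <p>, no root element.
$html = '<h2 class=Heading>blah blah 2</h2>'
      . '<p class=MsoNormal>more blah blah blah'
      . '<p>even more blah blah blah</p>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$sections = [];
$n = 0;
foreach ($xpath->query('//h2') as $h2) {
    $n++;
    // Paragraphs that have exactly $n headings before them belong to heading $n
    // (assumes every <h2> and <p> shares the same parent element).
    $paragraphs = [];
    foreach ($xpath->query("//p[count(preceding-sibling::h2) = $n]") as $p) {
        $paragraphs[] = trim($p->textContent);
    }
    $sections[] = ['heading' => trim($h2->textContent), 'paragraphs' => $paragraphs];
}
print_r($sections);
```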

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 2 days, 2:18.]

It'll be in the Last Place You Look
http://tobyinkster.co.uk/blog/2007/11/21/no2id/