This is a discussion on regular expression to extract text within the PHP Language forums, part of the PHP Programming Forums category.

Hi

I have an html file with headings followed by one or more paragraphs like this

<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>

I'd like to extract the text of the headings and the related paragraphs and insert them into a database. So far I've managed to get the heading text but can't figure out how to get the associated paragraphs. I've been using regular expressions; here is the expression I have so far:

<h2[.]*>(.+?)</h2>(.+?)

This gets the text of the headings but not the paragraphs, and now I'm basically stumped.

Any help would be appreciated.

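To see why an expression like the one in the question returns the heading but not the paragraphs, here is a minimal sketch. The sample string is a hypothetical one-line version of the markup, and the `#` delimiters are an assumption, since the post does not show which delimiters or flags were used. Two things stand out: the trailing `(.+?)` is lazy with nothing after it, so it stops after a single character, and `[.]*` matches only literal dots, so an `<h2>` tag carrying attributes would not match at all.

```php
<?php
// Hypothetical single-line sample, modelled on the markup in the question.
$html = '<h2>blah blah 1</h2><p>more blah blah blah</p>';

// The pattern from the question, wrapped in # delimiters (an assumption)
// so that "/" in the closing tag needs no escaping.
preg_match('#<h2[.]*>(.+?)</h2>(.+?)#', $html, $m);

// Group 1 is the heading text.
// Group 2 is lazy and nothing follows it, so it stops after one character.
echo $m[1], "\n"; // blah blah 1
echo $m[2], "\n"; // <

// [.]* means "zero or more literal dots", so attributes break the match;
// [^>]* ("anything up to the closing >") is the usual way to allow them.
$tag = '<h2 class="title">';
var_dump(preg_match('#<h2[.]*>#', $tag));  // int(0)
var_dump(preg_match('#<h2[^>]*>#', $tag)); // int(1)
```
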
On Nov 25, 9:48 pm, suzanne.bo...@gmail.com wrote:
> Hi
>
> I have an html file with headings followed by one or more paragraphs
> like this
>
> <h2>blah blah 1</h2>
> <p>more blah blah blah</p>
>
> <h2>blah blah 2</h2>
> <p>more blah blah blah</p>
> <p>even more blah blah blah</p>
>
> I'd like to extract the text of the headings and the related
> paragraphs and insert them into a database. So far I've managed to
> get the heading text but cant figure out how to get the associated
> paragraphs. I've been using regular expressions, here is the
> expression I have so far <h2[.]*>(.+?)</h2>(.+?). This gets the text
> of the headings but not the paragraphs and now I'm basically stumped.
>
> Any help would be appreciated.

You could do this another way, although regular expressions are a great way. Have you thought about using XML to do this? Since you are obviously starting with something which is basically XML, why not just load the string as XML (topping and tailing it if needed) and then extract using XPath?

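A rough sketch of the parse-and-query approach suggested above, using PHP's bundled DOM extension. It swaps in `DOMDocument::loadHTML()` rather than a strict XML load, because the HTML parser accepts markup that is not well-formed XML. The sample input is copied from the question; the `$sections` array layout is purely illustrative, and the code assumes each `<h2>` and its paragraphs are siblings under the same parent.

```php
<?php
// Sample input, copied from the markup in the question.
$html = <<<HTML
<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);            // HTML parser: input need not be well-formed XML
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$sections = [];

foreach ($xpath->query('//h2') as $h2) {
    $paragraphs = [];
    // Walk forward through the siblings until the next <h2>.
    for ($node = $h2->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip whitespace text nodes between elements
        }
        if ($node->nodeName === 'h2') {
            break;    // the next section starts here
        }
        if ($node->nodeName === 'p') {
            $paragraphs[] = trim($node->textContent);
        }
    }
    $sections[] = [
        'heading'    => trim($h2->textContent),
        'paragraphs' => $paragraphs,
    ];
}

print_r($sections);
```

Each entry in `$sections` then holds one heading and its paragraphs, ready to be written to the database. Real-world input with non-ASCII characters would also need its character encoding declared, which this sketch does not cover.
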
The problem with using XML is that the HTML is coming from Word, so it contains a lot of unnecessary crap and isn't valid XML. And since I don't have much experience parsing XML in PHP, I thought it would be easier to use regular expressions to extract the sections I want.

I'm almost there now: the expression Kailash wrote almost works, but it only gives the first paragraph after the heading. I just need to work out how to extract the rest of the paragraphs.

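One way to pick up every paragraph after a heading with regular expressions alone is to do it in two passes: first capture each heading together with everything up to the next `<h2>` (or the end of the input), then pull the `<p>` elements out of that captured block. This is only a sketch under stated assumptions, not the expression referred to above: the sample input is the markup from the first post, the `$sections` layout is illustrative, and it assumes `<h2>` and `<p>` elements are never nested inside one another.

```php
<?php
// Sample input, copied from the markup in the first post.
$html = <<<HTML
<h2>blah blah 1</h2>
<p>more blah blah blah</p>

<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>
HTML;

// Pass 1: heading text, then everything up to the next <h2> or end of input.
// Modifiers: s = "." also matches newlines, i = case-insensitive tag names.
preg_match_all(
    '#<h2\b[^>]*>(.*?)</h2>(.*?)(?=<h2\b[^>]*>|\z)#si',
    $html,
    $blocks,
    PREG_SET_ORDER
);

$sections = [];
foreach ($blocks as $block) {
    // Pass 2: every <p>...</p> inside this heading's block.
    preg_match_all('#<p\b[^>]*>(.*?)</p>#si', $block[2], $p);

    $sections[] = [
        // strip_tags() drops any inline markup left inside the captured text.
        'heading'    => trim(strip_tags($block[1])),
        'paragraphs' => array_map('trim', array_map('strip_tags', $p[1])),
    ];
}

print_r($sections);
```

The lookahead `(?=...)` ends each block without consuming the next `<h2>`, so the following match can start there. Regular expressions remain fragile on messy HTML, so unusual markup may still need a real parser.
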
suzanne.boyle wrote:
> The problem with using xml is that the html is coming from Word so it
> contains a lot of unnecessary crap and isn't valid xml. And since I
> don't have much experience parsing xml in php I thought it would be
> easier to use regular expressions to extract the sections I want.

You could do worse than trying XML_HTMLSax3. I've previously posted an example of using it to parse HTML:

http://tobyinkster.co.uk/blog/2007/0...table-parsing/

Note that it does not require documents to be well-formed XML.

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 2 days, 2:18.]

It'll be in the Last Place You Look
http://tobyinkster.co.uk/blog/2007/11/21/no2id/