PHP: extract links AND description from html

Posted in the alt.comp.lang.php forum, part of the PHP Programming Forums category.


10-23-2004
Nils Jansen
 

Extracting just the links from a web page is no problem for me, using the
regex /<a ([^>]*)>/i

But now I want to extract both the link and the description that stands
between the <a href=...> tag and the closing </a> tag.

As the result of the script I'm looking for, I want to get the full element:

<a href=http://www.blabla.com/test/d.html>DESCRIPTION</a>

Can anybody give me a hint on how to do this?




10-23-2004
Janwillem Borleffs
 

Nils Jansen wrote:
> As the result of the script I'm looking for, I want to get the full
> element:
>
> <a href=http://www.blabla.com/test/d.html>DESCRIPTION</a>
>
> Can anybody give me a hint on how to do this?


Try this (note: array_combine() was introduced in PHP 5; see the manual
entry for this function on php.net for a PHP 4 equivalent):

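<?php

// Fetch the content
$file = file_get_contents("http://www.php.net/");
if ($file === false) {
    die("Could not fetch the page");
}

// Construct the regular expression
// (does not accept image links, because the link text may not contain tags).
// [^>]* instead of .* keeps each match inside a single <a> tag.
$reg = '#<a\s[^>]*?href\s*=\s*(["\']?)([^"\'>\s]+)\1[^>]*>([^<>]+)</a>#i';

// Parse $file: $matches[2] holds the URLs, $matches[3] the link texts
if (preg_match_all($reg, $file, $matches)) {
    // Note: array_combine keeps only the last text for a repeated URL
    print "<pre>";
    print htmlspecialchars(print_r(array_combine($matches[2], $matches[3]), true));
    print "</pre>";
}

?>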
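
The original question asked for the full <a ...>...</a> element, which preg_match_all already provides as the whole-pattern match at index 0. The following is a minimal sketch of that idea, not necessarily what the poster intended: the sample HTML string, the $links layout, and the exact pattern are assumptions made for illustration, and PREG_SET_ORDER is used so that each match arrives as one row.

```php
<?php

// Hypothetical sample input; any HTML string would do.
$html = '<p><a href=http://www.blabla.com/test/d.html>DESCRIPTION</a> and '
      . '<a class="ext" href="http://www.php.net/">PHP home</a></p>';

// Assumed pattern: group 1 = optional quote, group 2 = URL, group 3 = link text.
$reg = '#<a\s[^>]*?href\s*=\s*(["\']?)([^"\'>\s]+)\1[^>]*>([^<>]+)</a>#i';

$links = array();
if (preg_match_all($reg, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $links[] = array(
            'tag'  => $m[0], // the complete <a ...>...</a> element
            'url'  => $m[2], // value of the href attribute
            'text' => $m[3], // text between the opening and closing tag
        );
    }
}

print_r($links);
```

Keeping one row per match also avoids losing entries when the same URL appears more than once, which can happen when URLs are used as array keys.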


HTH;
JW
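
As an alternative to the regular expression in the reply above, the same task can be approached with PHP's DOM extension, which tends to cope better with attribute order, quoting, and nested tags. This is a different technique from the one posted, shown only as a rough sketch: it assumes the DOM extension is available, and the sample HTML and the $links layout are made up for the example.

```php
<?php

// Hypothetical sample input standing in for a fetched page.
$html = '<html><body>'
      . '<a href="http://www.blabla.com/test/d.html">DESCRIPTION</a>'
      . '<a href="/img.html"><img src="x.png" alt="logo"></a>'
      . '</body></html>';

// Real-world HTML is often malformed; collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // Skip named anchors that carry no href attribute
    if (!$a->hasAttribute('href')) {
        continue;
    }
    $links[] = array(
        'url'  => $a->getAttribute('href'),
        'text' => trim($a->textContent), // empty for image-only links
    );
}

print_r($links);
```

Unlike the regex version, this sketch also returns image links; their text is simply empty, so they can be filtered out afterwards if they are not wanted.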


