Help with a regular expression

#1  03-05-2004  YoBro
Subject: Help with a regular expression

Hi

I have used some of this code from the PHP manual, but I am bloody hopeless
with regular expressions.
Was hoping somebody could offer a hand.

The output of this will put the name of a form field beside name.
I want to get the following but not sure how to modify the code below.
1. Field Name (to appear beside NAME:)
2. Field Type (to appear beside TYPE:)
3. Field Value (to appear beside VALUE:)

Does that make sense?
It is part way there, just need some help finishing it.

$filename = "form-eg.php"; // Open file to read HTML with Form code
$fd = fopen ($filename, "rb");
$contents = fread ($fd, filesize ($filename));
preg_match_all ('/<input.*?name\\s*=\\s*"?([^\\s>"]*)/i', $contents,
$matches); // get all input fields and attributes and values

for ($i=0; $i< count($matches[0]); $i++) {
echo "matched: ".$matches[0][$i]."<br />\n";
echo "NAME: ".$matches[1][$i]."<br />\n";
echo "TYPE: ".$matches[3][$i]."<br />\n";
echo "VALUE: ".$matches[4][$i]."<br />\n\n";
}

fclose ($fd);

I will also need to run another check for :
<select
<textarea

But I can probably figure that out from what I already have.

Thanks,

YoBro


#2  03-05-2004  Pedro Graca
Subject: Re: Help with a regular expression

YoBro wrote:
> I have used some of this code from the PHP manual, but I am bloody hopeless
> with regular expressions.


Although I've heard often enough that RXs are not the best tool for this
job (try a HTML or XML parser) I do very well with them myself :)

> Was hoping somebody could offer a hand.
>
> The output of this will put the name of a form field beside name.
> I want to get the following but not sure how to modify the code below.
> 1. Field Name (to appear beside NAME:)
> 2. Field Type (to appear beside TYPE:)
> 3. Field Value (to appear beside VALUE:)


But I follow a different path than you.

<?php
// initialize result data: one entry per <input>, keyed by attribute name
$html_inputs = array();
$html_index = 0;

// get HTML
$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

// get all "<input ... >"s -- usually I'd group them by <form>s too
preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

// inside each "<input ... >" isolate the pairs "attr=value"
foreach ($inputs[1] as $input) {
    // once for double quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
    // save them
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // once for single quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // and once again for unquoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // move on only after all three passes, so the attributes
    // of one <input> end up in the same entry
    ++$html_index;
}

// done, deal with them anyway I like
echo '<pre>'; print_r($html_inputs); echo '</pre>';
?>
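
To get from an array like that to the NAME / TYPE / VALUE lines the original question asked for, something along these lines might do. It is only a sketch: it assumes one array entry per input, keyed by attribute name exactly as written in the HTML, and both the sample data and the helper name describe_input() are made up for illustration.

```php
<?php
// Hypothetical sample data: one entry per <input>, keyed by attribute name.
$html_inputs = array(
    array('name' => 'query', 'size' => '25'),
    array('type' => 'submit', 'value' => 'Search RFCs'),
);

// describe_input() is an illustrative helper, not part of any library.
function describe_input(array $attrs)
{
    $out = '';
    foreach (array('name', 'type', 'value') as $key) {
        // attributes missing from the tag are printed as empty
        $val = isset($attrs[$key]) ? $attrs[$key] : '';
        $out .= strtoupper($key) . ': ' . htmlspecialchars($val) . "<br />\n";
    }
    return $out;
}

foreach ($html_inputs as $attrs) {
    echo describe_input($attrs), "\n";
}
```

Attribute names written in upper case in the page (NAME=...) would need lower-casing before the lookup; the sketch leaves that out.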
--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
#3  03-05-2004  John Dunlop
Subject: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

Pedro Graca wrote:

> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)


I believe the principal reason why pre-written parsers are suggested
and recommended instead of impromptu regular expression "one-liners"
is that the gurus who've written and developed the parsers are
usually aware of and understand the rules; the "one-line" regex
implementors, on the other hand -- with all due respect -- generally
aren't and don't. I'm not going to pretend I understand everything
SGML; I certainly don't; I'm far too young for starters.

I'd like to pass a few comments, nevertheless, which might change
your mind about regular expressions for parsing (X)HTML. They
changed my mind, anyway. You'll understand though, hopefully, why I
haven't offered any regular expression in place of yours (no, it's
not because I couldn't be bothered :-)).

(Trying to cope with shorthand markup when using regexes would be a
nightmare. Unlike proper parsers, I'm going to act like a browser
and ignore shorthand markup, for the time being, as it'd complicate
matters even more.)

> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);


There's the standard mistake: the next occurrence of ">" does not
necessarily mark the end of the tag. In HTML, a ">" can appear in
*quoted* attribute values; it cannot appear in unquoted attribute
values. This, for example, is a valid INPUT element (I make no
claims to its logicality!)

<INPUT title=">">

Also, INPUTs have no required attributes (that is, "<INPUT>" is
valid), but the "+" quantifier matches *one* or more of whatever came
before. To over-simplistically match INPUTs, I'd substitute "*" for
"+". Since you're only wanting to match those INPUTs with explicit
type, name and value attributes, however, that's inconsequential.
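
Both points are easy to check. The sketch below runs the quoted pattern, and a variant with "*" in place of "+", against two test tags; the tags and variable names are made up for the demonstration.

```php
<?php
// The pattern from the quoted script, and the same pattern with "*" for "+".
$plus = '@(<input[^>]+>)@Ui';
$star = '@(<input[^>]*>)@Ui';

// Hypothetical test tags, chosen to hit both problems.
$quoted_gt = '<INPUT title=">" name="q">';
$bare      = '<INPUT>';

preg_match($plus, $quoted_gt, $m);
echo $m[1], "\n";                       // prints: <INPUT title=">

$bare_plus = preg_match($plus, $bare);  // 0: "+" needs at least one character
$bare_star = preg_match($star, $bare);  // 1: "*" accepts the bare tag
echo $bare_plus, ' ', $bare_star, "\n"; // prints: 0 1
```

The first match stops at the ">" inside the title value, so the name attribute never reaches the second stage at all.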

> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);


An SGML name begins with a name start character and is followed by
zero or more name characters. You'd match a name, for HTML4.01, with
the pattern

[a-zA-Z][a-zA-Z0-9._:-]*

An attribute value may be of length zero, so, again, the quantifier
"*" ought to be used. And inside quoted attribute values, both "<"
and ">" can appear. Alvaro G Vicario has just pointed this out too,
in an article in the thread "php sticky forms",

<news:1qih21wt0xy4e$.1f5ehf0s1tf5a$.dlg@40tude.net >.

> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);


Ditto.
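
The zero-length case can be shown with the double-quoted pattern; the single-quoted one behaves the same way. This is a sketch only: the test tag and the helper attr_pairs() are invented for the purpose.

```php
<?php
// Hypothetical tag with an empty value attribute.
$input = '<input name="q" value="">';

// Double-quoted pattern from the quoted script ("+" on the value)...
$plus = '@(([^\s<>]+)\s*=\s*"([^"<>]+)")@';
// ...and the same pattern with "*" so a zero-length value can match.
$star = '@(([^\s<>]+)\s*=\s*"([^"<>]*)")@';

// attr_pairs() is an illustrative helper: attribute name => value.
function attr_pairs($pattern, $input)
{
    $pairs = array();
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $pairs[$m[2]] = $m[3];
    }
    return $pairs;
}

print_r(attr_pairs($plus, $input));   // only name => q
print_r(attr_pairs($star, $input));   // name => q, plus value => empty string
```

With "+", the value attribute silently disappears from the result, which is hard to tell apart from an input that never had one.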

> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);


Unquoted attribute values may only contain name characters. In
HTML4.01, the pattern

[a-zA-Z0-9._:-]*

matches name characters.
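
One detail worth checking when a list of name characters is turned into a PCRE character class: a hyphen placed between two other characters is read as a range, not as a literal hyphen. The sketch below compares the two placements; the variable names and test strings are made up.

```php
<?php
// Hyphen between "." and "_": PCRE reads ".-_" as the range 0x2E to 0x5F.
$range   = '/^[a-zA-Z0-9.-_:]*$/';
// Hyphen last in the class: it is taken literally.
$literal = '/^[a-zA-Z0-9._:-]*$/';

$r_gt = preg_match($range, 'a>b');    // 1: ">" (0x3E) falls inside the range
$l_gt = preg_match($literal, 'a>b');  // 0: ">" is rejected
$r_hy = preg_match($range, 'a-b');    // 0: "-" (0x2D) lies just outside the range
$l_hy = preg_match($literal, 'a-b');  // 1: "-" is accepted as a name character

echo "$r_gt $l_gt $r_hy $l_hy\n";     // prints: 1 0 0 1
```

So in an actual regex the hyphen belongs at the end (or the start) of the class, or needs a backslash in front of it.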

Phew!

Refs.:

http://www.w3.org/TR/html401/sgml/sgmldecl.html
http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm

--
Jock
#4  03-05-2004  Pedro Graca
Subject: Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

John Dunlop wrote:
> Pedro Graca wrote:
>
>> Although I've heard often enough that RXs are not the best tool for this
>> job (try a HTML or XML parser) I do very well with them myself :)


> I'd like to pass a few comments, nevertheless, which might change
> your mind about regular expressions for parsing (X)HTML.


Appreciate it.

> They changed my mind, anyway.


Changed my mind, too. Will take a little longer to change my scripts.
But new scripts will not use regular expressions!

> You'll understand though, hopefully, why I
> haven't offered any regular expression in place of yours (no, it's
> not because I couldn't be bothered :-)).


Same reason I'm not changing them, I guess :-)

> (Trying to cope with shorthand markup when using regexes would be a
> nightmare. Unlike proper parsers, I'm going to act like a browser
> and ignore shorthand markup, for the time being, as it'd complicate
> matters even more.)


Don't even mention that.

(snip very good content)
Thank you John. Thank you very much.
#5  03-05-2004  YoBro
Subject: Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

Any idea where I could find some real-life working examples of doing it the
SGML way? It is something I have never heard of before.

The reference links don't appear to be relevant to what I am trying to do.

There is a PHP function, xml_parse. Could this be used?
The documentation is light on that topic.

Thanks!
#6  03-07-2004  Pedro Graca
Subject: Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

I (Pedro Graca) wrote:
> Changed my mind, too. Will take a little longer to change my scripts.
> But new scripts will not use regular expressions!


Ufffffff. This took longer than I expected.

The XML parser included with PHP gives errors for many of the pages I
tested (most of them were HTML pages, so it's understandable :).

I found a parser for HTML I like @ http://php-html.sourceforge.net/

#v+
<?php
include 'htmlparser.inc.php'; // Yes! I changed the name
// also changed short php tag

$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

$parser = new HtmlParser($contents);
while ($parser->parse()) {
    if (strtolower($parser->iNodeName) == 'input') {

        #echo "\niNodeType: "; print_r($parser->iNodeType);
        #echo "\niNodeName: "; print_r($parser->iNodeName);
        #echo "\niNodeValue: "; print_r($parser->iNodeValue);
        echo "\niNodeAttributes: "; print_r($parser->iNodeAttributes);
    }
}

echo "\n\nDone!\n";
?>
#v-

and the result of this script is:

iNodeAttributes: Array
(
[name] => query
[size] => 25
)

iNodeAttributes: Array
(
[type] => submit
[value] => Search RFCs
)

iNodeAttributes: Array
(
[name] => display
[size] => 9
)

iNodeAttributes: Array
(
[type] => submit
[value] => Display RFC By Number
)


Done!
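
For comparison, PHP 5 and later bundle a DOM extension whose DOMDocument::loadHTML() is built for real-world HTML, so a similar listing should be possible without a third-party include. The following is a minimal sketch, assuming the dom extension is enabled; the sample markup is made up and stands in for the downloaded page.

```php
<?php
// Hypothetical sample form; the script above fetched a live page instead.
$html = '<form><input name="query" size="25">'
      . '<input type="submit" value="Search RFCs">'
      . '<INPUT title=">" name="odd"></form>';

$doc = new DOMDocument();
// Collect libxml's complaints about sloppy HTML instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$fields = array();
foreach ($doc->getElementsByTagName('input') as $input) {
    // getAttribute() returns an empty string for a missing attribute
    $fields[] = array(
        'name'  => $input->getAttribute('name'),
        'type'  => $input->getAttribute('type'),
        'value' => $input->getAttribute('value'),
    );
}

foreach ($fields as $f) {
    echo "NAME: {$f['name']}<br />\n";
    echo "TYPE: {$f['type']}<br />\n";
    echo "VALUE: {$f['value']}<br />\n\n";
}
```

The same loop with 'select' or 'textarea' passed to getElementsByTagName() would cover the other form controls mentioned at the start of the thread.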
#7  03-07-2004  YoBro
Subject: Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

Hi,

Thanks, that is very helpful.
I have tried to download this file but my browser keeps crashing when I get
there.

I don't suppose if you have a copy you could email it to me?
('htmlparser.inc.php')
to: yobro@wazzup.co.nz.

YoBro!
#8  03-07-2004  Pedro Graca
Subject: Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

YoBro top-posted:
> I have tried to download this file but my browser keeps crashing when I get
> there.
>
> I don't suppose if you have a copy you could email it to me?


Try here first :)
https://sourceforge.net/project/show...group_id=91649