This is a discussion on "Help with a regular expression" within the PHP Language forums, part of the PHP Programming Forums category.
Hi

I have used some of this code from the PHP manual, but I am bloody hopeless with regular expressions. Was hoping somebody could offer a hand.

The output of this will put the name of a form field beside name. I want to get the following but not sure how to modify the code below.

1. Field Name (to appear beside NAME:)
2. Field Type (to appear beside TYPE:)
3. Field Value (to appear beside VALUE:)

Make sense. It is part way there, just need some help finishing it.

$filename = "form-eg.php";
// Open file to read HTML with Form code
$fd = fopen ($filename, "rb");
$contents = fread ($fd, filesize ($filename));
preg_match_all ('/<input.*?name\\s*=\\s*"?([^\\s>"]*)/i', $contents, $matches); // get all input fields and attributes and values
for ($i=0; $i< count($matches[0]); $i++) {
    echo "matched: ".$matches[0][$i]."<br />\n";
    echo "NAME: ".$matches[1][$i]."<br />\n";
    echo "TYPE: ".$matches[3][$i]."<br />\n";
    echo "VALUE: ".$matches[4][$i]."<br />\n\n";
}
fclose ($fd);

I will also need to run another check for:
<select
<textarea
But I can probably figure that out from what I already have.

Thanks,
YoBro
YoBro wrote:
> I have used some of this code from the PHP manual, but I am bloody hopeless
> with regular expressions.

Although I've heard often enough that RXs are not the best tool for this
job (try a HTML or XML parser) I do very well with them myself :)

> Was hoping somebody could offer a hand.
>
> The output of this will put the name of a form field beside name.
> I want to get the following but not sure how to modify the code below.
> 1. Field Name (to appear beside NAME:)
> 2. Field Type (to appear beside TYPE:)
> 3. Field Value (to appear beside VALUE:)

But I follow a different path than you.

<?php
// initialize result data
$html_inputs = array();
$html_index = 0;

// get HTML
$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

// get all "<input ... >"s -- usually I'd group them by <form>s too
preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

// inside each "<input ... >" isolate the pairs "attr=value"
foreach ($inputs[1] as $input) {
  // once for double quoted values
  preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
  // save them
  foreach ($matches[0] as $k=>$dummy) {
    $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
  }

  // once for single quoted values
  preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
  foreach ($matches[0] as $k=>$dummy) {
    $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
  }

  // and once again for unquoted values
  preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
  foreach ($matches[0] as $k=>$dummy) {
    $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
  }

  // all three passes fill the same entry; move on to the next <input>
  ++$html_index;
}

// done, deal with them anyway I like
echo '<pre>'; print_r($html_inputs); echo '</pre>';
?>

--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
Pedro Graca wrote:
> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)

I believe the principal reason why pre-written parsers are suggested and recommended instead of impromptu regular expression "one-liners" is that the gurus who've written and developed the parsers are usually aware of and understand the rules; the "one-line" regex implementors, on the other hand -- with all due respect -- generally aren't and don't.

I'm not going to pretend I understand everything SGML; I certainly don't; I'm far too young for starters. I'd like to pass a few comments, nevertheless, which might change your mind about regular expressions for parsing (X)HTML. They changed my mind, anyway. You'll understand though, hopefully, why I haven't offered any regular expression in place of yours (no, it's not because I couldn't be bothered :-)).

(Trying to cope with shorthand markup when using regexes would be a nightmare. Unlike proper parsers, I'm going to act like a browser and ignore shorthand markup, for the time being, as it'd complicate matters even more.)

> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

There's the standard mistake: the next occurrence of ">" does not necessarily mark the end of the tag. In HTML, a ">" can appear in *quoted* attribute values; it cannot appear in unquoted attribute values. This, for example, is a valid INPUT element (I make no claims to its logicality!)

<INPUT title=">">

Also, INPUTs have no required attributes (that is, "<INPUT>" is valid), but the "+" quantifier matches *one* or more of whatever came before. To over-simplistically match INPUTs, I'd substitute "*" for "+". Since you're only wanting to match those INPUTs with explicit type, name and value attributes, however, that's inconsequential.

> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);

An SGML name begins with a name start character and is followed by zero or more name characters. You'd match a name, for HTML4.01, with the pattern

[a-zA-Z][a-zA-Z0-9._:-]*

An attribute value may be of length zero, so, again, the quantifier "*" ought to be used. And inside quoted attribute values, both "<" and ">" can appear. Alvaro G Vicario has just pointed this out too, in an article in the thread "php sticky forms", <news:1qih21wt0xy4e$.1f5ehf0s1tf5a$.dlg@40tude.net >.

> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);

Ditto.

> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);

Unquoted attribute values may only contain name characters. In HTML4.01, the pattern

[a-zA-Z0-9._:-]*

matches name characters.

Phew!

Refs.:
http://www.w3.org/TR/html401/sgml/sgmldecl.html
http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm

--
Jock
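The two points above (a ">" inside a quoted value does not end the tag, and names are limited to name characters) can be checked with a small script. This is only a sketch, not a replacement for a parser: the helper names first_match() and is_sgml_name() and the sample markup are made up for illustration, the quote-aware pattern still ignores shorthand markup, comments and scripts, and the type hints assume PHP 7.1 or later.

```php
<?php
// Pattern in the style quoted above: stops at the first ">" it meets.
$naive = '@<input[^>]*>@i';

// Quote-aware sketch: the tag body is a run of double-quoted strings,
// single-quoted strings, or single characters that are neither a quote nor ">".
$quoteAware = '@<input\b(?:"[^"]*"|\'[^\']*\'|[^\'">])*>@i';

// Hypothetical helper: return the first match of $pattern in $html, or null.
function first_match(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

// Hypothetical helper: test a string against the HTML 4.01 name pattern.
// The hyphen sits last in the class so it is a literal, not a range.
function is_sgml_name(string $s): bool
{
    return preg_match('/^[A-Za-z][A-Za-z0-9._:-]*$/D', $s) === 1;
}

$html = '<form><INPUT title=">" name="q"></form>';

echo first_match($naive, $html), "\n";      // <INPUT title=">
echo first_match($quoteAware, $html), "\n"; // <INPUT title=">" name="q">

var_dump(is_sgml_name('accept-charset'));   // bool(true)
var_dump(is_sgml_name('a=b'));              // bool(false)
```

The naive pattern cuts the tag off inside the title value, which is exactly the failure described above; the quote-aware one survives this case but is still only as good as the cases it was written for.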
John Dunlop wrote:
> Pedro Graca wrote:
>
>> Although I've heard often enough that RXs are not the best tool for this
>> job (try a HTML or XML parser) I do very well with them myself :)

> I'd like to pass a few comments, nevertheless, which might change
> your mind about regular expressions for parsing (X)HTML.

Appreciate it.

> They changed my mind, anyway.

Changed my mind, too. Will take a little longer to change my scripts.
But new scripts will not use regular expressions!

> You'll understand though, hopefully, why I
> haven't offered any regular expression in place of yours (no, it's
> not because I couldn't be bothered :-)).

Same reason I'm not changing them, I guess :-)

> (Trying to cope with shorthand markup when using regexes would be a
> nightmare. Unlike proper parsers, I'm going to act like a browser
> and ignore shorthand markup, for the time being, as it'd complicate
> matters even more.)

Don't even mention that.

(snip very good content)

Thank you John. Thank you very much.

--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
Any idea of some real-life working examples of doing it the SGML way? It's something I have never heard of before. The reference links appear to have no relevance to what I am trying to do.

There is a PHP function, xml_parse; could this be used? The documentation is light on that topic.

Thanks!

"Pedro Graca" <hexkid@hotpop.com> wrote in message news:c2an23$1rap4q$1@ID-203069.news.uni-berlin.de...
> Changed my mind, too. Will take a little longer to change my scripts.
> But new scripts will not use regular expressions!
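For a working example of the parser route, one option is PHP's bundled DOM extension rather than xml_parse, since DOMDocument::loadHTML() is built to tolerate ordinary, non-well-formed HTML. The sketch below is an illustration under assumptions, not something taken from the thread: it assumes PHP 7 or later with the DOM extension enabled, and the function name list_inputs() and the sample form are invented for the example.

```php
<?php
// Sketch: collect name/type/value for every <input> in an HTML string.
function list_inputs(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely well-formed; keep libxml's warnings quiet
    // while parsing, then restore the previous error-handling setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $fields[] = [
            'name'  => $input->getAttribute('name'),
            // HTML 4.01 treats a missing type attribute as "text"
            'type'  => $input->hasAttribute('type') ? $input->getAttribute('type') : 'text',
            'value' => $input->getAttribute('value'),
        ];
    }
    return $fields;
}

$html = '<form><input name="query" size="25">'
      . '<input type="submit" value="Search RFCs"></form>';

foreach (list_inputs($html) as $field) {
    echo "NAME: {$field['name']}\n";
    echo "TYPE: {$field['type']}\n";
    echo "VALUE: {$field['value']}\n\n";
}
```

The same loop should work for select and textarea by changing the tag name passed to getElementsByTagName(), although their values live in child nodes rather than in a value attribute.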
I (Pedro Graca) wrote:
> Changed my mind, too. Will take a little longer to change my scripts.
> But new scripts will not use regular expressions!

Ufffffff. This took longer than I expected.

The XML parser included with PHP gives errors for many of the pages I
tested (most of them were HTML pages, so it's understandable :).

I found a parser for HTML I like @ http://php-html.sourceforge.net/

#v+
<?php
include 'htmlparser.inc.php'; // Yes! I changed the name
                              // also changed short php tag

$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

$parser = new HtmlParser($contents);
while ($parser->parse()) {
  if (strtolower($parser->iNodeName) == 'input') {

    #echo "\niNodeType: "; print_r($parser->iNodeType);
    #echo "\niNodeName: "; print_r($parser->iNodeName);
    #echo "\niNodeValue: "; print_r($parser->iNodeValue);
    echo "\niNodeAttributes: "; print_r($parser->iNodeAttributes);
  }
}

echo "\n\nDone!\n";
?>
#v-

and the result of this script is:

iNodeAttributes: Array
(
    [name] => query
    [size] => 25
)

iNodeAttributes: Array
(
    [type] => submit
    [value] => Search RFCs
)

iNodeAttributes: Array
(
    [name] => display
    [size] => 9
)

iNodeAttributes: Array
(
    [type] => submit
    [value] => Display RFC By Number
)


Done!

--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
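To get from attribute arrays like the ones printed above to the NAME / TYPE / VALUE lines asked for at the start of the thread, a small formatting step is enough. This is a sketch only: format_field() is a hypothetical helper, the sample arrays are copied from the output above, treating a missing type as "text" follows the HTML 4.01 default, and the "??" operator assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: turn one attribute array into the three output lines.
function format_field(array $attrs): string
{
    // Attribute names are case-insensitive in HTML, so normalise the keys.
    $attrs = array_change_key_case($attrs, CASE_LOWER);

    $name  = $attrs['name']  ?? '';
    $type  = $attrs['type']  ?? 'text'; // HTML 4.01 default for <input>
    $value = $attrs['value'] ?? '';

    // Escape the values because the result is echoed into an HTML page.
    return 'NAME: '  . htmlspecialchars($name)  . "<br />\n"
         . 'TYPE: '  . htmlspecialchars($type)  . "<br />\n"
         . 'VALUE: ' . htmlspecialchars($value) . "<br />\n";
}

// Sample data taken from the parser output shown above.
$fields = [
    ['name' => 'query', 'size' => '25'],
    ['type' => 'submit', 'value' => 'Search RFCs'],
];

foreach ($fields as $attrs) {
    echo format_field($attrs), "\n";
}
```

Inside the parser loop, the same function could be called with the attribute array of each input node instead of the hard-coded sample data.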
Hi,

Thanks, that is very helpful.

I have tried to download this file but my browser keeps crashing when I get there.

I don't suppose, if you have a copy, you could email it to me ('htmlparser.inc.php')? To: yobro@wazzup.co.nz.

YoBro!

"Pedro Graca" <hexkid@hotpop.com> wrote in message news:c2e627$1stv3d$1@ID-203069.news.uni-berlin.de...
> I found a parser for HTML I like @ http://php-html.sourceforge.net/
YoBro top-posted:
> I have tried to download this file but my browser keeps crashing when I get
> there.
>
> I don't suppose if you have a copy you could email it to me?

Try here first :)

https://sourceforge.net/project/show...group_id=91649

--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--