Regex help

  #1 (permalink)  
Old 10-15-2007
Jerry Stuckle
 
Posts: n/a
Default Regex help

OK, I give up here. I am DEFINITELY not a Regex expert, and have been
working on this for hours with no luck.

Basically I need to parse a page for certain information which will be
fed back into CURL to post to a site. I need to find four types of tags
on the page:

<input type=hidden name=a1 value=b1>
<input type=text name=a2>
<input type=submit name=a3 value=b3>
<select name=a4>

I don't need any other tags.

From the hidden and submit types, I need name and value. From the text
and select types, I just need the name.

I can assume the attributes will always show up in this order, but there
may be other things between the < and > delimiters. Additionally, the
actual type and name may have single or double quotes around them, or
neither.

Does anyone have some code for this? It doesn't have to be all one regex.

TIA.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

  #2 (permalink)  
Old 10-15-2007
Steve
 
Posts: n/a
Default Re: Regex help


"Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
news:KaadnQnnGt0WT4_anZ2dnUVZ_tajnZ2d@comcast.com. ..
> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
> working on this for hours with no luck.
>
> Basically I need to parse a page for certain information which will be fed
> back into CURL to post to a site. I need to find four types of tags on
> the page:
>
> <input type=hidden name=a1 value=b1>
> <input type=text name=a2>
> <input type=submit name=a3 value=b3>
> <select name=a4>
>
> I don't need any other tags.
>
> From the hidden and submit types, I need name and value. From the text
> and select types, I just need the name.
>
> I can assume the attributes will always show up in this order, but there
> may be other things between the < and > delimiters. Additionally, the
> actual type and name may have single or double quotes around them, or
> neither.
>
> Does anyone have some code for this? It doesn't have to be all one regex.


alright, jer. let's see what we can do...

here's an eyeballed attempt:

<(select\s?[^>].*?)|(input\s[^t]*?type\s*?=\s?('|"|\s)(hidden|text|submit)\3[^>].*?)>

to keep it easier, i'd think about using that to get your general matches.
iterating through those, i'd apply another regex to break out the name,
type, and value. you could very well catch it all in the above, however,
it's not as straightforward and hence, not easily maintained. if you need
additional help on writing this, let me know. i'll pseudo-code the whole
enchilada if you want. this should be sufficient in getting only those tags
you listed above...which is a good start.

btw, make the search case-INsensitive.
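
The two-pass idea described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names tag_attr() and form_fields() and the sample markup are made up, and it assumes no attribute value contains a ">" character. The first pass collects every input and select start tag, and the second pass pulls individual attributes out of each tag, so quoting style and extra attributes do not matter.

```php
<?php
// Return one attribute's value from a single start tag, or null if absent.
// Accepts double-quoted, single-quoted and unquoted values.
function tag_attr(string $tag, string $attr): ?string
{
    $re = '/\s' . preg_quote($attr, '/')
        . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    return preg_match($re, $tag, $m) ? $m[1] : null;
}

// Two passes: collect the candidate tags, then inspect each one.
function form_fields(string $html): array
{
    $wanted = ['hidden', 'text', 'submit'];
    $fields = [];
    preg_match_all('/<(input|select)\b[^>]*>/i', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as [$tag, $element]) {
        $name = tag_attr($tag, 'name');
        if ($name === null) {
            continue;
        }
        if (strtolower($element) === 'select') {
            $fields[] = ['type' => 'select', 'name' => $name];
            continue;
        }
        // Assumption: an input with no type attribute is ignored here.
        $type = strtolower(tag_attr($tag, 'type') ?? '');
        if (!in_array($type, $wanted, true)) {
            continue;
        }
        $field = ['type' => $type, 'name' => $name];
        if ($type !== 'text') {
            $field['value'] = tag_attr($tag, 'value') ?? '';
        }
        $fields[] = $field;
    }
    return $fields;
}

// Hypothetical sample markup, not the page discussed in the thread.
$html = '<input type=hidden name=a1 value=b1><input type="text" name="a2">'
      . "<input class=x type='submit' name=a3 value=b3><select name=a4>";
print_r(form_fields($html));
```

Keeping the per-attribute pattern separate from the tag pattern means either one can be adjusted without touching the other, which is the maintainability point made above.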


  #3 (permalink)  
Old 10-15-2007
Captain Paralytic
 
Posts: n/a
Default Re: Regex help

On 15 Oct, 03:37, Jerry Stuckle <jstuck...@attglobal.net> wrote:
> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
> working on this for hours with no luck.
>
> Basically I need to parse a page for certain information which will be
> fed back into CURL to post to a site. I need to find four types of tags
> on the page:
>
> <input type=hidden name=a1 value=b1>
> <input type=text name=a2>
> <input type=submit name=a3 value=b3>
> <select name=a4>
>
> I don't need any other tags.
>
> From the hidden and submit types, I need name and value. From the text
> and select types, I just need the name.
>
> I can assume the attributes will always show up in this order, but there
> may be other things between the < and > delimiters. Additionally, the
> actual type and name may have single or double quotes around them, or
> neither.
>
> Does anyone have some code for this? It doesn't have to be all one regex.
>
> TIA.
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstuck...@attglobal.net
> ==================


Could you use the php dom functionality for this?

Wouldn't it be good if php had the equivalent of
getElementsByTagName()!

  #4 (permalink)  
Old 10-15-2007
Jerry Stuckle
 
Posts: n/a
Default Re: Regex help

Steve wrote:
> "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
> news:KaadnQnnGt0WT4_anZ2dnUVZ_tajnZ2d@comcast.com. ..
>> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
>> working on this for hours with no luck.
>>
>> Basically I need to parse a page for certain information which will be fed
>> back into CURL to post to a site. I need to find four types of tags on
>> the page:
>>
>> <input type=hidden name=a1 value=b1>
>> <input type=text name=a2>
>> <input type=submit name=a3 value=b3>
>> <select name=a4>
>>
>> I don't need any other tags.
>>
>> From the hidden and submit types, I need name and value. From the text
>> and select types, I just need the name.
>>
>> I can assume the attributes will always show up in this order, but there
>> may be other things between the < and > delimiters. Additionally, the
>> actual type and name may have single or double quotes around them, or
>> neither.
>>
>> Does anyone have some code for this? It doesn't have to be all one regex.

>
> alright, jer. let's see what we can do...
>
> here's an eyeballed attempt:
>
> <(select\s?[^>].*?)|(input\s[^t]*?type\s*?=\s?('|"|\s)(hidden|text|submit)\3[^>].*?)>
>
> to keep it easier, i'd think about using that to get your general matches.
> iterating through those, i'd apply another regex to break out the name,
> type, and value. you could very well catch it all in the above, however,
> it's not as straightforward and hence, not easily maintained. if you need
> additional help on writing this, let me know. i'll pseudo-code the whole
> enchilada if you want. this should be sufficient in getting only those tags
> you listed above...which is a good start.
>
> btw, make the search case-INsensitive.
>
>
>


Hi, Steve,

Yep, it's a start. Some problems (output below), but I think it will
get me a little farther.

And you're right, I already gave up on getting everything in one pass.
I was thinking of trying to just get everything for a single element
type (i.e. all <input type=text ...> elements), but this gives me
another idea, also.

And the output from the first try:

Array
(
[0] => Array
(
[0] => <select n
[1] => <select n
[2] => <select n
)

[1] => Array
(
[0] => select n
[1] => select n
[2] => select n
)

[2] => Array
(
[0] =>
[1] =>
[2] =>
)

[3] => Array
(
[0] =>
[1] =>
[2] =>
)

[4] => Array
(
[0] =>
[1] =>
[2] =>
)

)
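
The truncated "<select n" matches above are consistent with two properties of the pattern as posted. The top-level "|" splits the whole expression, so the leading "<" belongs only to the select branch and the trailing ">" only to the input branch. In addition, the lazy ".*?" at the end of the select branch has nothing after it and is therefore free to match nothing. The input branch also insists on a quote or whitespace before the type value, so an unquoted type=hidden never matches. A possible fix, sketched here on made-up sample markup, is to wrap both branches in one group and use [^>]* to run to the end of the tag:

```php
<?php
// Hypothetical sample markup, not the page discussed in the thread.
$html = '<form><select name="a4"><input type=hidden name=a1 value=b1>'
      . '<input type="text" name="a2"><input type=\'submit\' name=a3 value=b3>'
      . '<input type=checkbox name=skip></form>';

// Pattern as posted: "<" is part of the first branch only, and the lazy
// ".*?" closing that branch may match zero characters.
$original = '/<(select\s?[^>].*?)|(input\s[^t]*?type\s*?=\s?(\'|"|\s)(hidden|text|submit)\3[^>].*?)>/i';
preg_match_all($original, $html, $before);

// Grouped alternation: "<" and ">" now wrap both branches, the quote
// around the type value is optional, and [^>]* consumes the rest of the tag.
$grouped = '/<(?:select\b[^>]*|input\b[^>]*?\btype\s*=\s*["\']?(?:hidden|text|submit)\b[^>]*)>/i';
preg_match_all($grouped, $html, $after);

print_r($before[0]);   // first entry is the truncated "<select n"
print_r($after[0]);    // whole tags; the checkbox input is skipped
```

This only addresses the first pass (finding the tags). It still assumes that no attribute value contains ">".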



--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

  #5 (permalink)  
Old 10-15-2007
Jerry Stuckle
 
Posts: n/a
Default Re: Regex help

Captain Paralytic wrote:
> On 15 Oct, 03:37, Jerry Stuckle <jstuck...@attglobal.net> wrote:
>> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
>> working on this for hours with no luck.
>>
>> Basically I need to parse a page for certain information which will be
>> fed back into CURL to post to a site. I need to find four types of tags
>> on the page:
>>
>> <input type=hidden name=a1 value=b1>
>> <input type=text name=a2>
>> <input type=submit name=a3 value=b3>
>> <select name=a4>
>>
>> I don't need any other tags.
>>
>> From the hidden and submit types, I need name and value. From the text
>> and select types, I just need the name.
>>
>> I can assume the attributes will always show up in this order, but there
>> may be other things between the < and > delimiters. Additionally, the
>> actual type and name may have single or double quotes around them, or
>> neither.
>>
>> Does anyone have some code for this? It doesn't have to be all one regex.
>>
>> TIA.
>>
>> --
>> ==================
>> Remove the "x" from my email address
>> Jerry Stuckle
>> JDS Computer Training Corp.
>> jstuck...@attglobal.net
>> ==================

>
> Could you use the php dom functionality for this?
>
> Wouldn't it be good if php had the equivalent of
> getElementsByTagName()!
>
>


Hi, Paul,

How I wish I could - it was the first thing I tried. However, this page
is not well formed html, and DOM throws up all over it.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

  #6 (permalink)  
Old 10-15-2007
Captain Paralytic
 
Posts: n/a
Default Re: Regex help

On 15 Oct, 11:17, Jerry Stuckle <jstuck...@attglobal.net> wrote:
> Captain Paralytic wrote:
> > On 15 Oct, 03:37, Jerry Stuckle <jstuck...@attglobal.net> wrote:
> >> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
> >> working on this for hours with no luck.

>
> >> Basically I need to parse a page for certain information which will be
> >> fed back into CURL to post to a site. I need to find four types of tags
> >> on the page:

>
> >> <input type=hidden name=a1 value=b1>
> >> <input type=text name=a2>
> >> <input type=submit name=a3 value=b3>
> >> <select name=a4>

>
> >> I don't need any other tags.

>
> >> From the hidden and submit types, I need name and value. From the text
> >> and select types, I just need the name.

>
> >> I can assume the attributes will always show up in this order, but there
> >> may be other things between the < and > delimiters. Additionally, the
> >> actual type and name may have single or double quotes around them, or
> >> neither.

>
> >> Does anyone have some code for this? It doesn't have to be all one regex.

>
> >> TIA.

>
> >> --
> >> ==================
> >> Remove the "x" from my email address
> >> Jerry Stuckle
> >> JDS Computer Training Corp.
> >> jstuck...@attglobal.net
> >> ==================

>
> > Could you use the php dom functionality for this?

>
> > Wouldn't it be good if php had the equivalent of
> > getElementsByTagName()!

>
> Hi, Paul,
>
> How I wish I could - it was the first thing I tried. However, this page
> is not well formed html, and DOM throws up all over it.
>
> --
> ==================
> Remove the "x" from my email address
> Jerry Stuckle
> JDS Computer Training Corp.
> jstuck...@attglobal.net
> ==================


Of course, when I said: "Wouldn't it be good if php had the equivalent
of getElementsByTagName()!", I meant that it would be good if its
version was as tolerant as javascript's one.
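
On the tolerance point: PHP's DOM extension has two parsers, and DOMDocument::loadHTML() uses libxml's HTML parser, which is considerably more forgiving than load()/loadXML(). Whether it copes with the page in question is unknown, so the following is only a sketch on made-up sloppy markup. It suppresses the parser warnings with libxml_use_internal_errors() and then reads the name attributes through getElementsByTagName().

```php
<?php
// Hypothetical, deliberately sloppy markup: no html/body, unquoted
// attributes, an unclosed <p> and a stray </b>.
$html = '<form><input type=hidden name=a1 value=b1><p>unclosed paragraph'
      . '<select name=a4></select></b><input type=submit name=a3 value=b3></form>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);              // HTML parser, not the strict XML one
libxml_clear_errors();

$names = [];
foreach (['input', 'select'] as $tagName) {
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        $names[] = $node->getAttribute('name');
    }
}
print_r($names);
```

If the parser recovers in a way that drops or moves the elements of interest, a regex pass over the raw text remains the fallback.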

  #7 (permalink)  
Old 10-15-2007
Steve
 
Posts: n/a
Default Re: Regex help


"Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
news:K-qdnTSkY4NaoI7anZ2dnUVZ_j6dnZ2d@comcast.com...
> Steve wrote:
>> "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
>> news:KaadnQnnGt0WT4_anZ2dnUVZ_tajnZ2d@comcast.com. ..
>>> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
>>> working on this for hours with no luck.
>>>
>>> Basically I need to parse a page for certain information which will be
>>> fed back into CURL to post to a site. I need to find four types of tags
>>> on the page:
>>>
>>> <input type=hidden name=a1 value=b1>
>>> <input type=text name=a2>
>>> <input type=submit name=a3 value=b3>
>>> <select name=a4>
>>>
>>> I don't need any other tags.
>>>
>>> From the hidden and submit types, I need name and value. From the text
>>> and select types, I just need the name.
>>>
>>> I can assume the attributes will always show up in this order, but there
>>> may be other things between the < and > delimiters. Additionally, the
>>> actual type and name may have single or double quotes around them, or
>>> neither.
>>>
>>> Does anyone have some code for this? It doesn't have to be all one
>>> regex.

>>
>> alright, jer. let's see what we can do...
>>
>> here's an eyeballed attempt:
>>
>> <(select\s?[^>].*?)|(input\s[^t]*?type\s*?=\s?('|"|\s)(hidden|text|submit)\3[^>].*?)>
>>
>> to keep it easier, i'd think about using that to get your general
>> matches. iterating through those, i'd apply another regex to break out
>> the name, type, and value. you could very well catch it all in the above,
>> however, it's not as straightforward and hence, not easily maintained. if
>> you need additional help on writing this, let me know. i'll pseudo-code
>> the whole enchilada if you want. this should be sufficient in getting
>> only those tags you listed above...which is a good start.
>>
>> btw, make the search case-INsensitive.

>
> Hi, Steve,
>
> Yep, it's a start. Some problems (output below), but I think it will get
> me a little farther.
>
> And you're right, I already gave up on getting everything in one pass. I
> was thinking of trying to just get everything for a single element type
> (i.e. all <input type=text ...> elements), but this gives me another idea,
> also.
>
> And the output from the first try:
>
> Array
> (
> [0] => Array
> (
> [0] => <select n
> [1] => <select n
> [2] => <select n
> )
>
> [1] => Array
> (
> [0] => select n
> [1] => select n
> [2] => select n
> )
>
> [2] => Array
> (
> [0] =>
> [1] =>
> [2] =>
> )
>
> [3] => Array
> (
> [0] =>
> [1] =>
> [2] =>
> )
>
> [4] => Array
> (
> [0] =>
> [1] =>
> [2] =>
> )
>
> )


well, that's not so good a start! i'll break out the old regex ide and fix
that...if you want.


  #8 (permalink)  
Old 10-15-2007
Jerry Stuckle
 
Posts: n/a
Default Re: Regex help

Steve wrote:
> "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
> news:K-qdnTSkY4NaoI7anZ2dnUVZ_j6dnZ2d@comcast.com...
>> Steve wrote:
>>> "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
>>> news:KaadnQnnGt0WT4_anZ2dnUVZ_tajnZ2d@comcast.com. ..
>>>> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
>>>> working on this for hours with no luck.
>>>>
>>>> Basically I need to parse a page for certain information which will be
>>>> fed back into CURL to post to a site. I need to find four types of tags
>>>> on the page:
>>>>
>>>> <input type=hidden name=a1 value=b1>
>>>> <input type=text name=a2>
>>>> <input type=submit name=a3 value=b3>
>>>> <select name=a4>
>>>>
>>>> I don't need any other tags.
>>>>
>>>> From the hidden and submit types, I need name and value. From the text
>>>> and select types, I just need the name.
>>>>
>>>> I can assume the attributes will always show up in this order, but there
>>>> may be other things between the < and > delimiters. Additionally, the
>>>> actual type and name may have single or double quotes around them, or
>>>> neither.
>>>>
>>>> Does anyone have some code for this? It doesn't have to be all one
>>>> regex.
>>> alright, jer. let's see what we can do...
>>>
>>> here's an eyeballed attempt:
>>>
>>> <(select\s?[^>].*?)|(input\s[^t]*?type\s*?=\s?('|"|\s)(hidden|text|submit)\3[^>].*?)>
>>>
>>> to keep it easier, i'd think about using that to get your general
>>> matches. iterating through those, i'd apply another regex to break out
>>> the name, type, and value. you could very well catch it all in the above,
>>> however, it's not as straightforward and hence, not easily maintained. if
>>> you need additional help on writing this, let me know. i'll pseudo-code
>>> the whole enchilada if you want. this should be sufficient in getting
>>> only those tags you listed above...which is a good start.
>>>
>>> btw, make the search case-INsensitive.

>> Hi, Steve,
>>
>> Yep, it's a start. Some problems (output below), but I think it will get
>> me a little farther.
>>
>> And you're right, I already gave up on getting everything in one pass. I
>> was thinking of trying to just get everything for a single element type
>> (i.e. all <input type=text ...> elements), but this gives me another idea,
>> also.
>>
>> And the output from the first try:
>>
>> Array
>> (
>> [0] => Array
>> (
>> [0] => <select n
>> [1] => <select n
>> [2] => <select n
>> )
>>
>> [1] => Array
>> (
>> [0] => select n
>> [1] => select n
>> [2] => select n
>> )
>>
>> [2] => Array
>> (
>> [0] =>
>> [1] =>
>> [2] =>
>> )
>>
>> [3] => Array
>> (
>> [0] =>
>> [1] =>
>> [2] =>
>> )
>>
>> [4] => Array
>> (
>> [0] =>
>> [1] =>
>> [2] =>
>> )
>>
>> )

>
> well, that's not so good a start! i'll break out the old regex ide and fix
> that...if you want.
>
>
>


If you have the time, I would appreciate it. Otherwise I can struggle
through this myself :-)

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

  #9 (permalink)  
Old 10-15-2007
Jerry Stuckle
 
Posts: n/a
Default Re: Regex help

Captain Paralytic wrote:
> On 15 Oct, 11:17, Jerry Stuckle <jstuck...@attglobal.net> wrote:
>> Captain Paralytic wrote:
>>> On 15 Oct, 03:37, Jerry Stuckle <jstuck...@attglobal.net> wrote:
>>>> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
>>>> working on this for hours with no luck.
>>>> Basically I need to parse a page for certain information which will be
>>>> fed back into CURL to post to a site. I need to find four types of tags
>>>> on the page:
>>>> <input type=hidden name=a1 value=b1>
>>>> <input type=text name=a2>
>>>> <input type=submit name=a3 value=b3>
>>>> <select name=a4>
>>>> I don't need any other tags.
>>>> From the hidden and submit types, I need name and value. From the text
>>>> and select types, I just need the name.
>>>> I can assume the attributes will always show up in this order, but there
>>>> may be other things between the < and > delimiters. Additionally, the
>>>> actual type and name may have single or double quotes around them, or
>>>> neither.
>>>> Does anyone have some code for this? It doesn't have to be all one regex.
>>>> TIA.
>>>> --
>>>> ==================
>>>> Remove the "x" from my email address
>>>> Jerry Stuckle
>>>> JDS Computer Training Corp.
>>>> jstuck...@attglobal.net
>>>> ==================
>>> Could you use the php dom functionality for this?
>>> Wouldn't it be good if php had the equivalent of
>>> getElementsByTagName()!

>> Hi, Paul,
>>
>> How I wish I could - it was the first thing I tried. However, this page
>> is not well formed html, and DOM throws up all over it.
>>
>> --
>> ==================
>> Remove the "x" from my email address
>> Jerry Stuckle
>> JDS Computer Training Corp.
>> jstuck...@attglobal.net
>> ==================

>
> Of course, when I said: "Wouldn't it be good if php had the equivalent
> of getElementsByTagName()!", I meant that it would be good if its
> version was as tolerant as javascript's one.
>
>


Very true, Paul!

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

  #10 (permalink)  
Old 10-15-2007
Steve
 
Posts: n/a
Default Re: Regex help


"Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
news:u9WdnU2yhZ2Q5o7anZ2dnUVZ_trinZ2d@comcast.com. ..
> Steve wrote:
>> "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
>> news:K-qdnTSkY4NaoI7anZ2dnUVZ_j6dnZ2d@comcast.com...
>>> Steve wrote:
>>>> "Jerry Stuckle" <jstucklex@attglobal.net> wrote in message
>>>> news:KaadnQnnGt0WT4_anZ2dnUVZ_tajnZ2d@comcast.com. ..
>>>>> OK, I give up here. I am DEFINITELY not a Regex expert, and have been
>>>>> working on this for hours with no luck.
>>>>>
>>>>> Basically I need to parse a page for certain information which will be
>>>>> fed back into CURL to post to a site. I need to find four types of
>>>>> tags on the page:
>>>>>
>>>>> <input type=hidden name=a1 value=b1>
>>>>> <input type=text name=a2>
>>>>> <input type=submit name=a3 value=b3>
>>>>> <select name=a4>
>>>>>
>>>>> I don't need any other tags.
>>>>>
>>>>> From the hidden and submit types, I need name and value. From the
>>>>> text and select types, I just need the name.
>>>>>
>>>>> I can assume the attributes will always show up in this order, but
>>>>> there may be other things between the < and > delimiters.
>>>>> Additionally, the actual type and name may have single or double
>>>>> quotes around them, or neither.
>>>>>
>>>>> Does anyone have some code for this? It doesn't have to be all one
>>>>> regex.
>>>> alright, jer. let's see what we can do...
>>>>
>>>> here's an eyeballed attempt:
>>>>
>>>> <(select\s?[^>].*?)|(input\s[^t]*?type\s*?=\s?('|"|\s)(hidden|text|submit)\3[^>].*?)>
>>>>
>>>> to keep it easier, i'd think about using that to get your general
>>>> matches. iterating through those, i'd apply another regex to break out
>>>> the name, type, and value. you could very well catch it all in the
>>>> above, however, it's not as straightforward and hence, not easily
>>>> maintained. if you need additional help on writing this, let me know.
>>>> i'll pseudo-code the whole enchilada if you want. this should be
>>>> sufficient in getting only those tags you listed above...which is a
>>>> good start.
>>>>
>>>> btw, make the search case-INsensitive.
>>> Hi, Steve,
>>>
>>> Yep, it's a start. Some problems (output below), but I think it will
>>> get me a little farther.
>>>
>>> And you're right, I already gave up on getting everything in one pass. I
>>> was thinking of trying to just get everything for a single element type
>>> (i.e. all <input type=text ...> elements), but this gives me another
>>> idea, also.
>>>
>>> And the output from the first try:
>>>
>>> Array
>>> (
>>> [0] => Array
>>> (
>>> [0] => <select n
>>> [1] => <select n
>>> [2] => <select n
>>> )
>>>
>>> [1] => Array
>>> (
>>> [0] => select n
>>> [1] => select n
>>> [2] => select n
>>> )
>>>
>>> [2] => Array
>>> (
>>> [0] =>
>>> [1] =>
>>> [2] =>
>>> )
>>>
>>> [3] => Array
>>> (
>>> [0] =>
>>> [1] =>
>>> [2] =>
>>> )
>>>
>>> [4] => Array
>>> (
>>> [0] =>
>>> [1] =>
>>> [2] =>
>>> )
>>>
>>> )

>>
>> well, that's not so good a start! i'll break out the old regex ide and fix
>> that...if you want.

>
> If you have the time, I would appreciate it. Otherwise I can struggle
> through this myself :-)


ok, here's the one to get the select:

(select)\s*?[^n].*?(name)\s*?=\s*?(?:\'|")?([^\3>]*)?\3?\s*?[^>]

here's the one to break out the inputs and capture each type, name, and
value:

(input)\s*?[^n].*?(?:(name|type|value)\s*?=\s*?(?:'|")?([^\2>]*?)\2?(?:\s)?)*?>

the problem with this one though, is that it debugs fine in 'the regulator'
regex ide. however, some of the captures are being overwritten under
preg_match_all.

the implementation would have been an array of these two patterns. preg
should return the type (select or input)...from that point, you'd know where
in the matches to find the type, name, and value regardless of the order in
which it came. as it is, you can use $matches[0][...n] on the input pattern
matches to iterate the full input match.

hope that helps.
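
A likely explanation for the overwritten captures: in PCRE, a capture group that sits inside a repeated group keeps only the text from its last iteration, so a single match cannot hand back type, name and value separately. Separately, inside a character class such as [^\3>] the \3 is not a backreference (PCRE reads it as a character code), so the class does not exclude the opening quote. One way around both, sketched below with a made-up helper tag_attributes() and a made-up sample tag, is to run preg_match_all with PREG_SET_ORDER over the attributes of one tag and build a name => value map, which also makes the attribute order irrelevant.

```php
<?php
// Hypothetical sample tag, not taken from the page discussed in the thread.
$tag = '<input type="hidden" name=\'a1\' value=b1>';

// A capture group inside a repeated group keeps only its final iteration.
preg_match('/<input(?:\s+(name|type|value)=["\']?(\w+)["\']?)*>/i', $tag, $last);
// Here $last[1] is "value" and $last[2] is "b1"; type and name are gone.

// Scanning the attributes one match at a time keeps all of them.
function tag_attributes(string $tag): array
{
    // Branch reset (?|...) puts the value in group 2 whichever quoting is used.
    $re = '/\s([a-z][\w:-]*)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    preg_match_all($re, $tag, $pairs, PREG_SET_ORDER);
    $attrs = [];
    foreach ($pairs as [, $key, $value]) {
        $attrs[strtolower($key)] = $value;
    }
    return $attrs;
}

print_r(tag_attributes($tag));
```

With the map in hand, the caller can look up $attrs['type'], $attrs['name'] and $attrs['value'] directly instead of relying on capture positions.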

