regex question

  #1 (permalink)  
Old 05-25-2008
Amadeus W.M.
 

I need to find patterns like these (e.g. with sed or perl or grep):

G1150G111
00443E104

etc. That is, 9-character words made only of letters or digits, of which at
least one character is a digit. The letters can occur in any position.

What would be the pattern to match? Thanks!

  #2 (permalink)  
Old 05-26-2008
Marcel Bruinsma
 

In article <pan.2008.05.25.17.19.45@verizon.net>,
Amadeus W.M. wrote:

> I need to find patterns like these (e.g. with sed or perl or grep):
>
> G1150G111
> 00443E104
>
> etc. That is, 9 digit words made only of letters or digits, of which
> at least one character is a digit. The letters can occur in random
> positions.


perl -ne '
while (/(^|[^[:alnum:]])([[:alnum:]]{9})([^[:alnum:]]|$)/g) {
if ($2 =~ /[[:digit:]]/) {print;last;}
}' <infile >outfile


Regards,
Marcel

--
printf -v email $(echo \ 155 141 162 143 145 154 155 141 162 \
143 145 154 100 157 162 141 156 147 145 56 156 154 | tr \ \\)
# O Herr, lass Hirn vom Himmel fallen! #
  #3 (permalink)  
Old 05-26-2008
Amadeus W.M.
 

On Mon, 26 May 2008 00:10:08 +0200, Marcel Bruinsma wrote:

> In article <pan.2008.05.25.17.19.45@verizon.net>,
> Amadeus W.M. wrote:
>
>> I need to find patterns like these (e.g. with sed or perl or grep):
>>
>> G1150G111
>> 00443E104
>>
>> etc. That is, 9 digit words made only of letters or digits, of which at
>> least one character is a digit. The letters can occur in random
>> positions.

>
> perl -ne '
> while (/(^|[^[:alnum:]])([[:alnum:]]{9})([^[:alnum:]]|$)/g) {
> if ($2 =~ /[[:digit:]]/) {print;last;}
> }' <infile >outfile
>
>
> Regards,
> Marcel


Thanks! I'm not sure this will work for what I need though. Perhaps my
initial question was incomplete. I have a file with many lines of the form


company type companyId $amount #shares etc.

For instance:

GENERAL MTRS CORP Preferred 370442691 4,602 200,000
Shrs Shared-Defined 1 200,000


The file contains many lines like this, along with lines of other kinds.
I'm trying to find the lines of this form and, within each line found,
extract the companyId, $amount and #shares.

To this end, I'm searching for the pattern "companyId $amount #shares". I
have something like

([[:alnum:]]{9})\s+(\$?number_pattern)\s+(number_pattern)

where number_pattern is something that matches numbers, with or without
commas. With [[:alnum:]]{9} for companyId, followed by the other two
patterns, in the example above I pick up

companyId = Preferred
$amount = 370442691
#shares = 4,602

which would be wrong (but the program thinks it's ok). I need to change
the companyId pattern from the simple-minded [[:alnum:]]{9} to something
that requires at least 1 digit, and keep the next two patterns.
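
The mis-parse described above can be reproduced in a few lines, together with one possible repair. This is only a sketch: number_pattern is not spelled out in the post, so `$num` below is an assumed stand-in (plain digits, or digits grouped by commas), and `extract`, `$naive` and `$fixed` are illustrative names. The repair uses a lookahead to require a digit among the nine characters, plus a lookbehind so that the match cannot start in the middle of a longer word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: a "number" is either comma-grouped digits (4,602) or plain digits.
my $num = qr/[[:digit:]]{1,3}(?:,[[:digit:]]{3})+|[[:digit:]]+/;

# The pattern from the post: any nine alphanumerics count as a companyId.
my $naive = qr/([[:alnum:]]{9})\s+(\$?$num)\s+($num)/;

# Same shape, but the nine characters must start a word and contain a digit.
my $fixed = qr/(?<![[:alnum:]])((?=[[:alnum:]]{0,8}[[:digit:]])[[:alnum:]]{9})\s+(\$?$num)\s+($num)/;

# Return (companyId, amount, shares) for a line, or an empty list if no match.
sub extract {
    my ($re, $line) = @_;
    return $line =~ $re ? ($1, $2, $3) : ();
}

my $line = 'GENERAL MTRS CORP Preferred 370442691 4,602 200,000';
print join('|', extract($naive, $line)), "\n";   # Preferred|370442691|4,602
print join('|', extract($fixed, $line)), "\n";   # 370442691|4,602|200,000
```

The lookahead `(?=[[:alnum:]]{0,8}[[:digit:]])` consumes nothing; it only checks that a digit occurs within the next nine alphanumerics before `[[:alnum:]]{9}` is allowed to match.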

  #4 (permalink)  
Old 05-26-2008
Amadeus W.M.
 

On Mon, 26 May 2008 00:10:08 +0200, Marcel Bruinsma wrote:

> In article <pan.2008.05.25.17.19.45@verizon.net>,
> Amadeus W.M. wrote:
>
>> I need to find patterns like these (e.g. with sed or perl or grep):
>>
>> G1150G111
>> 00443E104
>>
>> etc. That is, 9 digit words made only of letters or digits, of which at
>> least one character is a digit. The letters can occur in random
>> positions.

>
> perl -ne '
> while (/(^|[^[:alnum:]])([[:alnum:]]{9})([^[:alnum:]]|$)/g) {
> if ($2 =~ /[[:digit:]]/) {print;last;}
> }' <infile >outfile
>
>
> Regards,
> Marcel


I guess I need something like

([[:digit:]][[:alnum:]]{8})|([[:alnum:]]{1}[[:digit:]][[:alnum:]]{7})| etc.

That is, keep moving the [[:digit:]] over each of the 9 possible
positions. Is there a smarter way to write this?
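
Two ways of avoiding the hand-written list come to mind; neither is taken from the thread, so treat this as a sketch. The first builds the nine alternatives in a loop, the second replaces them with a single lookahead. The names `$by_enumeration`, `$by_lookahead` and `find_ids` are made up for the example, and the sample text is invented apart from the two IDs quoted in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build "digit at offset 0", "digit at offset 1", ..., "digit at offset 8".
my @alts;
for my $i (0 .. 8) {
    push @alts, sprintf '[[:alnum:]]{%d}[[:digit:]][[:alnum:]]{%d}', $i, 8 - $i;
}
my $enumerated = join '|', @alts;

# Both patterns refuse to match inside a longer alphanumeric word.
my $by_enumeration = qr/(?<![[:alnum:]])(?:$enumerated)(?![[:alnum:]])/;
my $by_lookahead   = qr/(?<![[:alnum:]])(?=[[:alnum:]]{0,8}[[:digit:]])[[:alnum:]]{9}(?![[:alnum:]])/;

# Return every substring of $text matched by $re.
sub find_ids {
    my ($re, $text) = @_;
    my @ids = $text =~ /($re)/g;
    return @ids;
}

my $text = 'G1150G111 00443E104 Preferred 370442691 ABC1234567 12345678';
print join(' ', find_ids($by_enumeration, $text)), "\n";   # G1150G111 00443E104 370442691
print join(' ', find_ids($by_lookahead,   $text)), "\n";   # G1150G111 00443E104 370442691
```

"Preferred" is rejected for having no digit, while "ABC1234567" (ten characters) and "12345678" (eight) are rejected for their length.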

  #5 (permalink)  
Old 05-27-2008
Marcel Bruinsma
 

In article <pan.2008.05.26.02.15.07@verizon.net>,
Amadeus W.M. wrote:

>>
>>> I need to find patterns like these (e.g. with sed or perl or grep):
>>>
>>> G1150G111
>>> 00443E104
>>>

>> perl -ne '
>> while (/(^|[^[:alnum:]])([[:alnum:]]{9})([^[:alnum:]]|$)/g) {
>> if ($2 =~ /[[:digit:]]/) {print;last;}
>> }' <infile >outfile

>
> Thanks! I'm not sure this will work for what I need though. Perhaps my
> initial question was incomplete. I have a file with many lines of the
> form
>
> company type companyId $amount #shares etc.
>
> For instance:
>
> GENERAL MTRS CORP Preferred 370442691 4,602 200,000
> Shrs Shared-Defined 1 200,000
>
>
> The file has many lines like this, but not only. I'm trying to find
> the lines of this form, and within each line found, extract the
> companyId, $amount and #shares.
>
> To this end, I'm searching for the pattern "companyId $amount
> #shares". I have something like
>
> ([[:alnum:]]{9})\s+(\$?number_pattern)\s+(number_pattern)
>
> where number_pattern is something that matches numbers, with or
> without commas. With [[:alnum:]]{9} for companyId, followed by the
> other two patterns, in the example above I pick up
>
> companyId = Preferred
> $amount = 370442691
> #shares = 4,602
>
> which would be wrong (but the program thinks it's ok). I need to
> change the companyId pattern from the simple minded [[:alnum:]]{9} to
> something to include at least 1 digit. And keep the next two patterns.


The '$2 =~ /[[:digit:]]/' is the check for 'at least one digit'. Doing
everything in a single expression is also possible, just more complicated:

#!/usr/bin/perl
# One alternative per possible position of the required digit.
$p =
'[[:blank:]]([[:digit:]][[:alnum:]]{8}|'
.'[[:alnum:]][[:digit:]][[:alnum:]]{7}|'
.'[[:alnum:]]{2}[[:digit:]][[:alnum:]]{6}|'
.'[[:alnum:]]{3}[[:digit:]][[:alnum:]]{5}|'
.'[[:alnum:]]{4}[[:digit:]][[:alnum:]]{4}|'
.'[[:alnum:]]{5}[[:digit:]][[:alnum:]]{3}|'
.'[[:alnum:]]{6}[[:digit:]][[:alnum:]]{2}|'
.'[[:alnum:]]{7}[[:digit:]][[:alnum:]]|'
.'[[:alnum:]]{8}[[:digit:]])'
.'[[:blank:]]+([[:digit:]]+,[[:digit:]]+)'   # $amount
.'[[:blank:]]+([[:digit:]]+,[[:digit:]]+)';  # #shares
while (<DATA>) {
if (/$p/o) {
$companyID = $1;
$amount = $2;
$shares = $3;
print "|$companyID|$amount|$shares|\n";
}
}
__END__
GENERAL MTRS CORP Preferred 370442691 4,602 200,000
Shrs Shared-Defined 1 200,000

If the companyIDs contain no lowercase, you should
replace all '[:alnum:]' by '[:upper:][:digit:]'.
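
To see what that substitution changes, here is a small sketch that takes the character class as a parameter and uses the two-step check (match nine characters, then test for a digit). The helper name `ids_in` and the sample words "rev2008ab" and "PREFERRED" are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $class is the inside of a bracket expression, e.g. '[:alnum:]'
# or '[:upper:][:digit:]'. Returns the 9-character words built from
# that class that contain at least one digit.
sub ids_in {
    my ($class, $text) = @_;
    # Step 1: whole words of exactly nine characters from the class.
    my @words = $text =~ /(?<![[:alnum:]])([$class]{9})(?![[:alnum:]])/g;
    # Step 2: keep only the words containing a digit.
    return grep { /[[:digit:]]/ } @words;
}

my $text = 'G1150G111 rev2008ab 00443E104 PREFERRED';
print join(' ', ids_in('[:alnum:]', $text)), "\n";           # G1150G111 rev2008ab 00443E104
print join(' ', ids_in('[:upper:][:digit:]', $text)), "\n";  # G1150G111 00443E104
```

With '[:upper:][:digit:]' the lowercase word "rev2008ab" no longer qualifies, which is the point of the substitution when the IDs are known to be uppercase.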


Regards,
Marcel
