Regex interpolation (in ruby from CVS)

I have been experimenting with building regular expressions from
components with ruby from CVS and have noticed some unexpected
behaviour.

To take a simple case, which works as expected:

[mike@ratdog src]$ irb --simple-prompt

m = /[Mm]ike/
=> /[Mm]ike/
s = /[Ss]tok/
=> /[Ss]tok/
'mike stok' =~ /^#{m} #{s}$/
=> 0
'Mike Stok' =~ /^#{m} #{s}$/
=> 0

When I set up m and s to be case-insensitive, however, the match fails:

m = /mike/i
=> /mike/i
s = /stok/i
=> /stok/i
'mike stok' =~ /^#{m} #{s}$/
=> nil

I would have expected that to work, so I look at what the regular
expression looks like:

=> /^(?i-mx:mike) (?i-mx:stok)$/

'mike stok' =~ /^(?i-mx:mike) (?i-mx:stok)$/
=> nil

I expected 0 here, not nil.

'mike stok' =~ /^(?i:mike) (?i:stok)$/
=> 0

Removing the -mx seems to make it work, and as the m and x modifiers don’t
matter to this match I also tried with them switched on:

'Mike Stok' =~ /^(?imx:mike) (?imx:stok)$/
=> 0

Is there a flaw in the handling of the - in (?i-mx: … ) in the regular
expression? In Perl it seems to work, so I don’t think my expectation
is too wild:

DB<1> print 'mike stok' =~ /^(?i-mx:mike) (?i-mx:stok)$/
1

It would be good if this worked, as I could then build regular expressions
out of sub-expressions without interpolating strings…

Is this a flaw in Ruby or my expectations?

Mike

mike@stok.co.uk | The “`Stok’ disclaimers” apply.
http://www.stok.co.uk/~mike/ | GPG PGP Key 1024D/059913DA
mike@exegenix.com | Fingerprint 0570 71CD 6790 7C28 3D60
http://www.exegenix.com/ | 75D2 9EC4 C1C0 0599 13DA

"Mike Stok" mike@ratdog.stok.co.uk wrote in news article
news:4dlGa.75319$G_.36633@news02.bloor.is.net.cable.rogers.com

I have been experimenting with building regular expressions from
components with ruby from CVS and have noticed some unexpected
behaviour.

I think the problem is that you use #{m} to insert one regexp into another:

irb(main):006:0> m = /[Mm]ike/
/[Mm]ike/
irb(main):007:0> m.to_s
"(?-mix:[Mm]ike)"
irb(main):008:0> m.source
"[Mm]ike"
irb(main):009:0>

You’d be better off if you use

'mike stok' =~ /^#{m.source} #{s.source}$/

or make m and s strings in the first place.
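
For readers comparing the two forms, here is a minimal sketch of what #{...} does with a Regexp as opposed to its source. It assumes a current Ruby in which the (?i-mx:...) group is honoured; the variable names are made up for the illustration.

```ruby
m = /mike/i

as_group  = m.to_s     # what #{m} interpolates: the pattern wrapped with its flags
as_source = m.source   # the bare pattern text, flags dropped

puts as_group    # (?i-mx:mike)
puts as_source   # mike

with_flags    = /^#{m} stok$/          # same as /^(?i-mx:mike) stok$/
without_flags = /^#{m.source} stok$/   # same as /^mike stok$/

p 'Mike stok' =~ with_flags      # 0
p 'Mike stok' =~ without_flags   # nil
```

So #{m} carries the /i flag along inside the group, while #{m.source} leaves it behind.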

Cheers

robert

'mike stok' =~ /^(?i-mx:mike) (?i-mx:stok)$/

=> nil

Can you try this (1.6.8)

pigeon% diff -u regex.c.old regex.c
--- regex.c.old 2003-06-13 17:16:18.000000000 +0200
+++ regex.c 2003-06-13 17:16:51.000000000 +0200
@@ -1011,6 +1011,7 @@
       break;

     case duplicate:
+    case option_set:
       p++;
       break;

@@ -1036,7 +1037,6 @@
     case push_dummy_failure:
     case start_paren:
     case stop_paren:
-    case option_set:
       break;

     case charset:
pigeon%

pigeon% ./ruby -e 'p ("mike stok" =~ /^(?i-mx:mike) (?i-mx:stok)$/)'
0
pigeon%

Guy Decoux

In article 200306131529.h5DFTNs29165@moulon.inra.fr, ts decoux@moulon.inra.fr wrote:

'mike stok' =~ /^(?i-mx:mike) (?i-mx:stok)$/
=> nil

Thanks! I did the moral equivalent to the 1.8 sources from CVS and now:

[mike@ratdog mike]$ irb --simple-prompt

mike = /mike/
=> /mike/
mike_i = /mike/i
=> /mike/i
'Mike Stok' =~ /^#{mike} Stok$/
=> nil
'Mike Stok' =~ /^#{mike_i} Stok$/
=> 0

This makes it much easier to build complex regular expressions :-)

Thanks again,

Mike

In article bccorf$i3gqt$1@ID-52924.news.dfncis.de, Robert Klemme bob.news@gmx.net wrote:

I think the problem is that you use #{m} to insert one regexp into another:

irb(main):006:0> m = /[Mm]ike/
/[Mm]ike/
irb(main):007:0> m.to_s
"(?-mix:[Mm]ike)"
irb(main):008:0> m.source
"[Mm]ike"
irb(main):009:0>

You’d be better off if you use

'mike stok' =~ /^#{m.source} #{s.source}$/

or make m and s strings in the first place.

The problem with this is that .source seems to lose information (the /i
modifier, for example):

[mike@ratdog mike]$ irb --simple-prompt

m = /mike/i
=> /mike/i
'Mike' =~ /#{m}/
=> 0
'Mike' =~ /#{m.source}/
=> nil
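
If the bare source is wanted but the flags should survive, one possible approach (a sketch, not something proposed in this thread) is to read the flags back with Regexp#options and hand them to Regexp.new. The names rebuilt and anchored are hypothetical, chosen only for this example.

```ruby
m = /mike/i

# Regexp#options returns the flag bits; Regexp::IGNORECASE is the /i bit.
case_insensitive = (m.options & Regexp::IGNORECASE) != 0

# Rebuild a regexp from the bare source while keeping the original flags.
rebuilt = Regexp.new(m.source, m.options)

# Caveat: here the flags apply to the whole new pattern, not just the "mike" part.
anchored = Regexp.new("^#{m.source}$", m.options)

p case_insensitive      # true
p 'Mike' =~ rebuilt     # 0
p 'MIKE' =~ anchored    # 0
p 'Mikey' =~ anchored   # nil
```

Unlike interpolating the Regexp itself, this spreads the flags over everything in the new pattern, so it only suits cases where all the pieces share the same modifiers.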

Thanks,

Mike

In article 200306131529.h5DFTNs29165@moulon.inra.fr, ts decoux@moulon.inra.fr wrote:

'mike stok' =~ /^(?i-mx:mike) (?i-mx:stok)$/
=> nil

This works well but seems to break (under 1.8.0) if I use a trivial
character class. As a test:

#!/usr/bin/env ruby

require 'test/unit'

class TC_RegExInterpolate < Test::Unit::TestCase
  def test_simple
    m    = /mike/       # case sensitive
    m_i  = /mike/i      # case insensitive
    m_c  = /[Mm]ike/    # char class - allow upper or lower M
    m_c2 = /[M\-m]ike/  # char class - escaped -

    assert_nil(     'Mike' =~ /#{m}/,    'fail with case sensitive')

    assert_equal(0, 'Mike' =~ /#{m_i}/,  'match with case insensitive (Mike)')
    assert_equal(0, 'mIke' =~ /#{m_i}/,  'match with case insensitive (mIke)')

    assert_equal(0, 'Mike' =~ /#{m_c}/,  'match with char class (Mike)')
    assert_nil(     'mIke' =~ /#{m_c}/,  'fail with char class (mIke)')
    assert_equal(0, 'Mike' =~ /#{m_c}/,  'match with char class (Mike)')

    assert_equal(0, '-ike' =~ /#{m_c2}/, 'match with char class 2 (-ike)')

  end
end

I get this output:

Started
F
Finished in 0.000917 seconds.

  1. Failure!!!
    test_simple(TC_RegExInterpolate) [regex.rb:15]:
    match with char class (Mike).
    <0> expected but was
    <nil>

1 tests, 4 assertions, 1 failures, 0 errors

Mike

As an almost completely irrelevant nit…

  1. Failure!!!

Could the unit test framework author have the “failure” message just
spit out “Failure.”, or possibly “Failure!”? Three bangs is, well, I
just don’t like being yelled at.

It might as well be “FaIluR3!!!111”.

(Ok, I’m having a crappy day.)

    m_c = /[Mm]ike/ # char class - allow upper or lower M

Well, this is this case

pigeon% /usr/bin/ruby -e 'p ("Mike" =~ /(?-i)[Mm]ike/)'
nil
pigeon%

pigeon% diff -u regex.c.old regex.c
--- regex.c.old 2003-06-13 17:16:18.000000000 +0200
+++ regex.c 2003-06-14 09:58:41.000000000 +0200
@@ -1011,6 +1011,7 @@
       break;

     case duplicate:
+    case option_set:
       p++;
       break;

@@ -1036,7 +1037,6 @@
     case push_dummy_failure:
     case start_paren:
     case stop_paren:
-    case option_set:
       break;

     case charset:
@@ -2798,8 +2798,11 @@

       case casefold_on:
        bufp->options |= RE_MAY_IGNORECASE;
+       options |= RE_OPTION_IGNORECASE;
+       continue;
+
       case casefold_off:
-       options ^= RE_OPTION_IGNORECASE;
+       options &= ~RE_OPTION_IGNORECASE;
        continue;

       case option_set:
pigeon%

pigeon% ./ruby -e 'p ("Mike" =~ /(?-i)[Mm]ike/)'
0
pigeon%
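
One way to read the last hunk (an interpretation; the patch itself carries no comment) is that ^= toggles the ignore-case bit, so a (?-i) seen while the bit is already clear switches it on, whereas &= ~ always clears it. The bit arithmetic can be sketched in Ruby, with Regexp::IGNORECASE used purely as a stand-in for the C constant RE_OPTION_IGNORECASE:

```ruby
ignorecase = Regexp::IGNORECASE   # stand-in for RE_OPTION_IGNORECASE in regex.c

options = 0                       # case folding is currently off

toggled = options ^ ignorecase    # old code: XOR flips the bit, turning it ON
cleared = options & ~ignorecase   # patched code: AND-NOT always leaves it OFF

p(toggled & ignorecase)   # 1
p(cleared & ignorecase)   # 0

# When the bit starts out set, both forms agree and clear it.
options = ignorecase
both_clear = [(options ^ ignorecase) & ignorecase,
              (options & ~ignorecase) & ignorecase]
p both_clear              # [0, 0]
```

That would be consistent with /(?-i)[Mm]ike/ failing against "Mike" before the patch and matching afterwards.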

Guy Decoux

In article 200306140802.h5E82tm10133@moulon.inra.fr, ts decoux@moulon.inra.fr wrote:

[Mike grumbles about constructing regexes:-)]

Well, this is this case

[Guy fixes it with a minimal patch]

Excellent, I was thinking about the way regexes/regexps were constructed
in REXML, and then saw Regexp Power
… It seems awkward to me to use strings to interpolate into a regex,
especially when you get into counting backslashes to account for the
double interpolation if you say something like

space = '[ \t]' # vs. space = /[ \t]/
if foo =~ /bar #{space} bax/x
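
The backslash counting can be illustrated with a small sketch; the pattern \d+ and the variable names are chosen only for this example. In a double-quoted string the backslash has to be doubled to survive, while a Regexp literal keeps it as written.

```ruby
digits_str_bad  = "\d+"    # double-quoted: \d is not a string escape, so this is "d+"
digits_str_good = "\\d+"   # backslash doubled so the regexp engine sees \d+
digits_re       = /\d+/    # Regexp literal: no extra escaping needed

p digits_str_bad                      # "d+"
p 'abc 123' =~ /#{digits_str_bad}/    # nil
p 'abc 123' =~ /#{digits_str_good}/   # 4
p 'abc 123' =~ /#{digits_re}/         # 4
```

A single-quoted string such as '[ \t]' avoids the doubling, but it still cannot carry modifiers like /i the way an interpolated Regexp can.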

Your patch allows my test case to work and doesn’t seem to break
anything else.

I think it would be good if this made it into 1.8.

Thanks (again :-)

Mike

Test session:

[mike@ratdog tmp]$ ruby regex.rb
Loaded suite regex
Started

Finished in 0.001459 seconds.

5 tests, 13 assertions, 0 failures, 0 errors
[mike@ratdog tmp]$ cat regex.rb
#!/usr/bin/env ruby

require 'test/unit'

class TC_RegExInterpolate < Test::Unit::TestCase
  def test_case_sensitive
    m = /mike/
    assert_nil(     'Mike' =~ /#{m}/, 'fail Mike')
    assert_equal(0, 'mike' =~ /#{m}/, 'match mike')
  end

  def test_case_insensitive
    m = /mike/i
    assert_equal(0, 'Mike' =~ /#{m}/, 'match Mike')
    assert_equal(0, 'mIke' =~ /#{m}/, 'match mIke')
    assert_nil(     'pIke' =~ /#{m}/, 'fail pIke')
  end

  def test_simple_char_class
    m = /[Mm]ike/
    assert_equal(0, 'Mike' =~ /#{m}/, 'match Mike')
    assert_nil(     'mIke' =~ /#{m}/, 'fail mIke')
  end

  def test_char_class_with_minus
    m = /[M\-m]ike/
    assert_equal(0, 'Mike' =~ /#{m}/, 'match Mike')
    assert_equal(0, '-ike' =~ /#{m}/, 'match -ike')
    assert_nil(     'mIke' =~ /#{m}/, 'fail mIke')
  end

  def test_cozens_article
    # From Regexp Power
    #
    # my $post_town = qr/[A-Z]{1,2}/;
    # my $area = qr/\d{1,3}/;
    # my $space = qr/[ \t]+/;
    # my $region = qr/\d{1,2}/;
    # my $street = qr/[A-Z][A-Z]/;
    #
    # And we can also add modifiers to parts of a quoted regular expression:
    #
    # my $uk_postcode = qr/$post_town $area $space $region $street/x;

    post_town = /[A-Z]{1,2}/
    area      = /\d{1,3}/
    space     = /[ \t]+/
    region    = /\d{1,2}/
    street    = /[A-Z][A-Z]/

    uk_postcode = /#{post_town} #{area} #{space} #{region} #{street}/x

    assert_equal(0, 'BS8 4YA'  =~ /#{uk_postcode}/, "good UK test")
    assert_nil(     'TX 78731' =~ /#{uk_postcode}/, "UK pattern against US code")

    # my $prefix = qr/zip code: /i;
    # my $code   = qr/[A-Z][A-Z][ \t]+\d{5}/;
    #
    # $address =~ /$prefix $code/x;

    prefix = /zip code: /i
    code   = /[A-Z][A-Z][ \t]+\d{5}/

    assert_equal(0, 'Zip Code: TX 78731' =~ /#{prefix} #{code}/x, "good zip test")
  end
end

