Bug in regex engine? Must be

D_Krmpotic · 3 March 2008 19:24

Hi,

I'm using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:

require 'test/unit'

class TestRegexBug < Test::Unit::TestCase

  def test_bug
    hours = "pon-čet"

    assert(hours =~ /[č]et/i)
    assert(hours =~ /čet/i)
    assert(hours =~ /-čet/i)
    assert(hours =~ /[cč]et/i)
    assert(hours =~ /-[č]et/i)   # this is the assertion that fails
  end

end

As you can see, this only happens with Unicode letters (the last
assertion fails). I'm used to the fact that //i doesn't work for Unicode
characters, and I already know that you need two dots to match one of
them. But this problem is different and weirder, because what triggers
it is a minus sign before the square brackets: if you remove either the
'-' or the '[]' from the regex, it works.

Can you comment?

thank you,
david

--
Posted via http://www.ruby-forum.com/.

Rob_Biedenharn1 · 3 March 2008 22:45

Hi,

I'm using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:

$KCODE = 'UTF8'
require 'jcode'

require 'test/unit'

class TestRegexBug < Test::Unit::TestCase

  def test_bug

    hours = "pon-čet"

    assert(hours =~ /[č]et/i)
    assert(hours =~ /čet/i)
    assert(hours =~ /-čet/i)
    assert(hours =~ /[cč]et/i)
    assert(hours =~ /-[č]et/i)

  end

end

As you can see, this only happens with unicode letters... (the last test
fails).. I'm used to the fact that //i doesn't work for unicode chars
and I already know that you need two dots to match one of these.. But
this problem is different and weirder, because what triggers it is a
minus sign before the square brackets.. if you remove either the '-' or
'[]' from the regex, it works..

Can you comment?

thank you,
david

Ruby is not natively aware of Unicode, but you can get all of these assertions to pass if you give it the $KCODE hint, as in the two lines added at the top of the test above.
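
For readers on Ruby 1.9 or later, where $KCODE no longer has any effect and jcode has been removed, a rough equivalent can be sketched as follows. This is only an illustration under that assumption, not part of the original test: the string and the patterns are written with \u escapes so that both are UTF-8 regardless of the source file's encoding, and the names patterns and results are made up for the example.

```ruby
# Sketch for Ruby 1.9+ (an assumption; the thread itself is about 1.8.6).
# "\u010D" is the letter "č"; the string and every pattern are UTF-8,
# so the character class holds one character rather than two bytes.
hours = "pon-\u010Det"

patterns = [
  /[\u010D]et/i,
  /\u010Det/i,
  /-\u010Det/i,
  /[c\u010D]et/i,
  /-[\u010D]et/i   # the pattern that failed under byte semantics
]

# =~ returns the match position or nil; turn that into true/false.
results = patterns.map { |re| !(hours =~ re).nil? }
puts results.inspect   # prints [true, true, true, true, true]
```

With character-aware matching, the fifth pattern behaves like the other four, which is the effect the $KCODE hint has on 1.8.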

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

On Mar 3, 2008, at 2:24 PM, D. Krmpotic wrote:

Stefan_Lang1 · 8 March 2008 00:41

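In the regex, [č] is a character class containing _two_ bytes, since
"č" is encoded in UTF-8 as "\304\215". So Ruby tries to match a minus
followed by _one_ of those bytes, followed by "et". The regex would
therefore match "pon-\304et" or "pon-\215et", but not "pon-\304\215et".
Without the leading minus, /[č]et/ succeeds only because it matches the
second byte, "\215", followed by "et".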
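
The byte-level behaviour described above can be reproduced on a modern Ruby as a rough sketch. Ruby 1.9 and later are encoding-aware, so this assumes String#b and the /n regex flag as stand-ins for the byte-oriented matching that 1.8 does without $KCODE; all variable names are illustrative.

```ruby
# Sketch for Ruby 1.9+ (an assumption): String#b and the /n flag are used
# to imitate the byte-oriented matching of Ruby 1.8 without $KCODE.
c_caron = "\u010D"                                  # the letter "č"
octal   = c_caron.bytes.map { |b| format("\\%o", b) }
puts octal.join                                     # prints \304\215

# A minus, then ONE byte out of the two, then "et".
byte_class = /-[\xC4\x8D]et/n

full_match  = "pon-\u010Det".b =~ byte_class   # both bytes present: no match
first_only  = "pon-\xC4et".b   =~ byte_class   # only \304 after the minus
second_only = "pon-\x8Det".b   =~ byte_class   # only \215 after the minus

# Without the minus, the class can match the second byte on its own.
no_minus = "pon-\u010Det".b =~ /[\xC4\x8D]et/n

p full_match    # prints nil
p first_only    # prints 3
p second_only   # prints 3
p no_minus      # prints 5
```

The last result is a byte offset: the match starts at the second byte of "č", which is why the patterns without a leading minus appeared to work.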

Stefan


D_Krmpotic · 7 March 2008 21:30

Great info, I completely forgot that this is available.
Thank you,
david

$KCODE = 'UTF8'
require 'jcode'

