Ruby 1.9.2: /\w/u does not match umlauts ("ü")

Andreas_S4 · 29 September 2010 13:42

I found that, unlike in Ruby 1.8, the word character class in Ruby 1.9
regexes does not match German umlauts (or any other non-ASCII
letters). According to the Oniguruma documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt), it should match
everything from the Unicode "letter" category, which includes umlauts.

test.rb (also attached):
# encoding: utf-8
$KCODE='u'
s = "ü"
puts s.match(/\w/u).inspect

Result with ruby 1.8:
#<MatchData "ü">

Result with ruby 1.9.2:
nil

Is that a bug, or is there any reason behind this behavior?

Attachments:
http://www.ruby-forum.com/attachment/5113/test.rb

--
Posted via http://www.ruby-forum.com/.

Andreas_S4 · 29 September 2010 15:18

After some more googling I found this bug report (don't know why I
didn't catch it earlier), which states that this is the desired behavior:
http://redmine.ruby-lang.org/issues/show/3181

Still, I don't understand the motivation for making this change.

Roger_Pack4 · 29 September 2010 15:36

Andreas S. wrote:

I found that, unlike in Ruby 1.8, the word character class in Ruby 1.9
regexes does not match German umlauts (or any other non-ASCII
letters). According to the Oniguruma documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt), it should match
everything from the Unicode "letter" category, which includes umlauts.

So it's intended; however, if you strongly dislike this, then complain
about it, since apparently it's surprising to a number of people.

-r

Andreas_S4 · 29 September 2010 15:59

Ruby Talk FAQ · rdp/ruby_tutorials_core Wiki · GitHub

"Basically at a certain patch level of 1.9.1, \w was set to no longer
match unicode characters, because the core developers were concerned
that this was not what people expected from \w."

Well, 1.9.2 behaving differently than 1.9.1 and 1.8 is certainly less
expected.

Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this ("Word"
is not a Unicode character category).
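
For illustration, a minimal sketch of the difference described above, assuming Ruby 1.9.2 or later and a UTF-8 source file. The variable names are made up for this example, and the POSIX bracket class is included only for comparison:

```ruby
# encoding: utf-8
s = "ü"

# \w is ASCII-only in Ruby 1.9.2, so it finds nothing in "ü".
ascii_word = s =~ /\w/            # => nil

# \p{Word} is the Unicode-aware counterpart of \w.
unicode_word = s =~ /\p{Word}/    # => 0

# The POSIX bracket class is also Unicode-aware on UTF-8 strings.
posix_alpha = s =~ /[[:alpha:]]/  # => 0

p ascii_word, unicode_word, posix_alpha
```

Each expression returns the index of the first match, or nil when there is none.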

Roger_Pack4 · 29 September 2010 20:15

Well, 1.9.2 behaving differently than 1.9.1 and 1.8 is certainly less
expected.

Yeah. 1.9.1 behaving differently at a different *patch level* is less
than expected, too.

Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this ("Word"
is not a Unicode character category).

It's odd that there's no standard. Maybe Ruby made this up on its
own, then?

I think it's mentioned briefly here:

http://svn.ruby-lang.org/repos/ruby/trunk/doc/re.rdoc

Piotr_Szotkowski · 2 October 2010 13:55

Andreas S.:

Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this ("Word"
is not a Unicode character category).

You can also use the (documented) \p{L} property:

chastell@devielle:~$ ruby -ve "p '℉üüü' =~ /\p{L}/"
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
1

BTW: I find Regex Tutorial - Unicode Characters and Properties
most useful.
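
As a rough sketch of how \p{L} differs from \p{Word} (assuming Ruby 1.9.2 or later and a UTF-8 source file; the variable names are only illustrative): \p{L} covers letters only, while \p{Word} also accepts digits and the underscore, as \w does for ASCII. It also shows why the one-liner above prints 1: "℉" is a symbol, not a letter, so the first match is the "ü" at index 1.

```ruby
# encoding: utf-8
text = "℉üüü"

# "℉" is a symbol rather than a letter, so the first letter is at index 1.
first_letter = text =~ /\p{L}/        # => 1

# \p{L} matches letters only; digits and "_" are rejected.
digit_as_letter      = "5" =~ /\p{L}/ # => nil
underscore_as_letter = "_" =~ /\p{L}/ # => nil

# \p{Word} accepts digits and "_" in addition to letters.
digit_as_word      = "5" =~ /\p{Word}/ # => 0
underscore_as_word = "_" =~ /\p{Word}/ # => 0

p first_letter, digit_as_letter, underscore_as_letter,
  digit_as_word, underscore_as_word
```

So \p{L} is the stricter choice when only letters should match, and \p{Word} is the closer replacement for the old \w behavior.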

— Piotr Szotkowski

--
I should like to find the person who decided that since
‘bookmarks’ and ‘history’ were both lists of URLs they
ought to be integrated in a single database. I should like
to shake him warmly by the throat until his head comes off.
[Roger Burton West on Firefox, hates-software]
