Win32 Ruby 1.9 regexp and Cyrillic string

# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding             #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/) # finds all Cyrillic characters
str2.gsub!(/\w/u, '')       # removes only Latin characters
puts str2

The question is: why does /\w/ ignore Cyrillic characters?

I have installed the latest Ruby package from http://rubyinstaller.org/.
Here is the output of ruby -v:
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]


--
Posted via http://www.ruby-forum.com/.

str2.gsub!(/\w/u,'') #removes only latin characters

The question is: why does /\w/ ignore Cyrillic characters?

Are Cyrillic characters supposed to count as "word characters" (\w)?
If so, then it looks like a bug to me. Ping core.
-rp


Nikolay Khodyunya wrote:

The question is: why does /\w/ ignore Cyrillic characters?

http://redmine.ruby-lang.org/issues/show/3181
http://redmine.ruby-lang.org/issues/show/3202

These might be related. If you think the behavior is wrong, bring it up on core.
-rp


I think that \w (and similar shortcuts) is supposed to match ASCII
characters only, so it's equivalent to [a-zA-Z0-9_].

Isn't there some kind of Unicode character class you can use?
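
To illustrate the Unicode character classes asked about here, a minimal sketch follows. It assumes a UTF-8 source file and reuses the sample string from the original post. The constant SAMPLE and the helper strip_with are hypothetical names introduced only for this example. Whether \w itself covers non-ASCII letters has varied between Ruby versions and builds, as this thread shows, so the sketch avoids it.

```ruby
# coding: utf-8
# Sketch only: SAMPLE and strip_with are hypothetical names for illustration.

SAMPLE = "asdfМикимаус"

# Return a copy of text with every match of pattern removed.
def strip_with(pattern, text = SAMPLE)
  text.gsub(pattern, '')
end

cyrillic_only = strip_with(/\p{Cyrillic}/)  # Unicode script property
any_letter    = strip_with(/\p{L}/)         # Unicode general category "Letter"
posix_alpha   = strip_with(/[[:alpha:]]/)   # POSIX bracket expression

p cyrillic_only  # only the Latin letters remain
p any_letter     # nothing remains
p posix_alpha    # nothing remains
```

The property classes state their intent explicitly, which makes them a safer choice than \w when the input is known to contain non-ASCII text.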


On 4/27/10, Nikolay Khodyunya <nickolayho@gmail.com> wrote:

The question is: why does /\w/ ignore Cyrillic characters?

Caleb Clausen wrote:

I think that \w (and similar shortcuts) is supposed to match ASCII
characters only, so it's equivalent to [a-zA-Z0-9_].

Isn't there some kind of Unicode character class you can use?

Actually, "asdfМикимаус".gsub!(/\w/u,'') returns "" on Linux, so the
problem comes from the Windows package.

You can use "asdfМикимаус".gsub!(/\p{L}/,'') to remove letters, though.
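
Note that \p{L} covers letters only, while \w traditionally also covers digits and the underscore. A rough sketch of a replacement that behaves the same on every platform could look like this. The names UNICODE_WORD and strip_word_chars are hypothetical, and the second sample string is made up for illustration.

```ruby
# coding: utf-8
# Sketch only: UNICODE_WORD and strip_word_chars are hypothetical names.

# Letters and digits from any script, plus the underscore.
UNICODE_WORD = /[\p{L}\p{N}_]/

# Non-destructive gsub is used on purpose: gsub! returns nil
# when nothing was substituted, which is easy to trip over.
def strip_word_chars(text)
  text.gsub(UNICODE_WORD, '')
end

p strip_word_chars("asdfМикимаус")       # every character is a letter
p strip_word_chars("abc_42 Мики-маус!")  # space and punctuation survive
p "abc_42".gsub(/\p{L}/, '')             # \p{L} alone keeps "_42"
```

If only letters should go, the plain \p{L} from the post above is enough. The wider class is only needed when digits and underscores should be treated as word characters too.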

If they're the same version, then it might be a Windows bug. Try it with
trunk, and if it still fails, submit a bug report to the tracker.
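
For comparing builds or attaching to a bug report, a small self-describing script may help. This is only a sketch: the method name word_class_report and the choice of fields are assumptions, not anything the tracker requires. Run it unchanged on each platform and compare the w_matches line.

```ruby
# coding: utf-8
# Sketch only: word_class_report is a hypothetical helper for a bug report.

def word_class_report(text = "asdfМикимаус")
  {
    :ruby      => RUBY_DESCRIPTION,             # same information as `ruby -v`
    :platform  => RUBY_PLATFORM,
    :encoding  => text.encoding.name,
    :w_matches => text.scan(/\w/u).join,        # the behavior in question
    :l_matches => text.scan(/\p{L}/).join       # reference: all letters
  }
end

word_class_report.each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

If w_matches differs between two builds of the same version while l_matches stays the same, that difference is the thing to report.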


Roger Pack wrote:

Here's a copy of trunk if that would be useful:

http://rubydoc.ruby-forum.com/ruby_distros/ruby_trunk_no_patches_installed.7z

GL.
-rp
