Hello all,
I’m sorry to post so late. This is a summary of last week's
ruby-dev ML.
[ruby-dev:20491] [Oniguruma] explicit capture
[ruby-dev:20514] [Oniguruma] Version 1.9.1
Recently, the Japanese translation of 'Mastering Regular Expressions',
2nd ed., was published in Japan. Kosako, the author of Oniguruma,
read it and found the ExplicitCapture option in .NET, which
disables capturing for all groups except named groups. So Kosako
added an option, REG_OPTION_CAPTURE_ONLY_NAMED_GROUP, and a notation,
(?n:…), in Oniguruma 1.9.1.
But Tanaka Akira pointed out that Ruby already uses the /n option,
and proposed using a /c option instead of /n. Kosako agreed with
Tanaka's idea.
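As a rough illustration of what "capture only named groups" means,
here is a sketch using later Ruby (1.9 and up, where Oniguruma became
the built-in regexp engine). It does not use the (?n:…) notation or
REG_OPTION_CAPTURE_ONLY_NAMED_GROUP from this thread, and treating it
as comparable to the proposed option is my assumption. In those Ruby
versions, plain groups stop capturing once the pattern contains a
named group:

```ruby
# Illustration only: behaviour of Ruby 1.9+ regexps,
# not of the Oniguruma 1.9.1 option discussed in the thread.
m = /(?<year>\d+)-(\d+)/.match("2003-06")

group_names = m.names     # => ["year"]
captured    = m.captures  # => ["2003"]  (the plain (\d+) group is not captured)
whole_match = m[0]        # => "2003-06" (the plain group still takes part in matching)
```

The plain group still affects what matches; it just no longer gets
a capture number.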
[ruby-dev:20495] matching with invalid byte sequence
Kazuhiro NISHIYAMA pointed out that /./ matches an invalid
byte sequence in UTF-8.
  require 'uconv'
  if /./u =~ "\xa3"
    Uconv.u8toeuc($&) #=> illegal UTF-8 sequence (a3) (Uconv::Error)
  end
But '/./s =~ "\xF1"' and '/./e =~ "\xF6"' don't match.
So he suggested that /./ should match only a whole character,
even when $KCODE is UTF-8.
Nobu answered that Ruby's regexp engine doesn't check whether a
multi-byte character sequence is valid, at least in current Ruby.
The reason /./s and /./e don't match "\xF1" and "\xF6",
respectively, is that each of those bytes is treated as the first
byte of a multi-byte character, but no trailing bytes follow it.
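To make the "lone byte" cases concrete, here is a small sketch. It
uses the String encoding API of Ruby 1.9 and later (force_encoding,
valid_encoding?), which did not exist when this thread was written,
so it only illustrates the byte-level situation, not the behaviour
of the regexp engine discussed above. The variable names are made up
for the example.

```ruby
# Illustration only: Ruby 1.9+ encoding API, not the $KCODE-era regexp behaviour.
lone_utf8    = "\xA3".dup.force_encoding("UTF-8")         # continuation byte with no lead byte
lone_sjis    = "\xF1".dup.force_encoding("Shift_JIS")     # lead byte with no trailing byte
lone_euc     = "\xF6".dup.force_encoding("EUC-JP")        # lead byte with no trailing byte
complete_euc = "\xA4\xA2".dup.force_encoding("EUC-JP")    # a complete two-byte character

validity = [lone_utf8, lone_sjis, lone_euc, complete_euc].map(&:valid_encoding?)
# => [false, false, false, true]
```

All three single bytes from the thread are incomplete in their
encodings; only the two-byte EUC-JP sequence forms a whole character.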
Regards,
TAKAHASHI ‘Maki’ Masayoshi E-mail: maki@rubycolor.org