Is \d supposed to match Unicode Numbers?

I posted this as a question here:

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it only matches Latin 0-9 characters. Is this the correct behavior for Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not apply to how Oniguruma is used within Ruby?

Test program:

#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
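For anyone who wants to reproduce this without fetching the page, here is a rough offline sketch. It assumes Ruby 1.9 or later and builds the digit list from the interpreter's own \p{Nd} table rather than from fileformat.info, so the exact set depends on the Unicode version your Ruby ships with. The variable names are mine, not part of the test program above.

```ruby
# Build a string of every code point except the surrogate range
# (U+D800..U+DFFF), which cannot be encoded as valid UTF-8.
all_chars = [*0x0000..0xD7FF, *0xE000..0x10FFFF].pack('U*')

# Keep only the characters Ruby itself classifies as Decimal_Number (Nd).
all_digits = all_chars.scan(/\p{Nd}/)

# \d, by contrast, picks out only the ASCII digits from that set.
ascii_only = all_digits.join.scan(/\d/)

puts all_digits.size
p ascii_only
```

The count printed on the first line will vary between Ruby versions; the second line should be the same ten ASCII digits as in the output above.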

Feel free to discuss here, or answer on Stack Overflow if you have a solid answer and want the rep :)

[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

···

--
(-, /\ \/ / /\/

Gavin Kistner wrote in post #1015799:

Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

irb(main):001:0>
"0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/\d/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
irb(main):002:0>
"0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…".scan(/[[:digit:]]/)
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢",
"٣", "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶",
"۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉", "०",
"१", "२", "३", "४", "५", "६", "७", "८", "९", "০", "১", "২", "৩", "৪",
"৫", "৬", "৭", "৮", "৯", "੦", "੧", "੨"]
irb(main):003:0>

irb(main):004:0> "abcdé".scan(/\w/)
=> ["a", "b", "c", "d"]
irb(main):005:0> "abcdé".scan(/[[:alpha:]]/)
=> ["a", "b", "c", "d", "é"]

So I think it's intentional and consistent behaviour (for some
definition of consistent):

* \w and \d match only Latin letters and digits
* [[:alpha:]] and [[:digit:]] match the full unicode set
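
The same distinction as a small self-contained script, assuming Ruby 1.9 or later with UTF-8 strings. The \p{Nd} and \p{L} property escapes are my addition for comparison; they were not part of the irb session above, and the sample strings are just illustrative.

```ruby
# ASCII 1, Arabic-Indic 3 (U+0663), Devanagari 5 (U+096B)
digits  = "1\u0663\u096B"
# "abcd" followed by e-acute (U+00E9)
letters = "abcd\u00E9"

ascii_digits   = digits.scan(/\d/)           # ASCII 0-9 only
posix_digits   = digits.scan(/[[:digit:]]/)  # Unicode-aware bracket expression
nd_digits      = digits.scan(/\p{Nd}/)       # Unicode Decimal_Number property

ascii_word     = letters.scan(/\w/)          # ASCII letters, digits, underscore
posix_alpha    = letters.scan(/[[:alpha:]]/) # Unicode-aware bracket expression
unicode_letter = letters.scan(/\p{L}/)       # Unicode Letter property

p ascii_digits, posix_digits, nd_digits
p ascii_word, posix_alpha, unicode_letter
```

So if you need Unicode digits, either [[:digit:]] or \p{Nd} should do; \d will not.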

···

--
Posted via http://www.ruby-forum.com/.

I posted this as a question here:

Summarized:
The Oniguruma docs[1] seem to say that \d is supposed to match the Unicode "Decimal_Number" category. However, in Ruby 1.9.1 and 1.9.2 it only matches Latin 0-9 characters. Is this the correct behavior for Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not apply to how Oniguruma is used within Ruby?

[1] http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

It seems that the above Oniguruma document is not directly applicable to Ruby 1.9. See this ticket discussion[2], which includes these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

This discussion was from almost 2 years ago, but sadly I have not been able to find an official Ruby 1.9 version of RE.txt.

[2] http://redmine.ruby-lang.org/issues/1889#note-28

···

On Aug 09, 2011, at 02:28 PM, Gavin Kistner <phrogz@me.com> wrote:

Gavin Kistner wrote in post #1015799:

Is this the correct behavior for
Ruby? Are the Oniguruma docs wrong? Am I misreading them? Do they not
apply to how Oniguruma is used within Ruby?

* \w and \d match only Latin letters and digits
* [[:alpha:]] and [[:digit:]] match the full unicode set

Definitely helpful in achieving the end goal - thanks!

Any guess as to how to reconcile this behavior with what the Oniguruma "ONIG_SYNTAX_RUBY" document says? Looking at the section on \w, we may have a clue:

\w  word character
    Not Unicode: alphanumeric, "_" and multibyte char.
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
[..]
\d  decimal digit char
    Unicode: General_Category -- Decimal_Number

Perhaps "Not Unicode" means "this is how it behaves in some non-Unicode mode", and "Unicode" means "this is how it behaves in some Unicode mode". And perhaps missing from the doc for \d is something like "Not Unicode: 0-9".

If that is correct then the next question for me is how to enable Unicode-mode for Oniguruma. The /u flag on a regexp does not do it, since:

"abç".scan(/\w/) == "abç".scan(/\w/u)
···

On Aug 09, 2011, at 03:38 PM, Brian Candler <b.candler@pobox.com> wrote:

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it's pretty accurate:

[...]
* /\d/ - A digit character ([0-9])
[...]
For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
/[[:digit:]]/ matches any character in the Unicode Nd category
[...]

HTH,
- Markus

···

On 10.08.2011 00:52, Gavin Kistner wrote:

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

It does help, thanks! :)

···

On Aug 9, 2011, at 5:25 PM, Markus Fischer wrote:

On 10.08.2011 00:52, Gavin Kistner wrote:

This discussion was from almost 2 years ago, but sadly I have not been
able to find an official Ruby 1.9 version of RE.txt.

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it's pretty accurate:
[…]
HTH,

This same content is found in:

ri Regexp

···

On Aug 9, 2011, at 16:25, Markus Fischer <markus@fischer.name> wrote:

Hmm, I almost always refer to
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc for the regex doc
and therein it's pretty accurate:

[...]
* /\d/ - A digit character ([0-9])
[...]
For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas
/[[:digit:]]/ matches any character in the Unicode Nd category
[...]