Upcase in 1.9.2-preview1

Can someone confirm whether this is intentional or not?

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
>> s = "über"
=> "über"
>> s.upcase
=> "üBER"

That is, a lower-case "ü" is not uppercased to "Ü". And yet, "Ü" is
detected as an upper-case letter:

>> "ÜBER" =~ /[[:upper:]]/
=> 0
>> "ÜBER" =~ /[[:lower:]]/
=> nil

Thanks,

Brian.

--
Posted via http://www.ruby-forum.com/.

Hi,

Can someone confirm whether this is intentional or not?

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
>> s = "über"
=> "über"
>> s.upcase
=> "üBER"

That is, a lower-case "ü" is not uppercased to "Ü". And yet, "Ü" is
detected as an upper-case letter:

>> "ÜBER" =~ /[[:upper:]]/
=> 0
>> "ÜBER" =~ /[[:lower:]]/
=> nil

By the way: I noticed that there are some Unicode characters for
which it is not clear whether they are upper or lower case. For
example, the DZ digraph has a version Dz that is the downcased
version of DZ and the upcased version of dz.

  U+01F1 DZ
  U+01F2 Dz
  U+01F3 dz

  Latin Extended-B - Wikipedia

Vim's ~ operator cycles through the three values. How will or
should Ruby treat them?

Bertram

On Wednesday, 29 Jul 2009, 16:43:36 +0900, Brian Candler wrote:

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

Brian Candler wrote:

Can someone confirm whether this is intentional or not?

>> s = "über"
=> "über"
>> s.upcase
=> "üBER"

To answer my own question: looking at the source code, it looks like
this *is* intentional. From encoding.c:

int
rb_enc_toupper(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_UPPER_CASE(c):(c));
}

int
rb_enc_tolower(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_LOWER_CASE(c):(c));
}

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
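
The effect of that rule can be sketched in Ruby itself. This is only an illustration, not how the interpreter does it: `ascii_only_upcase` is a hypothetical helper name, and it uses String#tr to touch nothing but a-z, which is what the C code above amounts to per code point.

```ruby
# Hypothetical helper (not part of Ruby): upcase only the ASCII letters a-z
# and leave every other character untouched, mirroring the
# "ASCII code points only" rule in rb_enc_toupper.
def ascii_only_upcase(str)
  str.tr("a-z", "A-Z")
end

puts ascii_only_upcase("über")   # prints "üBER"
```

The "ü" falls outside a-z, so it passes through unchanged, which matches the irb output at the top of the thread.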


Bertram Scharpf wrote:

Vim's ~ operator cycles through the three values. How will or
should Ruby treat them?

I only have access to a slightly older 1.9.2 here, but:

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-04-08 trunk 23158) [i686-linux]"
>> ["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:lower:]]/ }


0
=> ["DZ", "Dz", "dz"]
>> ["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:upper:]]/ }
0


=> ["DZ", "Dz", "dz"]

So the first is upper, the third is lower, and the second is neither :)

upcase/downcase does not affect any of them - but I'm not sure if the
current behaviour is correct, which is why I started this thread.
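
One way to see why the middle one is "neither": Unicode puts U+01F2 in a separate titlecase category (Lt), distinct from uppercase (Lu) and lowercase (Ll). A minimal sketch, assuming a Ruby whose regexp engine understands the \p{Lu}, \p{Lt} and \p{Ll} properties on UTF-8 strings; `letter_case_category` is a hypothetical helper name.

```ruby
# Hypothetical helper: classify a single character by its Unicode
# general category (Lu = uppercase, Lt = titlecase, Ll = lowercase).
def letter_case_category(ch)
  case ch
  when /\A\p{Lu}\z/ then :uppercase
  when /\A\p{Lt}\z/ then :titlecase
  when /\A\p{Ll}\z/ then :lowercase
  else :other
  end
end

# The three forms of the DZ digraph discussed above.
["\u01F1", "\u01F2", "\u01F3"].each do |c|
  printf("U+%04X %s\n", c.ord, letter_case_category(c))
end
```

So the character is not unclassified, it just belongs to a third case that [[:upper:]] and [[:lower:]] do not cover.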


Is it the correct approach?
For me it's very clear that the upcase version of á is Á.

2009/7/29 Brian Candler <b.candler@pobox.com>:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

--
Iñaki Baz Castillo
<ibc@aliax.net>

Iñaki Baz Castillo wrote:

2009/7/29 Brian Candler <b.candler@pobox.com>:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it's very clear that the upcase version of á is Á.

There are perfectly clear Unicode rules for case conversion, but they
are not simple. In some cases you need to replace one character by two
(e.g. ß to SS)
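
To make the one-to-two case concrete, here is a small sketch. It assumes a later Ruby (2.4 or newer), where String#upcase applies the full Unicode case mappings; it does not apply to the 1.9.2 build discussed in this thread. `upcase_growth` is a hypothetical helper name.

```ruby
# Assumes Ruby 2.4 or later, where String#upcase applies full Unicode
# case mapping. Hypothetical helper: return the upcased string and how
# many characters it gained.
def upcase_growth(str)
  up = str.upcase
  [up, up.length - str.length]
end

p upcase_growth("Straße")   # => ["STRASSE", 1]
```

The result is one character longer than the input, so any code that assumes upcase preserves length (or character positions) breaks once non-ASCII mappings are applied.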

There is a useful discussion about this from Python's point of view
here:


Hi,

In message "Re: upcase in 1.9.2-preview1" on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz Castillo <ibc@aliax.net> writes:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it's very clear that the upcase version of á is Á.

But it's locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.
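
The usual illustration of the locale dependence is Turkish, where the upper case of "i" is a dotted "İ" (U+0130) rather than "I". A minimal sketch, assuming a later Ruby (2.4 or newer) in which String#upcase accepts a :turkic option; nothing like this exists in 1.9.2, and `upcase_variants` is a hypothetical helper name.

```ruby
# Assumes Ruby 2.4 or later, where upcase takes an optional :turkic
# argument. Hypothetical helper: upcase the same string under the
# default rules and under the Turkic rules.
def upcase_variants(str)
  { default: str.upcase, turkic: str.upcase(:turkic) }
end

p upcase_variants("i")   # default gives "I", turkic gives "İ" (U+0130)
```

The same input has two correct answers depending on the language of the text, which a method with no locale argument cannot know.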

              matz.

Yukihiro Matsumoto wrote:

>> That is: only ASCII characters (potentially encoded as UTF16 or
>> whatever) are eligible for case conversion.
>
>Is it the correct approach?
>For me it's very clear that the upcase version of á is Á.

But it's locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.

99% of the time it's locale independent. I don't follow this logic of "if we can't make it work for everyone then it should stay broken for everyone". Following the Unicode rules would fix uppercasing for 95% of those not using English. And for the 5% who have to deal with those asymmetric upper/lower case rules... well, they'd have to deal with it either way.

*percentages above are guesswork

In message "Re: upcase in 1.9.2-preview1" on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz Castillo <ibc@aliax.net> writes: