Upcase in 1.9.2-preview1

Brian_Candler · 29 July 2009 07:43

Can someone confirm whether this is intentional or not?

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
>> s = "über"
=> "über"
>> s.upcase
=> "üBER"

That is, a lower-case "ü" is not uppercased to "Ü". And yet, "Ü" is
detected as an upper-case letter:

>> "ÜBER" =~ /[[:upper:]]/
=> 0
>> "ÜBER" =~ /[[:lower:]]/
=> nil

Thanks,

Brian.

--
Posted via http://www.ruby-forum.com/.

Bertram_Scharpf · 29 July 2009 10:07

Hi,

Can someone confirm whether this is intentional or not?

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
>> s = "über"
=> "über"
>> s.upcase
=> "üBER"

That is, a lower-case "ü" is not uppercased to "Ü". And yet, "Ü" is
detected as an upper-case letter:

>> "ÜBER" =~ /[[:upper:]]/
=> 0
>> "ÜBER" =~ /[[:lower:]]/
=> nil

By the way: I noticed that for some Unicode characters it is not
clear whether they are upper or lower case. For example, the DZ
digraph has a form Dz that is both the downcased version of DZ
and the upcased version of dz.

  U+01F1 Ǳ
  U+01F2 ǲ
  U+01F3 ǳ

Latin Extended-B - Wikipedia

Vim's ~ operator cycles through the three values. How will or
should Ruby treat them?

Bertram

On Wednesday, 29 Jul 2009, 16:43:36 +0900, Brian Candler wrote:

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

Brian_Candler · 29 July 2009 10:32

Brian Candler wrote:

Can someone confirm whether this is intentional or not?

>> s = "über"
=> "über"
>> s.upcase
=> "üBER"

To answer my own question: looking at the source code, it looks like
this *is* intentional. From encoding.c:

int
rb_enc_toupper(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_UPPER_CASE(c):(c));
}

int
rb_enc_tolower(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_LOWER_CASE(c):(c));
}

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
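
As a rough Ruby-level sketch of what those two C functions amount to: only the 26 ASCII letters are mapped, and every other character passes through unchanged. The helper names `ascii_upcase` and `ascii_downcase` are made up for this illustration and are not part of Ruby.

```ruby
# Hypothetical helpers mirroring the ASCII-only rule quoted from encoding.c.
# String#tr with ASCII ranges only touches a-z / A-Z; everything else,
# including accented letters, is copied through as-is.
def ascii_upcase(str)
  str.tr('a-z', 'A-Z')
end

def ascii_downcase(str)
  str.tr('A-Z', 'a-z')
end

puts ascii_upcase("über")    # the ü is outside a-z, so it is left alone
puts ascii_downcase("ÜBER")  # likewise the Ü is outside A-Z
```

This reproduces the "üBER" result from the start of the thread without depending on how any particular Ruby version implements String#upcase.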


Brian_Candler · 29 July 2009 10:21

Bertram Scharpf wrote:

Vim's ~ operator cycles through the three values. How will or
should Ruby treat them?

I only have access to a slightly older 1.9.2 here, but:

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-04-08 trunk 23158) [i686-linux]"
>> ["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:lower:]]/ }

0
=> ["Ǳ", "ǲ", "ǳ"]
>> ["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:upper:]]/ }
0

=> ["Ǳ", "ǲ", "ǳ"]

So the first is upper, the third is lower, and the second is neither.

upcase/downcase do not affect any of them, but I'm not sure whether the
current behaviour is correct, which is why I started this thread.
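
The "neither" result fits with U+01F2 being a titlecase letter in Unicode (general category Lt) rather than an uppercase (Lu) or lowercase (Ll) one. A small sketch that asks the regexp engine for the general category directly; `letter_case` is a hypothetical helper name, and the property names `\p{Lu}`, `\p{Lt}` and `\p{Ll}` are assumed to be available in the Ruby build in use.

```ruby
# Classify a single character by its Unicode general category.
# Lu = uppercase letter, Lt = titlecase letter, Ll = lowercase letter.
def letter_case(ch)
  case ch
  when /\A\p{Lu}\z/ then :upper
  when /\A\p{Lt}\z/ then :title
  when /\A\p{Ll}\z/ then :lower
  else :other
  end
end

["\u01f1", "\u01f2", "\u01f3"].each do |c|
  # Print the code point next to the category found for it.
  puts format("U+%04X %s", c.ord, letter_case(c))
end
```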


Inaki_Baz_Castillo · 29 July 2009 17:13

Is it the correct approach?
For me it's very clear that the upcase version of á is Á.

2009/7/29 Brian Candler <b.candler@pobox.com>:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

--
Iñaki Baz Castillo
<ibc@aliax.net>

Brian_Candler · 29 July 2009 17:28

Iñaki Baz Castillo wrote:

2009/7/29 Brian Candler <b.candler@pobox.com>:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it's very clear that the upcase version of á is Á.

There are perfectly clear Unicode rules for case conversion, but they
are not simple. In some cases you need to replace one character by two
(e.g. ß to SS).
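
To make the "one character by two" point concrete, here is a toy sketch. The mapping table is invented for this example and holds only three entries; it is not real Unicode data, and `toy_upcase` is a hypothetical helper, not a Ruby method.

```ruby
# Toy upcase: a tiny hand-written table plus ASCII fallback.
# The table is illustrative only and far from complete.
def toy_upcase(str)
  table = {
    "ü" => "Ü",
    "á" => "Á",
    "ß" => "SS"   # one input character becomes two output characters
  }
  str.each_char.map { |ch| table.fetch(ch) { ch.tr('a-z', 'A-Z') } }.join
end

src = "straße"
dst = toy_upcase(src)
# The result is one character longer than the input.
puts "#{src} (#{src.length} chars) -> #{dst} (#{dst.length} chars)"
```

Because the output can be longer than the input, a per-character function that returns a single code point, like the C functions quoted earlier in the thread, cannot express this kind of rule.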

There is a useful discussion about this from Python's point of view
here:


Yukihiro_Matsumoto2 · 29 July 2009 17:32

Hi,

In message "Re: upcase in 1.9.2-preview1" on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz Castillo <ibc@aliax.net> writes:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it's very clear that the upcase version of á is Á.

But it's locale dependent. In some languages, upper/lower case
conversion is not a one-to-one mapping.
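
A commonly cited case of locale dependence is Turkish, where the uppercase of "i" is the dotted "İ" (U+0130) and the dotless "ı" (U+0131) uppercases to "I". A minimal sketch, assuming a hypothetical `locale_upcase` helper and made-up locale labels `:en` and `:tr`; none of this is an existing Ruby API.

```ruby
# Hypothetical per-locale upcase for a single character.
# Only the Turkish i/ı pair is modelled; everything else falls back
# to plain ASCII conversion.
def locale_upcase(ch, locale)
  overrides = {
    :tr => { "i" => "\u0130", "\u0131" => "I" }
  }
  table = overrides.fetch(locale, {})
  table.fetch(ch) { ch.tr('a-z', 'A-Z') }
end

puts locale_upcase("i", :en)  # plain ASCII rule
puts locale_upcase("i", :tr)  # Turkish rule gives the dotted capital
```

The same input character gives two different answers, so a method that takes no locale argument has to pick one rule or leave such characters alone.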

matz.

Daniel_DeLorme · 3 August 2009 12:54

Yukihiro Matsumoto wrote:

>> That is: only ASCII characters (potentially encoded as UTF16 or
>> whatever) are eligible for case conversion.
>
>Is it the correct approach?
>For me it's very clear that the upcase version of á is Á.

But it's locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.

99% of the time it's locale independent. I don't follow this logic of "if we can't make it work for everyone then it should stay broken for everyone". Following the Unicode rules would fix uppercasing for 95% of those not using English. And for the 5% of those who have to deal with those asymmetric upper/lower case rules... well, they'd have to deal with it either way.

*percentages above are guesswork

In message "Re: upcase in 1.9.2-preview1" on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz Castillo <ibc@aliax.net> writes:
