String#upcase/downcase with UTF-8 strings in Ruby 1.9

Stefan_Schmidt · 9 July 2008 22:09

Hello,

in Ruby 1.9 I get the following behaviour:

"aoueäöüé".upcase

=> "AOUEäöüé"

"AOUEÄÖÜÉ".downcase

=> "aoueÄÖÜÉ"

I can't, however, find a bug report for this in the bug tracking system.
Doesn't this qualify as a bug?

Cheers, Stefan

Yukihiro_Matsumoto2 · 9 July 2008 23:25

Hi,

> in Ruby 1.9 I get the following behaviour:
>
> "aoueäöüé".upcase
> => "AOUEäöüé"
>
> "AOUEÄÖÜÉ".downcase
> => "aoueÄÖÜÉ"
>
> I can't, however, find a bug report for this in the bug tracking system.
> Doesn't this qualify as a bug?

The document for String#upcase says:

  call-seq:
     str.upcase => new_str

  Returns a copy of <i>str</i> with all lowercase letters replaced with their
  uppercase counterparts. The operation is locale insensitive---only
  characters ``a'' to ``z'' are affected.
  Note: case replacement is effective only in ASCII region.

     "hEllO".upcase #=> "HELLO"

See "Note:". Tim Bray persuaded me to do so, since case
conversion outside the ASCII region is highly dependent on country,
language, culture and script.

matz.
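
To make the language dependence concrete, here is a minimal sketch. It assumes Ruby 2.4 or later, where String#upcase and String#downcase apply Unicode case mapping by default and accept options such as :ascii and :turkic. Under Ruby 1.9, the version discussed in this thread, only the ASCII behaviour is available.

```ruby
# encoding: utf-8
# Assumes Ruby 2.4+; the option symbols below do not exist in Ruby 1.9.

word = "aoueäöüé"

# :ascii restricts the mapping to a-z, which reproduces the 1.9 result.
ascii_only = word.upcase(:ascii)   # "AOUEäöüé"

# Without an option, the language-neutral Unicode mapping is used.
unicode = word.upcase              # "AOUEÄÖÜÉ"

# The same letter maps differently depending on the language:
# Turkish pairs "i" with dotted "İ" and dotless "ı" with "I".
default_i = "i".upcase             # "I"
turkish_i = "i".upcase(:turkic)    # "İ"

# Some mappings change the length of the string: German "ß" becomes "SS".
sharp_s = "straße".upcase          # "STRASSE"

puts ascii_only, unicode, default_i, turkish_i, sharp_s
```

The Turkish and German cases show why no single mapping is right for every text: the caller has to say which language rules apply.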

In message "Re: String#upcase/downcase with UTF-8 strings in Ruby 1.9" on Thu, 10 Jul 2008 07:09:29 +0900, "Stefan Schmidt" <Stefan.Schmidt@gmx.net> writes:

James_Britt · 10 July 2008 00:16

This leaves the perfect opening for people to contribute locale- or language-specific extensions to String.
It would make a great gem with a plug-in architecture.
Just add options for the language you want to use.
In any case, character conversions can get very tricky across different languages.
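
A rough sketch of what such a plug-in approach could look like follows. Everything here is hypothetical: the LocaleCase module, its register and upcase methods, and the :de and :tr rules are invented for illustration and do not come from any existing gem. The rules themselves are deliberately tiny and cover only a handful of letters.

```ruby
# encoding: utf-8
# Hypothetical sketch: LocaleCase is not an existing library.
module LocaleCase
  @rules = {}

  # Register an upcasing rule (a block taking a string) for a language tag.
  def self.register(lang, &rule)
    @rules[lang] = rule
  end

  # Upcase str with the rule for lang; unknown or missing languages
  # fall back to the ASCII-only behaviour of String#upcase in Ruby 1.9.
  def self.upcase(str, lang = nil)
    rule = @rules.fetch(lang) { ->(s) { s.tr("a-z", "A-Z") } }
    rule.call(str)
  end
end

# German: umlauts have uppercase forms, and "ß" expands to "SS".
LocaleCase.register(:de) do |s|
  s.gsub("ß", "SS").tr("a-zäöü", "A-ZÄÖÜ")
end

# Turkish: dotted "i" pairs with "İ", dotless "ı" pairs with "I".
LocaleCase.register(:tr) do |s|
  s.tr("iı", "İI").tr("a-zçğöşü", "A-ZÇĞÖŞÜ")
end

puts LocaleCase.upcase("aoueäöüé")        # ASCII-only fallback
puts LocaleCase.upcase("straße", :de)
puts LocaleCase.upcase("istanbul", :tr)
```

Keeping each language in its own registered block means new languages can be added without touching String itself, which is the point of the plug-in idea.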

Stefan_Schmidt · 10 July 2008 01:17

> The document for String#upcase says:

Yes, sorry, I should have read the documentation.

> See "Note:". Tim Bray persuaded me to do so, since case
> conversion outside the ASCII region is highly dependent on country,
> language, culture and script.

So basically the Python guys are going down the wrong route?

# -*- coding: utf-8 -*-
import string
print string.upper(u"aoueäöüé")
print string.lower(u"AOUEÄÖÜÉ")

works as expected.

Cheers, Stefan

James_Britt · 10 July 2008 01:25

No.
They're going down a different route.
Seriously, the language handling is something that could easily be handled by extensions. It does not need to be a core part of the language.
Even operating systems handle these things with proprietary and very sophisticated techniques based on the language in question.
In most cases, the upper-case characters you expect may well be 'correct', but that ultimately depends on the language and the context.

Stefan_Schmidt · 10 July 2008 15:39

> > So basically the Python guys are going down a wrong route ?
>
> No.
> They're going down a different route.
> Seriously, the language handling is something that could easily be
> handled by extensions. It does not need to be a core part of the
> language.

Is Nikolai Weibull's Ruby Character Encodings Library [1] currently the best way to go?

Stefan

[1] http://bitwi.se/software/ruby/character-encodings/

Stefan_Schmidt · 11 July 2008 05:29

> Seriously, the language handling is something that could easily be
> handled by extensions. It does not need to be a core part of the
> language.

Are there any working extensions for Ruby 1.9 that offer Unicode support for String#downcase/upcase and/or Array#sort?

Stefan
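
For the sorting half of that question, the underlying issue can be sketched as follows. String comparison in Ruby works byte by byte, so in UTF-8 every non-ASCII letter sorts after "z". The collation_key helper below is a hypothetical, hand-rolled stand-in that folds a few German letters onto their base letters; it is not a real collation algorithm and not the API of any library mentioned in this thread.

```ruby
# encoding: utf-8
words = ["Zebra", "Äpfel", "apfel", "Öl", "Ofen"]

# Plain sort compares bytes: uppercase ASCII first, then lowercase ASCII,
# then everything outside ASCII.
bytewise = words.sort

# Hypothetical helper: fold umlauts and "ß" to base letters and ignore case.
# Real collation rules are language specific and far more involved.
def collation_key(word)
  word.tr("ÄÖÜäöü", "AOUaou").gsub("ß", "ss").downcase
end

# Sort by the folded key; the original word breaks ties deterministically.
folded = words.sort_by { |w| [collation_key(w), w] }

p bytewise
p folded
```

With the folded key, "Äpfel" lands next to "apfel" and "Öl" next to "Ofen", which is closer to what a German reader would expect.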
