Integer#chr

Detlef_Wagner · 24 August 2019 18:01

Hi,

I have a problem with Integer#chr when the encoding is UTF-8. If I call

0xC384.chr(Encoding::UTF_8)

in irb, I expect the German umlaut Ä, but I get 쎄, and I can't even
say what language it is.

Is it a bug, or am I doing something wrong?

Cheers, detlef

ruby version ruby 2.5.5p157 (2019-03-15 revision 67260) [x86_64-linux-gnu]
irb version 0.9.6(09/06/30)

Skip_Gibson · 24 August 2019 18:18

I think you’ve just used the wrong Unicode code point; this should work:

0x00C4.chr(Encoding::UTF_8)

Hope that helps,
Skip

Jorg_W_Mittag3 · 25 August 2019 14:20

Hi Detlef,

This is not a bug, this is exactly the expected behavior. If you look
up U+C384 in the Unicode Character Database, you will find:

https://unicode.org/cldr/utility/character.jsp?a=C384
쎄
C384
HANGUL SYLLABLE SSE
Hangul Script

So, in other words, you get exactly the character you asked for.

Note that Unicode was explicitly designed such that ASCII characters
have the same code point in Unicode as they have in ASCII, which means
that Latin letters without diacritics and Arabic numerals generally
have the lowest code points. After that come most other European
characters. So, the very high code point (50052) should already have
tipped you off that this can't be a European character.
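
To make the ordering concrete, here is a small sketch (the variable names are only for illustration) that asks Integer#chr for three code points and prints each one with its hexadecimal and decimal value and its UTF-8 bytes:

```ruby
# Integer#chr with an encoding argument treats the receiver as a code point.
ascii_a = 0x41.chr(Encoding::UTF_8)    # LATIN CAPITAL LETTER A, same value as in ASCII
umlaut  = 0xC4.chr(Encoding::UTF_8)    # LATIN CAPITAL LETTER A WITH DIAERESIS
hangul  = 0xC384.chr(Encoding::UTF_8)  # HANGUL SYLLABLE SSE

[ascii_a, umlaut, hangul].each do |char|
  hex_bytes = char.bytes.map { |b| format("%02X", b) }.join(" ")
  puts format("%s  U+%04X  decimal %d  UTF-8 bytes: %s", char, char.ord, char.ord, hex_bytes)
end
```

The last column shows that the code point and the UTF-8 byte sequence only coincide in the ASCII range.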

The German Umlaut Ä (Latin capital letter A with diaeresis) can be
formed in Unicode in two different ways:

As the single pre-composed character U+00C4:

https://unicode.org/cldr/utility/character.jsp?a=00C4
Ä
00C4
LATIN CAPITAL LETTER A WITH DIAERESIS
Latin Script

or as the combination of the two characters U+0041 Latin capital
letter A and U+0308 Combining Diaeresis:

https://unicode.org/cldr/utility/character.jsp?a=0041
A
0041
LATIN CAPITAL LETTER A
Latin Script

followed by

https://unicode.org/cldr/utility/character.jsp?a=0308
̈
0308
COMBINING DIAERESIS
Nonspacing Mark
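
As a rough illustration in Ruby, the two forms are different strings that look the same on screen. The sketch below assumes Ruby 2.2 or later, where String#unicode_normalize is available, and uses it to convert between them:

```ruby
precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed  = "A\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

p precomposed.codepoints  # [196]
p decomposed.codepoints   # [65, 776]

# Plain string comparison works on code points, so the two forms differ.
p precomposed == decomposed                          # false

# Normalizing to a common form makes them comparable.
p decomposed.unicode_normalize(:nfc) == precomposed  # true
p precomposed.unicode_normalize(:nfd) == decomposed  # true
```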

The two forms have different advantages and disadvantages.
For example, the pre-composed form is closer to how a pre-Unicode
encoding such as ISO8859-1 would work. The de-composed form has
the advantage that the base letter is a plain ASCII character, and in
UTF-8, ASCII characters encode to the same bytes as in ASCII, which
means that obsolete text processors that strip anything that is not
7-bit will not mangle the text quite as badly. E.g. the word "Bär",
which in the de-composed form consists of the Unicode code points

U+0042 U+0061 U+0308 U+0072

will be encoded in UTF-8 as

\x42 \x61 \xCC\x88 \x72

When you process this with a broken text processor that strips
non-7-bit characters, the result will be

\x42 \x61 \x72

which is "Bar", and when processed with a broken text processor that
clears the 8th bit, the result will be

\x42 \x61 \x4C\x08 \x72

which is "BaL<BS>r".

OTOH, the word "Bär" using a pre-composed Latin small letter a with
diaeresis consists of the Unicode code points

U+0042 U+00E4 U+0072

which is encoded in UTF-8 as

\x42 \xC3\xA4 \x72

which will be mangled by broken text processors into

\x42 \x72 "Br"

or

\x42 \x43\x24 \x72 "BC$r".
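
The mangling can be reproduced with a short sketch. The two helper methods are hypothetical stand-ins for the "broken text processors" described here, not part of any library; one drops every byte with the high bit set, the other clears that bit:

```ruby
# Hypothetical model of a processor that drops every non-7-bit byte.
def strip_high_bytes(str)
  str.bytes.select { |b| b < 0x80 }.pack("C*").force_encoding(Encoding::US_ASCII)
end

# Hypothetical model of a processor that clears the 8th bit of every byte.
def clear_eighth_bit(str)
  str.bytes.map { |b| b & 0x7F }.pack("C*").force_encoding(Encoding::US_ASCII)
end

decomposed  = "Ba\u0308r"  # B, a, COMBINING DIAERESIS, r
precomposed = "B\u00E4r"   # B, LATIN SMALL LETTER A WITH DIAERESIS, r

p strip_high_bytes(decomposed)   # "Bar"
p clear_eighth_bit(decomposed)   # "BaL\br"
p strip_high_bytes(precomposed)  # "Br"
p clear_eighth_bit(precomposed)  # "BC$r"
```

Ruby prints the byte 0x08 as "\b", the backspace control character.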

Which of the two forms you prefer is largely a matter of context,
and also at least a bit of taste.

Greetings,
Jörg.

Kim_Pedersen · 25 August 2019 15:35

Hi

The German umlaut Ä has the Unicode code point U+00C4, which in UTF-8
encoding becomes "\xC3\x84".

puts "\u00c4"                            # prints Ä

puts "\xc3\x84".force_encoding('UTF-8')  # prints Ä

The number 0xc384 interpreted as a Unicode code point is a Korean character.
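
If the starting point really is the UTF-8 byte sequence rather than a code point, one possible approach is to build the string from the bytes. This is only a sketch; the variable names are made up, and the "n" directive assumes the two bytes were packed into a single 16-bit big-endian integer:

```ruby
# The two bytes C3 84, read as UTF-8, decode to the single code point U+00C4.
from_bytes = [0xC3, 0x84].pack("C*").force_encoding(Encoding::UTF_8)
puts from_bytes                        # Ä
puts format("U+%04X", from_bytes.ord)  # U+00C4

# If the bytes arrive packed into one integer, "n" (16-bit big-endian) splits them again.
from_integer = [0xC384].pack("n").force_encoding(Encoding::UTF_8)
puts from_integer == from_bytes        # true

# The opposite direction: the code point U+C384 needs three UTF-8 bytes.
puts 0xC384.chr(Encoding::UTF_8).bytes.map { |b| format("%02X", b) }.join(" ")  # EC 8E 84
```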

Best regards

Kim
