Hi Detlef,
I have a problem with Integer#chr when the encoding is UTF-8. If I call
0xC384.chr(Encoding::UTF_8)
in irb, I expect the German umlaut Ä, but I get 쎄, and I can't even
say what language it is.
Is it a bug, or am I doing something wrong?
This is not a bug; it is exactly the expected behavior. Integer#chr
treats its receiver as a code point, so 0xC384 means U+C384. If you
look up U+C384 in the Unicode Character Database, you will find:
https://unicode.org/cldr/utility/character.jsp?a=C384
쎄
C384
HANGUL SYLLABLE SSE
Hangul Script
So, in other words, you get exactly the character you asked for.
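You can check this yourself in irb. The following is only a small
sketch; the variable name is made up for illustration:

```ruby
# Integer#chr takes a code point and returns the character at that
# code point in the given encoding.
hangul = 0xC384.chr(Encoding::UTF_8)

puts hangul                                         # the character at U+C384
puts hangul.ord                                     # => 50052
puts hangul.bytes.map { |b| "%02X" % b }.join(" ")  # => EC 8E 84
```

The last line shows the UTF-8 bytes of U+C384, which are a different
thing from the code point itself.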
Note that Unicode was explicitly designed such that ASCII characters
have the same code point in Unicode as they have in ASCII, which means
that unaccented Latin letters and Arabic numerals have the lowest
code points. After that come most other European characters. So, the
very high code point (50052 in decimal) should already have tipped you
off that this can't be a European character.
The German umlaut Ä (Latin capital letter A with diaeresis) can be
formed in Unicode in two different ways:
As the single pre-composed character U+00C4:
https://unicode.org/cldr/utility/character.jsp?a=00C4
Ä
00C4
LATIN CAPITAL LETTER A WITH DIAERESIS
Latin Script
or as the combination of the two characters U+0041 Latin capital
letter A and U+0308 Combining Diaeresis:
https://unicode.org/cldr/utility/character.jsp?a=0041
A
0041
LATIN CAPITAL LETTER A
Latin Script
followed by
https://unicode.org/cldr/utility/character.jsp?a=0308
̈
0308
COMBINING DIAERESIS
Nonspacing Mark
The two forms have different advantages and disadvantages. For
example, the pre-composed form is closer to how a pre-Unicode
encoding such as ISO 8859-1 would work. The de-composed form has
the advantage that the base letter stays a plain ASCII character, and
UTF-8 encodes ASCII characters as the same bytes as ASCII does, which
means that obsolete text processors that strip anything that is not
7-bit will not mangle the text quite as badly. E.g. the word "Bär",
which in the de-composed form consists of the Unicode code points
U+0042 U+0061 U+0308 U+0072
will be encoded in UTF-8 as
\x42 \x61 \xCC\x88 \x72
When you process this with a broken text processor that strips
non-7-bit characters, the result will be
\x42 \x61 \x72
which is "Bar", and when processed with a broken text processor that
clears the 8th bit, the result will be
\x42 \x61 \x4C\x08 \x72
which is "BaL<BS>r".
OTOH, the word "Bär" using a pre-composed Latin small letter a with
diaeresis consists of the Unicode code points
U+0042 U+00E4 U+0072
which is encoded in UTF-8 as
\x42 \xC3\xA4 \x72
which will be mangled by broken text processors into
\x42 \x72 "Br"
or
\x42 \x43\x24 \x72 "BC$r".
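If you want to reproduce the mangling, here is a rough sketch. The
two helper methods are hypothetical stand-ins that I made up for the
two kinds of broken text processor; they are not part of any library.

```ruby
# Hypothetical broken processor 1: drops every byte with the 8th bit set.
def strip_high_bytes(str)
  str.bytes.reject { |b| b > 0x7F }.pack("C*").force_encoding(Encoding::US_ASCII)
end

# Hypothetical broken processor 2: keeps every byte but clears its 8th bit.
def clear_high_bit(str)
  str.bytes.map { |b| b & 0x7F }.pack("C*").force_encoding(Encoding::US_ASCII)
end

decomposed  = "Ba\u0308r"   # U+0042 U+0061 U+0308 U+0072
precomposed = "B\u00E4r"    # U+0042 U+00E4 U+0072

p strip_high_bytes(decomposed)    # => "Bar"
p clear_high_bit(decomposed)      # => "BaL\br"  (\b is backspace, 0x08)
p strip_high_bytes(precomposed)   # => "Br"
p clear_high_bit(precomposed)     # => "BC$r"
```

With the de-composed form the base letter survives in both cases;
with the pre-composed form it is lost or replaced.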
Which of the two forms you prefer is largely a matter of context,
and at least a bit of taste.
Greetings,
Jörg.