Yukihiro Matsumoto wrote:
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
byte sequence in UTF-8 (ArgumentError).
233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.
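To make the difference concrete, here is a small sketch (assuming Ruby 1.9 or later; the variable names are mine, not from the original script) that prints both byte sequences and shows what happens when the one-byte-per-character form is labelled as UTF-8:

```ruby
# The \u escape makes this literal UTF-8 regardless of the source file encoding.
utf8   = "m\u00E9dicals"
latin1 = utf8.encode("ISO-8859-1")     # same text, one byte per character

utf8_bytes   = utf8.bytes.to_a
latin1_bytes = latin1.bytes.to_a

p utf8_bytes    # [109, 195, 169, 100, 105, 99, 97, 108, 115]
p latin1_bytes  # [109, 233, 100, 105, 99, 97, 108, 115]

# Label the Latin-1 bytes as UTF-8, which is presumably what the failing
# script ends up doing (an assumption; the script itself is not shown).
mislabeled = latin1.dup.force_encoding("UTF-8")
p mislabeled.valid_encoding?  # false
```

String#valid_encoding? is a cheap way to check a string before handing it to gsub or a regexp.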
A general hint for debugging encoding troubles: the UTF-8 encoding
*guarantees* that every Unicode codepoint is *either* encoded into a
*single* octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) *or* into a *sequence* of 2 to 4
octets (up to 6 in the original definition, before RFC 3629), *all*
of which have their MSB set to 1 (i.e. a decimal value between 128
and 255).
A *single* octet with its MSB set to 1 can *never* be a valid UTF-8
character; it can only be part of a multi-octet character, i.e. it
must appear immediately before or after at least one other octet
with its MSB set. However, your string contains no multi-octet
sequence: there is only a single octet with its MSB set (the second
one, with the decimal value 233), so you can see without having to
look at any code tables that this string *cannot* possibly be a UTF-8
string.
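The rule of thumb above can be sketched in a few lines of Ruby. The helper name lone_high_byte_indexes is made up for this illustration. It only tests the "no isolated high byte" condition, which is necessary but not sufficient for valid UTF-8; String#valid_encoding? does the full check.

```ruby
# Returns the indexes of bytes that have their MSB set (value >= 128)
# while both neighbours, where they exist, have it cleared.
# Hypothetical helper, for illustration only.
def lone_high_byte_indexes(bytes)
  high = lambda { |b| !b.nil? && b >= 128 }
  (0...bytes.length).select do |i|
    prev = i > 0 ? bytes[i - 1] : nil   # avoid wrapping around to bytes[-1]
    high.call(bytes[i]) && !high.call(prev) && !high.call(bytes[i + 1])
  end
end

p lone_high_byte_indexes([109, 233, 100, 105, 99, 97, 108, 115])       # [1]
p lone_high_byte_indexes([109, 195, 169, 100, 105, 99, 97, 108, 115])  # []
```

A non-empty result means the data cannot be UTF-8. An empty result does not prove that it is.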
As Rick already hinted, it is either an ISO/IEC 8859-n or ISO-8859-n
string (for n = 1, 2, 3, 4, 9, 10, 13, 14, 15 or 16) or a
Windows-1252 string (it's impossible to tell, but it makes no
difference in this case). My guess is ISO-8859-15.
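If the guess is right, one possible repair is to tell Ruby which encoding the bytes really are in and then transcode to UTF-8. This is only a sketch: the variable names are mine, and it assumes the Ruby build ships transcoders for all the listed encodings, which the standard distribution does as far as I know.

```ruby
# The bytes from the original post, as a binary (ASCII-8BIT) string.
raw = [109, 233, 100, 105, 99, 97, 108, 115].pack("C*")

# Label the bytes with their assumed real encoding, then convert to UTF-8.
text = raw.dup.force_encoding("ISO-8859-15").encode("UTF-8")
p text.valid_encoding?   # true
p text.bytes.to_a        # [109, 195, 169, 100, 105, 99, 97, 108, 115]

# Every candidate encoding decodes byte 233 to the same character,
# which is why the choice makes no difference for this string.
candidates = %w[ISO-8859-1 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-9
                ISO-8859-10 ISO-8859-13 ISO-8859-14 ISO-8859-15
                ISO-8859-16 Windows-1252]
decoded = candidates.map { |name| raw.dup.force_encoding(name).encode("UTF-8") }
p decoded.uniq.length    # 1
```

Once the string is valid UTF-8, calls such as text.gsub('ruby', 'zorglub') have nothing left to complain about.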
[This property is BTW what makes UTF-8 compatible with ASCII, because
it guarantees that *every* Unicode character which is also in ASCII
will be encoded the same way as it would be in ASCII, and every
Unicode character which is *not* in ASCII will be encoded as a
sequence of octets, each of which is illegal in ASCII. It also
provides some robustness against 8-bit encodings such as the ISO 8859
family, because statistically it is very likely that *somewhere* in
the text there will be a single octet with its MSB set (in this case
it's the é, and in my name it's the ö) which is surrounded by octets
with their MSB cleared, which cannot ever happen in UTF-8.]
jwm
In message "Re: [ENCODING] UTF8 hell" on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle@gmail.com> writes: