[ENCODING] UTF8 hell

Xavier_Noelle · 2 February 2010 11:15

Hello,
I'm trying to deal with Ruby's encoding flaws, which I thought
would be mostly a thing of the past with Ruby 1.9. I managed to find a
solution for Ruby 1.8 and thought I had one for Ruby 1.9... but in fact, no!

I fetch rows from a UTF-8 database and try to work with the strings. To
do so, I would like them to be UTF-8 encoded.

"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
lines would solve the problem
str.replace(Iconv.iconv("UTF8", "ascii", self).join())
OR
self.encode!('UTF-8')

But they don't !
First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
(Encoding::UndefinedConversionError)

The base string is "Oeuvre complète pour luth" and displays well in PHPMyAdmin.

Any idea ?
TIA,

···

--
Xavier NOELLE

Stefano_Crocco · 2 February 2010 12:54

I'm not sure, but based on my experience, it may be that the strings are
indeed stored as UTF-8, but the library you use to read from the database
doesn't inform Ruby of the fact, so Ruby assumes each one is a
generic array of bytes (that is, Ruby thinks the string has encoding
ASCII-8BIT, which is the same as BINARY).

If this is the case, you don't need to transcode the string (which is what
encode does), but simply tell ruby which is the correct encoding, using the
force_encoding method.
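
A minimal sketch of the difference between the two calls, assuming the driver hands back correct UTF-8 bytes under a binary label. The helper name tag_as_utf8 is made up for this illustration and is not part of any library.

```ruby
# Hypothetical helper: relabel raw bytes as UTF-8 without changing them.
# Returns nil when the bytes are not valid UTF-8, so the caller can react.
def tag_as_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : nil
end

# Bytes as a driver might return them: UTF-8 content, binary label.
raw = "Oeuvre compl\xC3\xA8te pour luth".b

begin
  raw.encode('UTF-8')        # transcoding: fails on the first byte >= 0x80
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.message}"
end

text = tag_as_utf8(raw)      # relabelling: the bytes are left untouched
puts text.encoding
puts text.bytes == raw.bytes
```

The nil return is a design choice for the sketch: relabelling never fails by itself, so the validity check is what tells you whether the assumption about the bytes was right.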

I hope this helps

Stefano


David_Palm · 2 February 2010 13:26

>I fetch rows from an UTF8 database and try to work with the string. To
>do so, I would like it to be UTF8 encoded.

There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).

>self.encode!('UTF-8')

str.force_encoding('UTF-8') is what you want to use, I think.

Xavier_Noelle · 2 February 2010 14:12

>There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).

Not a Rails app.

>str.force_encoding('UTF-8') is what you want to use I think.

I already tried this method, but it led me to the following error: in
`downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase!() later in the application.

Any idea how to solve this?


--
Xavier NOELLE

Robert_K1 · 2 February 2010 14:48

You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String's byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.
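
One way to look at the bytes, as a rough sketch. The helper name describe_bytes is invented for this example; it dumps the bytes in hex and checks whether they would be valid UTF-8, whatever label the string currently carries.

```ruby
# Hypothetical helper: show a string's bytes in hex and report whether
# they form valid UTF-8, independent of the string's current encoding label.
def describe_bytes(str)
  hex = str.bytes.map { |b| format('%02X', b) }.join(' ')
  valid = str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  { hex: hex, valid_utf8: valid }
end

p describe_bytes("m\xC3\xA9dicals".b)  # é stored as two bytes
p describe_bytes("m\xE9dicals".b)      # é stored as one byte
```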

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Xavier_Noelle · 23 February 2010 11:10

I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
Trusting the string's #encoding was the wrong choice, so I assumed
instead that the database provided valid UTF-8 strings.

BUT (because, there's a but...), for some reason I don't understand,
some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
byte sequence in UTF-8 (ArgumentError).

Where am I wrong ?

TIA,


--
Xavier NOELLE

Marc_Heiler · 23 February 2010 14:03

How does python solve this?

···

--
Posted via http://www.ruby-forum.com/.

Yukihiro_Matsumoto2 · 23 February 2010 14:41

Hi,

In message "Re: [ENCODING] UTF8 hell" on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle@gmail.com> writes:

>self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>
>233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
>self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
>byte sequence in UTF-8 (ArgumentError).

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.
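
The two byte sequences can be reproduced with a short sketch. The helper name bytes_in is hypothetical; the é is written as a code point escape so the example does not depend on the encoding of the source file.

```ruby
# Hypothetical helper: the bytes a piece of text has in a given encoding.
def bytes_in(text, encoding)
  text.encode(encoding).bytes
end

word = "m\u00E9dicals"            # é given as code point U+00E9

p bytes_in(word, 'UTF-8')         # é becomes the two bytes 195 169
p bytes_in(word, 'ISO-8859-1')    # é becomes the single byte 233
```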

matz.

Robert_K1 · 23 February 2010 22:10

>>You probably first want to find out whether the byte sequence is valid
>>UTF-8 or not. For that you would need to look at the bytes in the
>>String. I guess chances are that your String's byte sequence is NOT
>>valid UTF-8 OR you have a character in the string that has no
>>lowercase representation.
>
>I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
>Believing that the string #encoding was right was a wrong choice, then
>I assumed the database provided valid UTF8 strings.

The string you show below does not look UTF-8 encoded; it is more likely ISO-8859-1 or similar. Forcing an encoding leaves the byte sequence untouched, which leads to the kind of error you describe below.

>BUT (because, there's a but...), for some reason I don't understand,
>some strings are unwilling to work:
>
>Example:
>puts self => médicals
>self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>
>233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
>self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
>byte sequence in UTF-8 (ArgumentError).
>
>Where am I wrong ?

As far as I can see, 233 starts a three-byte sequence.

I did not dig deeper, but it may be that by forcing UTF-8 on an ISO-something encoded string you broke it.

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Rick_DeNatale1 · 23 February 2010 15:03

233 for e accent acute would be valid for ISO-8859-1 encoding, not UTF-8.


--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Xavier_Noelle · 23 February 2010 15:18

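Indeed. In the meantime, I changed the code to this:
# Returns true if the receiver's bytes decode as UTF-8.
# unpack('U*') raises ArgumentError on a malformed sequence.
def isUTF8()
  begin
    self.unpack('U*')
  rescue
    return false
  end
  return true
end

if isUTF8()
  # The bytes are already UTF-8: only relabel the string.
  self.force_encoding('UTF-8')
else
  # Assume ISO-8859-1, then transcode the bytes to UTF-8.
  self.force_encoding('ISO-8859-1')
  self.encode!('UTF-8')
end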

This (ugly) quick fix works for what I need, but I don't know whether the
problem can be solved some other way. The underlying issue is that
my SQL database has a VARBINARY column with an unknown encoding. Is
there a way to deal with the various possible encodings, or to ask MySQL
to return data converted to UTF-8, or is it necessary to clean the data
before inserting it?
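
For the Ruby side of that question, here is a hedged sketch of the same idea written as a standalone function using String#valid_encoding?. The name to_utf8 and the ISO-8859-1 default are assumptions made for this example. Nothing here detects the real encoding of the column; every byte sequence that is not valid UTF-8 is simply treated as the fallback.

```ruby
# Hypothetical helper: return a UTF-8 copy of bytes whose encoding is unknown.
# Assumption: anything that is not valid UTF-8 is taken to be `fallback`.
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Relabel as the fallback encoding, then transcode the bytes to UTF-8.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

puts to_utf8("m\xE9dicals".b).downcase       # one-byte é, converted
puts to_utf8("m\xC3\xA9dicals".b).downcase   # already UTF-8, only relabelled
```

Because the function works on copies, the original string keeps its bytes and label, which makes it easier to log the raw value when a row turns out to be in yet another encoding.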


--
Xavier NOELLE

Jorg_W_Mittag1 · 23 February 2010 17:00

Yukihiro Matsumoto wrote:

>>self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>>
>>233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
>>self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
>>byte sequence in UTF-8 (ArgumentError).
>
>233 is not a valid UTF-8 character. The byte sequence for médicals is
><109 195 169 100 105 99 97 108 115>.

A general hint for debugging encoding troubles: the UTF-8 encoding
*guarantees* that every Unicode codepoint is *either* encoded into a
*single* octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
octets, *all* of which have their MSB set to 1 (i.e. a decimal value
between 128 and 255).

A *single* octet with its MSB set to 1 can *never* be a valid UTF-8
character, it can only be part of a multi-octet character, i.e. it
must appear either immediately before or after or between another
octet with its MSB set. However, in your string there is no
multi-octet character sequence, there is only a single character with
its MSB set (the second one with the decimal value 233), so you can
see without having to look at any code tables that this string
*cannot* possibly be a UTF-8 string.
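
The check described above can be sketched in a few lines. The helper name lone_high_bytes is invented for this example. Note that it is only a quick screen: a string with no isolated high-bit byte can still be invalid UTF-8, so String#valid_encoding? remains the full test.

```ruby
# Hypothetical helper: offsets of bytes that have their MSB set while both
# neighbours (if any) have it cleared. Such a byte can never occur in UTF-8.
def lone_high_bytes(str)
  bytes = str.bytes
  bytes.each_index.select do |i|
    high      = bytes[i] >= 0x80
    prev_high = i > 0 && bytes[i - 1] >= 0x80
    next_high = i + 1 < bytes.size && bytes[i + 1] >= 0x80
    high && !prev_high && !next_high
  end
end

p lone_high_bytes("m\xE9dicals".b)      # the single byte 233 at offset 1
p lone_high_bytes("m\xC3\xA9dicals".b)  # 195 169 are adjacent, nothing isolated
```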

As Rick already hinted, it is either an ISO/IEC 8859-1, ISO/IEC
8859-2, ISO/IEC 8859-3, ISO/IEC 8859-4, ISO/IEC 8859-9, ISO/IEC
8859-10, ISO/IEC 8859-13, ISO/IEC 8859-14, ISO/IEC 8859-15, ISO/IEC
8859-16, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-9,
ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16 or
Windows-1252 string (it's impossible to tell, but makes no difference
in this case). My guess is on ISO-8859-15.

[This property is BTW what makes UTF-8 compatible with ASCII, because
it guarantees that *every* Unicode character which is also in ASCII,
will be encoded the same way as it would be in ASCII and every Unicode
character which is *not* in ASCII will be encoded as a sequence of
octets each of which is illegal in ASCII. It also provides some
robustness against 8-bit encodings such as the ISO8859 family, because
statistically it is very likely that *somewhere* in the text, there
will be a single octet with its MSB set (in this case, it's the é and
in my name it's the ö), which is surrounded by octets with their MSB
cleared, which cannot ever happen in UTF-8.]

jwm


Perry_Smith1 · 23 February 2010 17:20

>A general hint for debugging encoding troubles: the UTF-8 encoding
>*guarantees* that every Unicode codepoint is *either* encoded into a
>*single* octet with its most significant bit cleared to 0 (i.e. a
>decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
>octets, *all* of which have their MSB set to 1 (i.e. a decimal value
>between 128 and 255).

Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
or 6 but not 3 nor 5 octets?

···

--
Posted via http://www.ruby-forum.com/.

Michael_Fellinger1 · 24 February 2010 04:12

>>233 is not a valid UTF-8 character. The byte sequence for médicals is
>><109 195 169 100 105 99 97 108 115>.
>
>Indeed. In the meantime, I changed the code with this one:
>def isUTF8()
>  begin
>    self.unpack('U*')
>  rescue
>    return false
>  end
>  return true
>end
>
>if isUTF8()
>  self.force_encoding('UTF-8')
>else
>  self.force_encoding('ISO-8859-1')
>  self.encode!('UTF-8')
>end

string = "\xE8te pour luth"
# "\xE8te pour luth"
string.encoding
# #<Encoding:UTF-8>
string.valid_encoding?
# false
string.force_encoding('ISO-8859-1')
# "ète pour luth"
string.valid_encoding?
# true
string.upcase
# "èTE POUR LUTH"


--
Michael Fellinger
CTO, The Rubyists, LLC
972-996-5199

Jorg_W_Mittag1 · 23 February 2010 21:45

Perry Smith wrote:

>>A general hint for debugging encoding troubles: the UTF-8 encoding
>>*guarantees* that every Unicode codepoint is *either* encoded into a
>>*single* octet with its most significant bit cleared to 0 (i.e. a
>>decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
>>octets, *all* of which have their MSB set to 1 (i.e. a decimal value
>>between 128 and 255).
>
>Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
>or 6 but not 3 nor 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets,
not 6. (I was confused by the algorithm: the algorithm actually allows
for up to 8 bytes, but because of the way Unicode characters are
allocated, and UTF-8 is defined, it is guaranteed that there will
never be more than 4.)

The encodings look like this:

0xxxxxxx for ASCII
110xxxxx 10xxxxxx for U+80 to U+7FF
1110xxxx 10xxxxxx 10xxxxxx for U+800 to U+FFFF and
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+10FFFF

This is actually pretty clever:

* you can always tell whether you are inside a multibyte sequence or
  not because of the high bit,
* you can always tell whether a byte in the sequence is the first one
  or a later one, because the first one always starts with 11 and the
  other ones always start with 10 and
* you can always tell how long a sequence is by the number of 1 bits
  in the start byte: two-byte sequences start with two 1s, three-byte
  sequences start with three 1s and four-byte sequences start with
  four 1s.

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example. You can also
jump over bytes if you are counting the length.
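
A rough sketch of "jumping over bytes" when counting characters, based on the bit patterns listed above. Both function names are hypothetical. The sketch only inspects lead bytes; it does not check that the following bytes really are continuation bytes, so it is not a validator.

```ruby
# Length of a UTF-8 sequence judging by its first byte, or nil when the byte
# is a continuation byte (10xxxxxx) or a pattern UTF-8 does not use.
def utf8_sequence_length(lead)
  return 1 if lead < 0x80              # 0xxxxxxx
  return nil if lead < 0xC0            # 10xxxxxx: continuation byte
  return 2 if (lead & 0xE0) == 0xC0    # 110xxxxx
  return 3 if (lead & 0xF0) == 0xE0    # 1110xxxx
  return 4 if (lead & 0xF8) == 0xF0    # 11110xxx
  nil                                  # 11111xxx: not used by UTF-8
end

# Count characters by hopping from one lead byte to the next.
def count_characters(str)
  bytes = str.bytes
  offset = 0
  count = 0
  while offset < bytes.size
    len = utf8_sequence_length(bytes[offset])
    raise ArgumentError, "unexpected byte #{bytes[offset]} at offset #{offset}" if len.nil?
    offset += len
    count += 1
  end
  count
end

puts utf8_sequence_length(233)          # 233 is 11101001, a three-byte lead
puts count_characters("m\u00E9dicals")
```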

jwm
