Ruby1.9: Encoding problems (how to use #force_encoding ?)

Inaki_Baz_Castillo · 1 September 2009 22:37

Hi, I'm using the geo_location Ruby gem, which returns a hash with the
geolocation of a given IP.

I use Ruby 1.9 and UTF-8 works fine, but in this case, when the "city" contains
"strange" symbols, the gem returns the string encoded as ASCII-8BIT.

For example:

Alarc'n (theoretically it should be "Alarcón")

I need to send this string to a server which mandates UTF-8, so sending
it as is fails.

I've tried to convert the encoding but received an error:

  result.encode "UTF-8"
   => `encode': "\xF3" from ASCII-8BIT to UTF-8
      (Encoding::UndefinedConversionError)

I've also tried force_encoding:
result.force_encoding "UTF-8"

and then, the "result" string is converted to UTF-8 (I've checked
result.encoding) but it's also not valid for the server and when printing it I
see the same as before.

I need all of this just for a simple demo, so it would be enough for me just to
delete the invalid UTF-8 chars from the result string, but I don't know
how to do it.

Any help please?


--
Iñaki Baz Castillo <ibc@aliax.net>

Brian_Candler · 2 September 2009 12:15

Iñaki Baz Castillo wrote:

> Hi, I'm using the geo_location Ruby gem, which returns a hash with the
> geolocation of a given IP.

Lots of gems are not ruby-1.9 compatible. You should probably report
problems to the author, ideally with a patch which fixes it, and a test
case which reproduces it.

> I use Ruby 1.9 and UTF-8 works fine, but in this case, when the "city"
> contains "strange" symbols, the gem returns the string encoded as ASCII-8BIT.

All data read from a socket is tagged as ASCII-8BIT by default. That's
probably what's happening in the library you're using.

> I need to send this string to a server which mandates UTF-8, so
> sending it as is fails.

That doesn't make much sense. A string, when it hits a socket, is just a
stream of bytes. So you should be sending the same stream of bytes as
you receive.

> I've tried to convert the encoding but received an error:
>
>   result.encode "UTF-8"
>    => `encode': "\xF3" from ASCII-8BIT to UTF-8
>       (Encoding::UndefinedConversionError)

That's correct. Transcoding tries to *transcode* (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents.
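
The failure is easy to reproduce outside the gem. This is only a minimal sketch: it assumes the gem hands back the bytes of the city name with `\xF3` in place of the accented letter (the byte shown in the error message), and the variable names are made up for illustration.

```ruby
# Assumed input: the city name as raw bytes, tagged binary (ASCII-8BIT),
# the way data read from a socket is tagged by default.
raw = "Alarc\xF3n".dup.force_encoding("ASCII-8BIT")

outcome =
  begin
    raw.encode("UTF-8")
    :converted
  rescue Encoding::UndefinedConversionError
    # ASCII-8BIT does not say which character \xF3 stands for,
    # so there is nothing to map it to.
    :undefined_conversion
  end

puts outcome
```

The pure-ASCII part of the string is not the problem; the exception is triggered by the single high byte.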

> I've also tried force_encoding:
> result.force_encoding "UTF-8"
>
> and then, the "result" string is converted to UTF-8 (I've checked
> result.encoding)

It's not converted, it's just tagged as being a string of UTF-8
characters, which it sounds like it is.
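
A quick way to see that force_encoding only changes the label is to compare the bytes before and after, and to ask Ruby whether the bytes are well-formed under the new label. A small sketch, again assuming the `\xF3` byte from the error message; `raw` and `tagged` are illustrative names.

```ruby
raw = "Alarc\xF3n".dup.force_encoding("ASCII-8BIT")

# force_encoding relabels the string in place, so work on a copy.
tagged = raw.dup.force_encoding("UTF-8")

same_bytes = tagged.bytes.to_a == raw.bytes.to_a  # the bytes are untouched
well_formed = tagged.valid_encoding?              # do they form valid UTF-8?

puts tagged.encoding
puts same_bytes
puts well_formed
```

For this particular byte sequence the last check comes back false (a lone `\xF3` followed by `n` is not a complete UTF-8 sequence), which would explain why the server still rejects the string after it has been relabelled.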

> but it's also not valid for the server

Again, doesn't mean much without seeing the code which is trying to
submit this to the server.

> I need all of this just for a simple demo, so it would be enough for me
> just to delete the invalid UTF-8 chars from the result string, but I don't
> know how to do it.

str.force_encoding("ASCII-8BIT") # if not already
str.gsub!(/[^\x20-\x7e]/,'')
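
The two lines above can be wrapped up so the caller's string is left alone and the result comes back tagged as UTF-8. A sketch only; `strip_non_ascii` is a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: drop every byte outside printable ASCII.
def strip_non_ascii(str)
  bytes = str.dup.force_encoding("ASCII-8BIT")  # treat the copy as raw bytes
  bytes.gsub!(/[^\x20-\x7e]/, "")               # same pattern as above
  bytes.force_encoding("UTF-8")                 # plain ASCII is also valid UTF-8
end

cleaned = strip_non_ascii("Alarc\xF3n")
puts cleaned
puts cleaned.valid_encoding?
```

Note that the pattern also removes tabs and newlines, since they fall below `\x20`. That is fine for a city name but may not be what you want for longer text.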

--
Posted via http://www.ruby-forum.com/.

Axel_Etzold · 2 September 2009 13:00

Dear Iñaki,

maybe you can use CGI.escape for your encoding problems?

I fetched the Spanish wikipedia page for Alarcón like so:

# encoding: utf-8

require "cgi"
require 'open-uri'

search_what=CGI.escape("Alarcón")
page="http://es.wikipedia.org/w/index.php?title=Especial%3ABuscar&search=#{search_what}&fulltext=Buscar"
open(page){ |f| print f.read }

Best regards,

Axel


Pedro_G · 13 January 2012 16:09

"That's correct. Transcoding tries to *transcode* (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents."

That isn't true. In fact, \xc characters are Unicode code points and not
ASCII-2 characters. You have a Unicode string (UTF-8 encoded) whose
encoding is incorrectly set to ASCII-2. To solve this problem, try this:

begin
  str.encode! Encoding::UTF_8 if str.encoding != Encoding::UTF_8
rescue Encoding::UndefinedConversionError
   #string incorrectly encoded try force
   str.force_encoding Encoding::UTF_8
end


Inaki_Baz_Castillo · 2 September 2009 14:59

Thanks, but the problem is that the geo_location Ruby gem returns a
wrong string (encoded in ASCII-8BIT) since it contains invalid chars
for the ASCII-8BIT encoding, so Ruby fails when trying to convert it to
another encoding.

2009/9/2 Axel Etzold <AEtzold@gmx.de>:

> Dear Iñaki,
>
> maybe you can use CGI.escape for your encoding problems?
>
> I fetched the Spanish wikipedia page for Alarcón like so:
>
> # encoding: utf-8
>
> require "cgi"
> require 'open-uri'
>
> search_what=CGI.escape("Alarcón")
> page="http://es.wikipedia.org/w/index.php?title=Especial%3ABuscar&search=#{search_what}&fulltext=Buscar"
> open(page){ |f| print f.read }

--
Iñaki Baz Castillo
<ibc@aliax.net>

Brian_Candler · 14 January 2012 08:52

Pedro G. wrote in post #1040715:

> "That's correct. Transcoding tries to *transcode* (replace characters one
> at a time), and these high characters in ASCII-8BIT have no Unicode
> equivalents."
>
> It isn't true

Yes, it *is* true, because it's exactly what the Ruby encoding
"ASCII_8BIT" means. It allows you to use \x80 to \xFF without defining
what character set those are in. Hence these characters cannot be
transcoded, since it's undefined what they are.

(Also, why are you resurrecting a 2-year-old thread?)

> , in fact, \xc characters are Unicode code points (in UTF-8
> encoding) and not ASCII-2 characters. You have a Unicode String with
> UTF-8 encoding and encoding incorrectly set to ASCII-2, to solve this
> problem try this:

What do you mean by ASCII-2? Standard ASCII is only a 7-bit character
set. There are a whole bunch of 8-bit extensions to ASCII, e.g.
ISO-8859-1, Windows-1252 etc. They all define different character sets
for \x80 to \xff. The encoding "ASCII_8BIT" makes no assertion about
what these high characters are.
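
To make the point concrete: the same byte becomes a different character depending on which 8-bit character set you declare, and transcoding only works once you have declared one. The sketch below is illustrative; in particular, treating the gem's data as ISO-8859-1 is an assumption (it fits because `\xF3` is "ó" there), not something the thread confirms.

```ruby
byte = "\xA4".dup.force_encoding("ASCII-8BIT")

# Same byte, two different 8-bit character sets, two different characters.
as_latin1 = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")   # U+00A4, currency sign
as_latin9 = byte.dup.force_encoding("ISO-8859-15").encode("UTF-8")  # U+20AC, euro sign

puts as_latin1 == as_latin9

# Assumption: the gem's bytes are ISO-8859-1. Declare that first, then
# transcode; encode now has a defined character to map from.
city = "Alarc\xF3n".dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts city
puts city.valid_encoding?
```

If the guess about the source character set is wrong, the result will still be valid UTF-8 but will contain the wrong characters, so it is worth confirming what the data source really sends.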

> begin
>   str.encode! Encoding::UTF_8 if str.encoding != Encoding::UTF_8
> rescue Encoding::UndefinedConversionError
>    #string incorrectly encoded try force
>    str.force_encoding Encoding::UTF_8
> end

That's wrong, and just shows you don't understand the problem.


JEG2 · 2 September 2009 15:15

There are no invalid characters in ASCII-8BIT. It's a catch all Encoding. So that's definitely not the problem…
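
This can be checked directly with String#valid_encoding?. A short sketch that builds a string containing every possible byte value; the variable names are illustrative.

```ruby
# Every byte value from 0x00 through 0xFF in one string.
# Array#pack with "C*" returns a string tagged ASCII-8BIT.
all_bytes = (0..255).to_a.pack("C*")

is_binary     = all_bytes.encoding == Encoding::ASCII_8BIT
valid_binary  = all_bytes.valid_encoding?

# The same bytes relabelled as UTF-8 are no longer well-formed.
valid_as_utf8 = all_bytes.dup.force_encoding("UTF-8").valid_encoding?

puts is_binary
puts valid_binary
puts valid_as_utf8
```

So a string tagged ASCII-8BIT can never be "invalid"; the trouble only starts when Ruby is asked to convert those bytes to characters without being told which character set they came from.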

James Edward Gray II

On Sep 2, 2009, at 9:59 AM, Iñaki Baz Castillo wrote:

> 2009/9/2 Axel Etzold <AEtzold@gmx.de>:
>
>> Dear Iñaki,
>>
>> maybe you can use CGI.escape for your encoding problems?
>>
>> I fetched the Spanish wikipedia page for Alarcón like so:
>>
>> # encoding: utf-8
>>
>> require "cgi"
>> require 'open-uri'
>>
>> search_what=CGI.escape("Alarcón")
>> page="http://es.wikipedia.org/w/index.php?title=Especial%3ABuscar&search=#{search_what}&fulltext=Buscar"
>> open(page){ |f| print f.read }
>
> Thanks but the problem is that the geo_location Ruby gem returns a
> wrong string (encoded in ASCII-8BIT) since it contains invalid chars
> for ASCII-8BIT encoding, so Ruby fails when trying to convert it to
> other encoding

Inaki_Baz_Castillo · 2 September 2009 19:41

Ok, that's a good point.
I'll try it.

On Wednesday, 2 September 2009, James Edward Gray II wrote:

>>> search_what=CGI.escape("Alarcón")
>>> page="http://es.wikipedia.org/w/index.php?title=Especial%3ABuscar&search=#{search_what}&fulltext=Buscar"
>>> open(page){ |f| print f.read }
>>
>> Thanks but the problem is that the geo_location Ruby gem returns a
>> wrong string (encoded in ASCII-8BIT) since it contains invalid chars
>> for ASCII-8BIT encoding, so Ruby fails when trying to convert it to
>> other encoding
>
> There are no invalid characters in ASCII-8BIT. It's a catch all
> Encoding. So that's definitely not the problem…

--
Iñaki Baz Castillo <ibc@aliax.net>
