Ruby1.9: Encoding problems (how to use #force_encoding ?)

Hi, I'm using the geo_location Ruby gem, which returns a hash with the
geolocation of a given IP.

I use Ruby 1.9 and UTF-8 works fine, but in this case, when the "city" has
"strange" symbols, the gem returns the string encoded as ASCII-8BIT.

For example:

    Alarc'n (in theory it should be "Alarcón")

I need to send this string to a server which mandates UTF-8, so sending it
as is fails.

I've tried to convert the encoding but received an error:

  result.encode "UTF-8"
   => `encode': "\xF3" from ASCII-8BIT to UTF-8
      (Encoding::UndefinedConversionError)

I've also tried force_encoding:

  result.force_encoding "UTF-8"

and then, the "result" string is converted to UTF-8 (I've checked
result.encoding) but it's also not valid for the server and when printing it I
see the same as before.

I need all of this just for a simple demo, so it would be enough for me just
to delete the invalid UTF-8 chars from the result string, but I don't know
how to do it.

Any help please?

--
Iñaki Baz Castillo <ibc@aliax.net>

Iñaki Baz Castillo wrote:

> Hi, I'm using the geo_location Ruby gem, which returns a hash with the
> geolocation of a given IP.

Lots of gems are not ruby-1.9 compatible. You should probably report
problems to the author, ideally with a patch which fixes it, and a test
case which reproduces it.

> I use Ruby 1.9 and UTF-8 works fine, but in this case, when the "city"
> has "strange" symbols, the gem returns the string encoded as ASCII-8BIT.

All data read from a socket is tagged as ASCII-8BIT by default. That's
probably what's happening in the library you're using.
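
A minimal sketch of that default, assuming a Unix-like system where UNIXSocket.pair is available (the socket pair only stands in for the gem's real network connection, which isn't shown in this thread):

```ruby
require "socket"

# A connected pair of sockets stands in for the gem's network connection.
reader, writer = UNIXSocket.pair

# Send the city name as ISO-8859-1 style bytes (an assumed sample).
writer.write("Alarc\xF3n".dup.force_encoding(Encoding::ASCII_8BIT))
writer.close

data = reader.read
reader.close

# Sockets are opened in binary mode, so the string is expected to be
# tagged ASCII-8BIT no matter what the bytes actually represent.
puts data.encoding
puts data.bytes.inspect
```

Nothing in the bytes changes on the way through; Ruby simply has no information about which character set they belong to.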

> I need to send this string to a server which mandates UTF-8, so sending
> it as is fails.

That doesn't make much sense. A string, when it hits a socket, is just a
stream of bytes. So you should be sending the same stream of bytes as
you receive.

> I've tried to convert the encoding but received an error:
>
>   result.encode "UTF-8"
>    => `encode': "\xF3" from ASCII-8BIT to UTF-8
>       (Encoding::UndefinedConversionError)

That's correct. Transcoding tries to *transcode* (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents.
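
To illustrate, here is a sketch that reproduces the error and then transcodes with an explicit source encoding. It assumes the gem's bytes are ISO-8859-1 (where \xF3 is "ó"); that is a guess based on the error message, not something the gem documents here:

```ruby
# Sample bytes as the gem seems to return them (assumed), tagged ASCII-8BIT.
raw = "Alarc\xF3n".dup.force_encoding(Encoding::ASCII_8BIT)

failure = nil
begin
  # No source character set is known, so byte 0xF3 cannot be mapped.
  raw.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  failure = e.message
end

# Assumption: the bytes are really ISO-8859-1. Passing the source encoding
# as the second argument lets Ruby map 0xF3 to U+00F3.
city = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

puts failure
puts city.encoding
puts city.valid_encoding?
```

If the real source encoding is something else (Windows-1252, for example), the second argument would have to change accordingly.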

> I've also tried force_encoding:
>
>   result.force_encoding "UTF-8"
>
> and then, the "result" string is converted to UTF-8 (I've checked
> result.encoding)

It's not converted, it's just tagged as being a string of UTF-8
characters, which it sounds like it is.
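
Whether the retagged bytes really form valid UTF-8 can be checked with String#valid_encoding?. A small sketch, assuming the city name contains the single byte \xF3 shown in the error message above:

```ruby
# Assumed sample: the city name with a lone 0xF3 byte, tagged ASCII-8BIT.
raw = "Alarc\xF3n".dup.force_encoding(Encoding::ASCII_8BIT)

# force_encoding only changes the label; the bytes stay exactly the same.
tagged = raw.dup.force_encoding(Encoding::UTF_8)

same_bytes = (tagged.bytes == raw.bytes)
valid      = tagged.valid_encoding?

puts tagged.encoding   # UTF-8
puts same_bytes        # true
puts valid             # false
```

Under that assumption the lone \xF3 byte is not a complete UTF-8 sequence, which would explain why the server still rejects the string and why it prints the same as before.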

> but it's also not valid for the server

Again, doesn't mean much without seeing the code which is trying to
submit this to the server.

> I need all of this just for a simple demo, so it would be enough for me
> just to delete the invalid UTF-8 chars from the result string, but I
> don't know how to do it.

   str.force_encoding("ASCII-8BIT") # if not already
   str.gsub!(/[^\x20-\x7e]/,'')
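
The same stripping can be sketched on a sample string, next to an alternative that lets String#encode drop whatever it cannot convert. The sample bytes are an assumption based on the error message earlier in the thread:

```ruby
# Assumed sample input, tagged ASCII-8BIT as the gem returns it.
raw = "Alarc\xF3n".dup.force_encoding(Encoding::ASCII_8BIT)

# Approach from this post: keep only printable 7-bit ASCII bytes.
ascii_only = raw.gsub(/[^\x20-\x7e]/, "")

# Alternative: transcode to UTF-8 and replace unconvertible bytes with "".
dropped = raw.encode(Encoding::UTF_8,
                     invalid: :replace, undef: :replace, replace: "")

puts ascii_only        # Alarcn
puts dropped           # Alarcn
puts dropped.encoding  # UTF-8
```

Both variants lose the accented character, which is acceptable for a demo but not a real fix; the second one at least returns a string already tagged as UTF-8.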

--
Posted via http://www.ruby-forum.com/.

Dear Iñaki,

maybe you can use CGI.escape for your encoding problems?

I fetched the Spanish wikipedia page for Alarcón like so:

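    # encoding: utf-8

    require "cgi"
    require "open-uri"

    # Percent-encode the UTF-8 bytes of the search term for the query string.
    search_what = CGI.escape("Alarcón")
    page = "http://es.wikipedia.org/w/index.php?title=Especial%3ABuscar&search=#{search_what}&fulltext=Buscar"
    # On Ruby 1.9, open-uri lets Kernel#open fetch URLs; Ruby 3.0+ needs URI.open.
    open(page) { |f| print f.read }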
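
Note that CGI.escape only percent-encodes whatever bytes it is given; it does not change the string's encoding. A quick sketch, with the ISO-8859-1 variant being an assumption about what the gem returns:

```ruby
require "cgi"

utf8   = "Alarc\u00F3n"              # "ó" stored as the two bytes C3 B3
latin1 = utf8.encode("ISO-8859-1")   # "ó" stored as the single byte F3

escaped_utf8   = CGI.escape(utf8)
escaped_latin1 = CGI.escape(latin1)

puts escaped_utf8     # Alarc%C3%B3n
puts escaped_latin1   # Alarc%F3n
```

So a server that expects UTF-8 would still see the wrong bytes if the input string was not UTF-8 to begin with.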

Best regards,

Axel

"That's correct. Transcoding tries to *transcode* (replace characters one
at a time), and these high characters in ASCII-8BIT have no Unicode
equivalents."

That isn't true. In fact, \xc characters are Unicode code points and not
ASCII-2 characters. You have a Unicode string (encoded in UTF-8) whose
encoding is incorrectly set to ASCII-2. To solve this problem, try this:

    begin
      str.encode! Encoding::UTF_8 if str.encoding != Encoding::UTF_8
    rescue Encoding::UndefinedConversionError
      # string incorrectly encoded, try force
      str.force_encoding Encoding::UTF_8
    end

Thanks, but the problem is that the geo_location Ruby gem returns a
wrong string (encoded in ASCII-8BIT) since it contains invalid chars
for ASCII-8BIT encoding, so Ruby fails when trying to convert it to
another encoding :(

--
Iñaki Baz Castillo
<ibc@aliax.net>

Pedro G. wrote in post #1040715:

> "That's correct. Transcoding tries to *transcode* (replace characters one
> at a time), and these high characters in ASCII-8BIT have no Unicode
> equivalents."
>
> It isn't true

Yes, it *is* true, because it's exactly what the Ruby encoding
"ASCII_8BIT" means. It allows you to use \x80 to \xFF without defining
what character set those are in. Hence these characters cannot be
transcoded, since it's undefined what they are.

(Also, why are you resurrecting a 2-year-old thread?)

> , in fact, \xc characters are Unicode code points (in UTF-8
> encoding) and not ASCII-2 characters. You have a Unicode String with
> UTF-8 encoding and encoding incorrectly set to ASCII-2, to solve this
> problem try this:

What do you mean by ASCII-2? Standard ASCII is only a 7-bit character
set. There are a whole bunch of 8-bit extensions to ASCII, e.g.
ISO-8859-1, Windows-1252 etc. They all define different character sets
for \x80 to \xff. The encoding "ASCII_8BIT" makes no assertion about
what these high characters are.
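
For instance, the byte \x80 means different things under two of those extensions. A short sketch using the encoding names Ruby ships with:

```ruby
# One raw byte with no character set attached.
byte = "\x80".dup.force_encoding(Encoding::ASCII_8BIT)

# The same byte, interpreted under two different 8-bit extensions of ASCII.
as_cp1252 = byte.encode("UTF-8", "Windows-1252")  # the euro sign, U+20AC
as_latin1 = byte.encode("UTF-8", "ISO-8859-1")    # a C1 control code, U+0080

puts as_cp1252.codepoints.inspect
puts as_latin1.codepoints.inspect
```

Until the source character set is stated, Ruby cannot pick between these readings, which is why transcoding from ASCII-8BIT raises instead of guessing.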

>   begin
>     str.encode! Encoding::UTF_8 if str.encoding != Encoding::UTF_8
>   rescue Encoding::UndefinedConversionError
>     # string incorrectly encoded, try force
>     str.force_encoding Encoding::UTF_8
>   end

That's wrong, and just shows you don't understand the problem.
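
For comparison, a sketch of a more careful conversion: retag as UTF-8 only if the bytes already are valid UTF-8, otherwise transcode from a fallback encoding. The helper name to_utf8 and the ISO-8859-1 fallback are assumptions for illustration, not part of Ruby or the gem:

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +bytes+.
# Assumption: anything that is not valid UTF-8 is in +fallback+.
def to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: declare the assumed source encoding and transcode.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

latin1_bytes = "Alarc\xF3n".dup.force_encoding(Encoding::ASCII_8BIT)
utf8_bytes   = "Alarc\xC3\xB3n".dup.force_encoding(Encoding::ASCII_8BIT)

from_latin1 = to_utf8(latin1_bytes)
from_utf8   = to_utf8(utf8_bytes)

puts from_latin1.valid_encoding?  # true
puts from_utf8.valid_encoding?    # true
puts from_latin1 == from_utf8     # true
```

This is still a heuristic: some non-UTF-8 byte sequences happen to be valid UTF-8, so knowing the real source encoding is always better than guessing.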

There are no invalid characters in ASCII-8BIT. It's a catch-all encoding. So that's definitely not the problem… ;)

James Edward Gray II

Ok, that's a good point.
I'll try it.

--
Iñaki Baz Castillo <ibc@aliax.net>