Open-uri, nokogiri and UTF-8 to US-ASCII

Hello,

I'm trying to retrieve search results from the internet using nokogiri and open-uri. Apparently 'open-uri' can't handle UTF-8 directly, so I'm trying to convert the string to ASCII, but I still get an error. Here is the chunk of code:

----------------------------------------
# encoding: UTF-8

require "nokogiri"
require "open-uri"

word = "Ελληνικά"
ascii_word = word.force_encoding("ASCII").to_s
result = open("http://search.lycos.com/web?q=#{ascii_word}", "User-Agent" => "HTTP_USER_AGENT:Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.47 S
doc = Nokogiri::HTML(result)
----------------------------------------
And the error I get is:
----------------------------------------
[...]:in `open': invalid byte sequence in US-ASCII (ArgumentError)
  from lycos.rb:8:in `<main>'
----------------------------------------

I'm on Mac OS X ML, using ruby (rvm) 1.9.3.

I tried using 'force_encoding("US-ASCII")' but it's not a recognized format. The word is Greek and uses UTF-8. Any ideas would be welcome.

Thanks for your time,

Best Regards

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0xE736C6A0
gpg --keyserver x-hkp://pgp.mit.edu --recv-keys 0xE736C6A0
--
The wise man said: "Never argue with an idiot. They bring you down to their level and beat you with experience."

As per the RFC (2396?), you need to percent-encode the non-ASCII part, like this:

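----------------------------------------
#!/usr/bin/ruby
# encoding: UTF-8
# (the magic comment must directly follow the shebang line)

require "nokogiri"
require "open-uri"

# Percent-encode the non-ASCII characters before building the URL.
word = URI.encode("Ελληνικά")
result = open("http://search.lycos.com/web?q=#{word}",
  "User-Agent" =>
  "HTTP_USER_AGENT:Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.47")
doc = Nokogiri::HTML(result)
puts doc
----------------------------------------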

-jh
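
A minimal sketch of why the original attempt fails and what percent-encoding changes, with no network access needed. It assumes `URI.encode_www_form_component` is an acceptable stand-in for `URI.encode` (the latter was removed in later Ruby versions); the variable names are only illustrative.

```ruby
# encoding: UTF-8
require "uri"

word = "Ελληνικά"

# force_encoding only relabels the existing bytes; it does not convert them.
# The UTF-8 bytes are all above 0x7F, so the relabeled string is invalid ASCII.
relabeled = word.dup.force_encoding("US-ASCII")
relabeled_valid = relabeled.valid_encoding?

# Percent-encoding turns every non-ASCII byte into a %XX escape,
# which leaves a plain ASCII string that is safe to put in a query.
encoded = URI.encode_www_form_component(word)
url = "http://search.lycos.com/web?q=#{encoded}"

puts relabeled_valid      # is the relabeled string valid US-ASCII?
puts encoded.ascii_only?  # is the percent-encoded string pure ASCII?
puts url
```

Note that `encode_www_form_component` follows form-encoding rules, so a space becomes `+`, which is fine for a search query but may not suit other parts of a URL.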

···

On Thu, 8 Nov 2012 01:07:41 +0900, Panagiotis Atmatzidis wrote:

I'm trying to retrieve search results from the internet using nokogiri and open-uri. [...]

If one were to examine the ruby URI docs, how would one know that there
is a method named URI.escape?

···

--
Posted via http://www.ruby-forum.com/.

try eg,

ri URI::Escape.escape

best regards -botp

···

On Thu, Nov 8, 2012 at 9:41 AM, 7stud -- <lists@ruby-forum.com> wrote:

If one were to examine the ruby URI docs, how would one know that there
is a method named URI.escape?

botp wrote in post #1083491:

try eg,

ri URI::Escape.escape

But to write that, you already have to know there is an escape() method
in some namespace somewhere. How come when I look at the docs, there
isn't a list of methods that I can call on URI?


URI is big. As of now, we'll have to dig down further.

$ ri URI | grep "* URI::"
* URI::Generic (in uri/generic.rb)
  * URI::FTP - (in uri/ftp.rb)
  * URI::HTTP - (in uri/http.rb)
    * URI::HTTPS - (in uri/https.rb)
  * URI::LDAP - (in uri/ldap.rb)
    * URI::LDAPS - (in uri/ldaps.rb)
  * URI::MailTo - (in uri/mailto.rb)
* URI::Parser - (in uri/common.rb)
* URI::REGEXP - (in uri/common.rb)
  * URI::REGEXP::PATTERN - (in uri/common.rb)
* URI::Util - (in uri/common.rb)
* URI::Escape - (in uri/common.rb)
* URI::Error - (in uri/common.rb)
  * URI::InvalidURIError - (in uri/common.rb)
  * URI::InvalidComponentError - (in uri/common.rb)
  * URI::BadURIError - (in uri/common.rb)

You can create a tiny script to drill down on those, or you can use the
HTML ruby-doc, which has clickable links for the above.

ri isn't perfect, like all other docs, but I guess you already knew that ;-)

best regards
-botp
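
Another way to get such a list, without ri, is to ask the module itself. This is a small sketch using Ruby's reflection methods; the filter pattern is just an illustrative choice, and the output is not shown because it varies by Ruby version.

```ruby
require "uri"

# Module-level methods callable as URI.something, including any that
# come from modules URI extends.
module_methods = URI.singleton_methods.sort

# Keep only the names that deal with encoding, escaping, or decoding.
encoding_related = module_methods.grep(/enc|esc|dec/)

puts module_methods.inspect
puts encoding_related.inspect

# Whether URI.escape exists depends on the Ruby version, so ask
# rather than assume.
puts URI.respond_to?(:escape)
```

This only tells you the names; for the documentation of a particular one you still need ri or the HTML docs.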

···

On Fri, Nov 9, 2012 at 7:58 AM, 7stud -- <lists@ruby-forum.com> wrote:

How come when I look at the docs, there isn't a list of methods that I can call on URI?