Open-uri with non-ascii character

ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin10.8.0]

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
The URL contains a non-ASCII character in its query string. In this
particular case, it's Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

it gives an error

/Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日 (URI::InvalidURIError)
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/open-uri.rb:32:in `open'
  from split_words_and_search_using_api.rb:23:in `<main>'

Somehow, I need to convert the (UTF-8) character into a form that is
valid in a URL.

Could anybody suggest how to do that?

soichi


--
Posted via http://www.ruby-forum.com/.

Quoting Soichi Ishida (lists@ruby-forum.com):

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
the url contains non-ascii character as a query. In this particular
case, it's Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

Try this (query must be a correctly encoded UTF-8 string):

require 'webrick/httputils'

..
..

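# Treat the string as raw bytes so the escaping works byte by byte.
query.force_encoding('binary')
# Percent-encode the characters not allowed in a URI (here, the non-ASCII bytes).
query = WEBrick::HTTPUtils.escape(query)
# Nokogiri::HTML accepts the IO returned by open directly; no .read is needed.
doc = Nokogiri::HTML(open(query))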
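
For what it's worth, here is a minimal sketch of what the escaping step amounts to: each byte of the character's UTF-8 encoding is written as %XX. It assumes a Ruby whose standard library provides URI.encode_www_form_component (1.9.2 or later). The helper percent_encode_bytes is hypothetical, written only for this illustration, and is not part of WEBrick or open-uri.

```ruby
require 'uri'

# Hypothetical helper (not from the thread): percent-encode every byte of
# the string's UTF-8 representation, including ASCII bytes.
def percent_encode_bytes(text)
  text.encode('UTF-8').bytes.map { |byte| format('%%%02X', byte) }.join
end

base      = 'http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint='
character = "\u513F" # the character from the original question

manual  = percent_encode_bytes(character)               # byte-by-byte version
builtin = URI.encode_www_form_component(character)      # standard library version

# Only the query value is escaped; the rest of the URL is left untouched.
url = base + builtin

puts manual
puts builtin
puts url
```

Escaping only the query value, rather than the whole URL, avoids touching characters such as ? and = that have a structural meaning in the URL.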

Carlo


Subject: Open-uri with non-ascii character
  Date: Sun 06 Jan 13 12:03:01PM +0900

--
Carlo E. Prelz - fluido@fluido.as
If the Way and its Virtue had not been set aside, what need would there
be to talk so much about love and righteousness? (Chuang-Tzu)

Thanks. Now I can open the site.
I will be able to parse it then.

soichi


Interestingly enough, if you look at what gets sent through as the
link above, it's:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E5%84%BF

which is also what you'd obtain via:

URI.escape("http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿")
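
One caveat for anyone reading this on a newer Ruby: URI.escape was deprecated and then removed in Ruby 3.0, so the call above only works on older versions. A rough equivalent, as a sketch rather than a drop-in replacement, is to build the query string with URI.encode_www_form from the standard library. The variable names below are only illustrative.

```ruby
require 'uri'

base_url = 'http://www.unicode.org/cgi-bin/GetUnihanData.pl'

# encode_www_form percent-encodes the value and joins key and value with "=".
query_string = URI.encode_www_form('codepoint' => "\u513F")
full_url     = "#{base_url}?#{query_string}"

# The escaped URL is ASCII-only, so URI.parse accepts it.
uri = URI.parse(full_url)

puts full_url
```

Unlike URI.escape, this escapes only the parameter values, so it also copes with values that contain & or = themselves.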


On Sat, Jan 5, 2013 at 9:03 PM, Soichi Ishida <lists@ruby-forum.com> wrote:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
the url contains non-ascii character as a query. In this particular
case, it's Chinese.