Open-uri with non-ascii character

7stud2 · 6 January 2013 03:03

ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin10.8.0]

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿

The URL contains a non-ASCII character in its query string. In this
particular case, it's Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

it gives an error

/Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:176:in `split': bad URI(is not URI?): http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=日 (URI::InvalidURIError)
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
  from /Users/soichi/.rvm/rubies/ruby-1.9.3-p286/lib/ruby/1.9.1/open-uri.rb:32:in `open'
  from split_words_and_search_using_api.rb:23:in `<main>'

Somehow I need to convert the UTF-8 character into a form that is valid
in a URL.

Could anybody suggest how to do that?

soichi

--
Posted via http://www.ruby-forum.com/.

Carlo_E_Prelz · 6 January 2013 07:16

Quoting Soichi Ishida (lists@ruby-forum.com):

I want to parse a page like

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
the url contains non-ascii character as a query. In this particular
case, it's Chinese.

If I try to open this page like

doc = Nokogiri::HTML(open(query)).read

Try this (query must contain the correct UTF-8):

require 'webrick/httputils'

..
..

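# Treat the string as raw bytes so each UTF-8 byte gets percent-encoded
query.force_encoding('binary')
query = WEBrick::HTTPUtils.escape(query)
# Read the response body and hand it to Nokogiri
doc = Nokogiri::HTML(open(query).read)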
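
For readers who would rather not depend on WEBrick, here is a minimal sketch of the same idea using only the standard uri library. It escapes just the query value rather than the whole URL. The BASE constant and the unihan_url helper are hypothetical names made up for this illustration; they are not part of the original script.

```ruby
# encoding: utf-8
require 'uri'

# Hypothetical constant: the lookup endpoint from the question.
BASE = 'http://www.unicode.org/cgi-bin/GetUnihanData.pl'

# Hypothetical helper: builds the lookup URL for one character.
# URI.encode_www_form_component percent-encodes the UTF-8 bytes of the value.
def unihan_url(char)
  "#{BASE}?codepoint=#{URI.encode_www_form_component(char)}"
end

url = unihan_url('儿')

# The escaped URL is plain ASCII, so URI.parse accepts it.
uri = URI.parse(url)
puts uri.query
```

The resulting string can then be passed to open in place of the raw query.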

Carlo

Subject: Open-uri with non-ascii character
Date: Sun 06 Jan 13 12:03:01PM +0900

--
Carlo E. Prelz - fluido@fluido.as
If the Way and its Virtue had not been set aside, what need would there
be to talk so much of love and righteousness? (Chuang-Tzu)

7stud2 · 6 January 2013 09:22

Thanks. Now I can open the site.
I will be able to parse it then.

soichi


Tamara_Temple1 · 6 January 2013 11:07

Interestingly enough, if you look at what gets sent through as the
link above, it's:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E5%84%BF

which is also what you'd obtain via:

URI.escape("http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿")
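
To make the escaped form less mysterious, here is a small sketch, under the assumption that the string is UTF-8, of what percent-encoding does to the character: each byte of its UTF-8 representation is written as % followed by two hex digits. The variable names are only illustrative. Note that URI.escape was later deprecated and then removed in Ruby 3.0, so the sketch sticks to calls that exist in both old and new versions.

```ruby
# encoding: utf-8
require 'uri'

char = '儿'

# Write each UTF-8 byte as %XX (uppercase hex).
encoded = char.bytes.map { |b| format('%%%02X', b) }.join

# Decoding reverses the process and yields the original character.
decoded = URI.decode_www_form_component(encoded)

puts encoded
puts decoded
```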

On Sat, Jan 5, 2013 at 9:03 PM, Soichi Ishida <lists@ruby-forum.com> wrote:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=儿
the url contains non-ascii character as a query. In this particular
case, it's Chinese.