HTML Parser: Which one is better?

ZHANG_Yin · 31 May 2007 07:28

I'm new to Ruby and need to parse some web pages. I googled "ruby HTML
parser" and found several parsers available. They all seem good, and
I'm wondering which one is best for me, since I'll have to deal with
many pages in different encodings, such as UTF-8, GB2312, and GBK (for
Chinese). Please help. Thanks.

--
Posted via http://www.ruby-forum.com/.

Dick_Davies · 31 May 2007 07:47

Hpricot is a good starting point.

--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/

Sy_Ys · 1 June 2007 01:11

I like Rubyful Soup. It's very simple to use, although constructing
the object from HTML is a bit slower than I'd like.

ZHANG_Yin · 31 May 2007 09:27

Dick Davies wrote:

> Hpricot is a good starting point.

OK, I got it. Thanks a lot.

Richard_Conroy1 · 31 May 2007 09:36

Dick Davies wrote:

> Hpricot is a good starting point.

Yeah, Hpricot is good, but in general the quality of Ruby's web scraping
options is pretty impressive. Some of them are just thin layers built on top
of Hpricot that provide an even simpler API.

Your second problem, handling alternate encodings, is a bit trickier. To do
any real work with pages in multiple encodings, you want to convert each
page to Unicode (UTF-8) at fetch time.

This isn't Ruby's strong point (which is not the same thing as saying
it can't do it), but there are several options. One is to run your code on
JRuby (Java) just for its seamless Unicode/code page support; Hpricot
has been ported to JRuby, for instance. I would also have a good look at
which Ruby libraries support explicit code page conversion.
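
As a rough illustration of converting at fetch time: the sketch below
assumes a modern Ruby (1.9 or later), where strings carry an encoding and
String#encode is built in. That was not available in the 1.8 series that
was current when this thread was written. The helper name to_utf8 and the
sample bytes are made up for the example, and it assumes you already know
the page's encoding (for instance from the Content-Type header or a meta
tag).

```ruby
# Hypothetical helper: re-label raw bytes with their real encoding,
# then transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes
    .dup                              # leave the caller's string untouched
    .force_encoding(source_encoding)  # declare what the bytes really are
    .encode("UTF-8", invalid: :replace, undef: :replace)
end

# Stand-in for a fetched page body: "中文" as GBK bytes.
gbk_body = "\xD6\xD0\xCE\xC4".b
utf8_body = to_utf8(gbk_body, "GBK")
puts utf8_body
```

The invalid: and undef: options substitute a replacement character for
bytes that cannot be converted instead of raising, which is usually what
you want when scraping pages of unknown quality. Once every page is UTF-8,
the HTML parser only ever sees one encoding.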

Erik_Hollensbe · 1 June 2007 06:15

I've had great success with this. Just make sure you're using Ruby 1.8.5 or later (which includes the NKF library) and you should be fine.

ZHANG_Yin · 2 June 2007 05:12

Thank you all for your help.

Jerry_Blanco · 2 June 2007 19:34

I've used Hpricot and really like it.

--
/(bb|[^b]{2})/ <- The question
