HTML parser Hpricot? and how to get all text

Kenneth · 29 October 2007 13:29

Would a good HTML parser be Hpricot? I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).

--
Posted via http://www.ruby-forum.com/.

mortee · 29 October 2007 14:33

SpringFlowers AutumnMoon wrote:

Would a good HTML parser be Hpricot?

It definitely is.

I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).

It looks like #inner_text removes all tags and what remains is the plain
text content. Note that it won't convert <br>'s and <p>'s to newlines -
it really just strips tags. If you want more sophisticated text results,
you should iterate over the elements, and implement your logic for
specific ones.

mortee
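
The "iterate over the elements" idea can be sketched as a recursive walk that appends a newline after chosen tags. This is only an illustration: it uses REXML (shipped with Ruby) as a stand-in for Hpricot, so it assumes well-formed markup, and the names BLOCK_TAGS and text_with_breaks are made up for the example.

```ruby
require 'rexml/document'

# Tags that should end with a line break; an assumed list, adjust to taste.
BLOCK_TAGS = %w[p br div h1 h2 li].freeze

# Returns the text below +node+, adding "\n" after each block-level element.
def text_with_breaks(node)
  case node
  when REXML::Text
    node.value
  when REXML::Element
    inner = node.children.map { |child| text_with_breaks(child) }.join
    BLOCK_TAGS.include?(node.name) ? inner + "\n" : inner
  else
    "" # comments, processing instructions, etc. contribute no text
  end
end

doc = REXML::Document.new('<div><p>hello <b>world</b></p><p>second<br/>line</p></div>')
puts text_with_breaks(doc.root)
```

The same shape should carry over to any parser that exposes a node's children; only the class checks would change.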

Thomas_Wieczorek · 29 October 2007 15:22

Would a good HTML parser be Hpricot?

It is a good and fast HTML and XML parser.

I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).

Mortee's suggestion is a quick way to do it. If you need more information
about it, take a look at http://code.whytheluckystiff.net/hpricot or ask on
Hpricot's mailing list.

Phlip1 · 29 October 2007 18:53

SpringFlowers AutumnMoon wrote:

Would a good HTML parser be Hpricot?

It's extremely good; try it and see!

I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).

.each_element( './/text()' ){}.join() might do it.

--
Phlip

Kenneth · 30 October 2007 14:47

Phlip wrote:

SpringFlowers AutumnMoon wrote:

Would a good HTML parser be Hpricot?


It's extremely good; try it and see!

  I wonder if anyone knows an easy
way for it to get all text of an HTML file? (removing all formatting
tags).

.each_element( './/text()' ){}.join() might do it.

anyone knows where to go from:

require 'hpricot'
doc = Hpricot("hello world")

and what can i do to get "hello world"?

in
http://code.whytheluckystiff.net/hpricot/wiki/HpricotChallenge#StripallHTMLtags
it says just use

str=doc.to_s
print str.gsub(/<\/?[^>]*>/, "")

but can't the < > be nested in some HTML code? If it is nested then
the above won't work, it seems.
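
A small check of that worry, using only plain Ruby and the regex quoted above. Ordinary nested tags are handled, but a ">" inside a quoted attribute value ends the match early. Both sample strings and the strip_tags helper are invented for this sketch.

```ruby
# The tag-stripping pattern quoted from the Hpricot wiki page.
TAG = /<\/?[^>]*>/

def strip_tags(html)
  html.gsub(TAG, "")
end

nested  = '<h1>hello <b>world</b></h1>'   # tags nested inside tags
awkward = '<a title="a > b">link</a>'     # a ">" inside an attribute value

puts strip_tags(nested)    # prints: hello world
puts strip_tags(awkward)   # prints:  b">link  (attribute debris is left behind)
```

So the regex is fine for simple markup, while quoted ">" characters are one case where a real parser is the safer choice.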

Kenneth · 30 October 2007 14:50

by the way

require 'hpricot'

doc = Hpricot("hello world")

p doc.search("").inner_text

won't work... i am not sure if it is the Win installer of Ruby... but it
is the most recent Win installer.

it says

scraper2.rb:6: undefined method `inner_text' for
#<Hpricot::Elements:0x348dbc4>
(NoMethodError)

and doc.to_plain_text() won't work either...

7stud · 30 October 2007 20:33

SpringFlowers AutumnMoon wrote:

in
http://code.whytheluckystiff.net/hpricot/wiki/HpricotChallenge#StripallHTMLtags
it says just use

str=doc.to_s
print str.gsub(/<\/?[^>]*>/, "")

but can't the < > be nested in some HTML code? If it is nested then
the above won't work, it seems.

What do you mean by nested? I would consider your example as containing
nested tags:

hello world"

and the regex removes all the tags from that string. html can look like
this:

<h2hel<b></h2llo<h1<b>>worl</h1>

What do you want to do with that string?

mortee · 30 October 2007 20:13

SpringFlowers AutumnMoon wrote:

by the way

require 'hpricot'

doc = Hpricot("hello world")

p doc.search("").inner_text

won't work... i am not sure if it is the Win installer of Ruby... but it
is the most recent Win installer.

it says

scraper2.rb:6: undefined method `inner_text' for
#<Hpricot::Elements:0x348dbc4>
(NoMethodError)

and doc.to_plain_text() won't work either...

$ uname -s
CYGWIN_NT-5.1
$ gem list hpricot

*** LOCAL GEMS ***

hpricot (0.6, 0.5)
a swift, liberal HTML parser with a fantastic library
$ irb
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"

-------------------------------------------------------------------

C:\>systeminfo
...
OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 2 Build 2600
...
C:\>gem list hpricot

*** LOCAL GEMS ***

hpricot (0.6, 0.5, 0.4)
a swift, liberal HTML parser with a fantastic library

C:\>irb
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"

mortee

Kenneth · 31 October 2007 07:21

I just wonder whether there could be cases, for example inside a style
attribute or a single- or double-quoted attribute value, where a < or >
appears inside an opening tag. It's hard to say.

Also, stripping the tags won't remove the CSS or JavaScript content
either.

On 10/30/07, 7stud -- <bbxx789_05ss@yahoo.com> wrote:

What do you mean by nested? I would consider your example as containing
nested tags:

hello world"

and the regex removes all the tags from that string. html can look like
this:

<h2hel<b></h2llo<h1<b>>worl</h1>

Kenneth · 31 October 2007 07:18

yup, mine is

C:\>gem list hpricot

*** LOCAL GEMS ***

hpricot (0.4)
a swift, liberal HTML parser with a fantastic library

and d.inner_text or d.text both won't work.

On 10/30/07, mortee <mortee.lists@kavemalna.hu> wrote:

C:\>gem list hpricot

*** LOCAL GEMS ***

hpricot (0.6, 0.5, 0.4)
a swift, liberal HTML parser with a fantastic library

C:\>irb
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"

mortee

mortee · 31 October 2007 16:03

kendear wrote:

On 10/30/07, mortee <mortee.lists@kavemalna.hu> wrote:

C:\>gem list hpricot

*** LOCAL GEMS ***

hpricot (0.6, 0.5, 0.4)
a swift, liberal HTML parser with a fantastic library

C:\>irb
irb(main):001:0> require 'hpricot'
=> true
irb(main):002:0> d = Hpricot("hello world")
=> #<Hpricot::Doc {elem "hello " {elem "world" } }>
irb(main):003:0> d.inner_text
=> "hello world"

mortee

yup, mine is

C:\>gem list hpricot

*** LOCAL GEMS ***

hpricot (0.4)
a swift, liberal HTML parser with a fantastic library

and d.inner_text or d.text both won't work.

Does something prevent you from upgrading?

mortee

Kenneth · 3 November 2007 07:21

mortee wrote:

kendear wrote:

irb(main):001:0> require 'hpricot'

yup, mine is

C:\>gem list hpricot

*** LOCAL GEMS ***

hpricot (0.4)
a swift, liberal HTML parser with a fantastic library

and d.inner_text or d.text both won't work.

Does something prevent you from upgrading?

I finally got the time to upgrade to Hpricot 0.6,
so now the following

require 'net/http'
require 'hpricot'

r = ""

Net::HTTP.start("www.google.com") do |http|
r = http.get("/")
end

c = Hpricot(r.body)
p c.to_plain_text

will work, and so will

p c.inner_text

as the last line. however, the CSS and Javascript lines are not
removed. So I think I can gsub the CSS and Javascript blocks with the
multiline regexp gsub.

I wonder though if there is a quick way, that will do what the lynx on
UNIX does... just print out a plain and readable text page.

Kenneth · 3 November 2007 08:10

however, the CSS and Javascript lines are not
removed. So I think I can gsub the CSS and Javascript blocks with the
multiline regexp gsub.

I wonder though if there is a quick way, that will do what the lynx on
UNIX does... just print out a plain and readable text page.

I got it working as far as:

require 'open-uri'
require 'hpricot'

c = open('http://www.google.com').read

c.gsub!(/<style.*?<\/style.*?>/m, " ")
c.gsub!(/<script.*?<\/script.*?>/m, " ")

c.gsub!(/<(span|tr|td| ).*?>/, " ")
c.gsub!(/<(br|p|div|table).*?>/, "\n")

d = Hpricot(c).inner_text
d.gsub!(/\s+/, " ")
d.gsub!(/\n+/, "\n")

print d

but it is not so pretty, and it does not filter out non-printable
characters either.
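
One detail in the script above: the /\s+/ substitution also matches newlines, so every line break is already gone by the time the /\n+/ substitution runs. A rough sketch of the same pipeline that keeps line breaks and drops control characters follows. It is regex-only (no Hpricot, so entities such as &amp; are not decoded), and the helper name html_to_text and the tag lists are assumptions made for the example.

```ruby
# Tags treated as line breaks or as spaces; both lists are assumed, extend as needed.
BLOCK_TAG = /<\/?(?:br|p|div|table|tr|h[1-6]|li)\b[^>]*>/i
CELL_TAG  = /<\/?(?:td|th|span)\b[^>]*>/i

def html_to_text(html)
  text = html.dup
  # Drop <style> and <script> blocks together with their contents.
  text.gsub!(/<(style|script)\b.*?<\/\1\s*>/mi, " ")
  # Block-level tags become newlines, cell-like tags become spaces.
  text.gsub!(BLOCK_TAG, "\n")
  text.gsub!(CELL_TAG, " ")
  # Strip whatever tags are left.
  text.gsub!(/<[^>]*>/, "")
  # Remove control characters except tab, newline and carriage return.
  text.gsub!(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/, "")
  # Collapse whitespace other than newlines, then tidy up around newlines.
  text.gsub!(/[^\S\n]+/, " ")
  text.gsub!(/ ?\n ?/, "\n")
  text.gsub!(/\n{2,}/, "\n")
  text.strip
end

page = "<html><head><style>p { color: red }</style>" \
       "<script>var a = 1 < 2;</script></head>" \
       "<body><p>hello   <b>world</b></p><div>second\tline</div></body></html>"
puts html_to_text(page)
```

This is still far from what lynx does (no entity decoding, no list or table layout), but it yields one readable line per block element.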
