Is there a link extractor or similar HTML processing lib for Ruby?

Desireco · 7 March 2006 17:23

Hi,

In cool Perl there are a bunch of libraries that process HTML files and
help you when you need to extract info. I remember hearing about something
similar for Ruby as well; if anyone has experience with this, it would help me
if they could point me in the right direction. Basically, I need to extract
links and info from HTML pages.

Thanks.

Zeljko Dakic

Ross_Bamford4 · 7 March 2006 17:35

Maybe try:

Rubyful Soup: "The brush has got entangled in it!"

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Marcin_Mielzynski · 7 March 2006 17:38

You meant something like this ? (quite dirty but works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

lopex

W_James · 8 March 2006 01:48

class String
  def xtag(s)
    result = []
    scan( %r!
              < #{s} (?: \s+ ( [^>]* ) )? / >
              |
              < #{s} (?: \s+ ( [^>]* ) )? >
              ( .*? ) </ #{s} >
          !mix ) \
      { |unpaired, attr, data| h = { }
        ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                   (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }x ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
        block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
      }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
  <a
  href = "alert('Junior broke it!')" >foo bar</a>
  <a
  href = www.foo.bar >foo bar
  </a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
  <a target="_blank" href="/support?hl=en">Help</a> |

Gregor_Kopp · 8 March 2006 09:03

gem install mechanize

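require 'mechanize'

# Mechanize 1.0 and later expose the class as Mechanize;
# the releases current in 2006 used WWW::Mechanize.
browser = Mechanize.new
url = "http://www.eineseite.de"
page = browser.get(url)
page.links.each do |link|
  # Assumes every href is relative to the site root.
  puts "#{url}#{link.href}"
end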

Also take a look at html tokenizer from gems.
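Gluing the page URL and the href together as above only gives a usable address when the href is relative to the site root. A minimal sketch of a more forgiving approach, using only `URI.join` from Ruby's standard library; the helper name `absolutize` and the sample hrefs (borrowed from elsewhere in this thread) are illustrative assumptions, not part of Mechanize:

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the page URL,
# roughly the way a browser would.
def absolutize(base, hrefs)
  hrefs.map { |href| URI.join(base, href).to_s }
end

base  = "http://www.eineseite.de/"
# Sample hrefs: root-relative, dot-relative, and already absolute.
hrefs = ["/support?hl=en", "./special/a.html", "http://www.dakic.com/"]

puts absolutize(base, hrefs)
```

Hrefs that are not valid URIs (for example, ones containing spaces) make `URI.join` raise `URI::InvalidURIError`, so real pages probably need a `rescue` around the call.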

James_Edward_Gray_II · 7 March 2006 17:46

No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to do it.
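A minimal sketch of that failure, using the one-liner's pattern unchanged. The two sample strings are made up for illustration:

```ruby
# The pattern from the one-liner earlier in the thread.
LINK_RE = /<a href="?(.+?)"?>/

SAME_LINE  = '<a href="whatever">text</a>'
SPLIT_LINE = "<a\nhref=\"whatever\">text</a>"  # newline where the pattern expects a space

p SAME_LINE.scan(LINK_RE)   # => [["whatever"]]
p SPLIT_LINE.scan(LINK_RE)  # => []
```

The pattern hard-codes a single space between `<a` and `href`, so any other whitespace, or any attribute placed before `href`, makes the link silently disappear from the results.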

James Edward Gray II

···

On Mar 7, 2006, at 11:38 AM, Marcin Mielżyński wrote:

You meant something like this ? (quite dirty but works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

Desireco · 7 March 2006 18:33

Thank you guys. RubyfulSoup looks like what I am after.

Zeljko

Gregor_Kopp · 8 March 2006 09:08

Gregor Kopp wrote:

Also take a look at html tokenizer from gems.

or do a gem search html

James_Edward_Gray_II · 8 March 2006 14:01

<a href="if (my_var > 5) { whatever() }">Javascript Link</a>

James Edward Gray II

Marcin_Mielzynski · 7 March 2006 18:58

James Edward Gray II wrote:

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to do it.

James Edward Gray II

Yep, I realized that after seeing xerces sources

lopex

Bill_Kelly · 7 March 2006 19:36

>
> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to do it.

Hi, not trying to be argumentative, just surprised. I thought parsing HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.

Since there are no recursive structures (that I know of) in the syntax for
an open or closing tag, it seemed reasonably well suited to regexps to me.

. . . . Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago, and it's
19 lines of regexp. Admittedly it's a very clean 19 lines, but still,
lengthier than I remembered....
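For readers wondering what lexing HTML into tokens with regexps can look like, here is a rough sketch using `StringScanner` from Ruby's standard library. It is not the 19-line lexer mentioned above; the `TAG` pattern and the `lex_html` name are assumptions made for illustration, and comments, CDATA sections, and script bodies are deliberately ignored:

```ruby
require 'strscan'

# An opening or closing tag. Quoted attribute values are consumed whole,
# so a ">" inside quotes does not end the tag.
TAG = /<\/?[a-zA-Z](?:"[^"]*"|'[^']*'|[^'">])*>/

# Hypothetical helper: split HTML into [:tag, string] and [:text, string] pairs.
def lex_html(html)
  tokens  = []
  scanner = StringScanner.new(html)
  until scanner.eos?
    if (tag = scanner.scan(TAG))
      tokens << [:tag, tag]
    elsif (text = scanner.scan(/[^<]+/))
      tokens << [:text, text]
    else
      # A "<" that does not start a tag is passed through as text.
      tokens << [:text, scanner.getch]
    end
  end
  tokens
end

lex_html('<a href="if (x > 5) { f() }">Link</a>').each do |kind, text|
  puts "#{kind}: #{text}"
end
```

This only produces a flat token stream. Working out which tags nest inside which, and coping with unclosed or misnested tags, is the part that a real parser handles and a lexer does not.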

Regards,

Bill

···

From: "James Edward Gray II" <james@grayproductions.net>

Pistos_Christou1 · 7 March 2006 20:03

James Gray wrote:

Parsing HTML is hard and you don't want to use regular expressions to
do it.

Rubyful Soup looks great! I'm going to give it a whirl. And I've been
doing it the "hard and you don't want to use regexp" way all this time!
Relatively successfully, mind you, but this looks even better.

Gentoo users: I made some renegade ebuilds for Rubyful Soup:

http://www.ebuildexchange.org/catview.php?sh_cat_f=dev-ruby

Pistos

···

--
Posted via http://www.ruby-forum.com/.

W_James · 10 March 2006 08:33

James Edward Gray II wrote:

<a href="if (my_var > 5) { whatever() }">Javascript Link</a>

class String
  def xtag(str)
    result = []; re =
     %r{ < #{str} (?: \s+ ( (?> [^>"/]* (?> "[^"]*" )? )* ) )? }xi
    scan( %r{ #{re} / > | #{re} > ( .*? ) </ #{str} >
            }mix ) \
      { |unpaired, attr, data| h = { }
        ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                   (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }mx ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
        block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
      }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts "-"*9; p atr['href']; puts txt }

__END__
  <a
  href = "alert('Junior broke it!')" >foo bar</a>
  <a
  href = www.foo.bar >foo bar
  </a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF="./special/a.html">A</A>, with the attribute HREF.
  <a target="_blank" href="/support?hl=en">Help</a> |
<a href="if (my_var > 5) { whatever() }">Javascript Link</a>
<a name = "foo-bar"
  href = "if (foo_bar > 14)
    { fluct() }"
>Javascript "circumlocutory" Link</a>

James_Edward_Gray_II · 7 March 2006 20:06

There's a lot of pretty darn ugly HTML out there, my friend. Here's a semi-paranoid attempt to grab just the start of an anchor tag:

/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

Am I getting close yet? No, the quotes are all wrong. That would fail to match an extremely common link like:

I would try to fix that, but my brain has already melted and leaked out my ear. I'm sure I made other mistakes too.

If you want to capture the name of the link too, this gets *much* worse!

James Edward Gray II

···

On Mar 7, 2006, at 1:36 PM, Bill Kelly wrote:

From: "James Edward Gray II" <james@grayproductions.net>

>
> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)
No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:
<a
href="whatever">
Parsing HTML is hard and you don't want to use regular expressions to do it.

Hi, not trying to be argumentative, just surprised. I thought parsing HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.

Christian_Neukirche1 · 8 March 2006 17:36

"Bill Kelly" <billk@cts.com> writes:

From: "James Edward Gray II" <james@grayproductions.net>

>
> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)
No, it doesn't, trust me. Toss a simple "\n" in there and
you're sunk:
<a
href="whatever">
Parsing HTML is hard and you don't want to use regular expressions
to do it.

Hi, not trying to be argumentative, just surprised. I thought parsing
HTML with regexps was pretty easy. Well, lexing HTML into tokens, I
mean.

Lex, yes. Scrape in general, no.

(And those who think that's BS, please have a look at REXML.)

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Pistos_Christou1 · 7 March 2006 20:12

James Gray wrote:

Am I getting close yet? No, the quotes are all wrong. That would
fail to match an extremely common link like:

If you want to capture the name of the link too, this gets *much* worse!

I see what you're getting at: If you're trying to do
generally-applicable parsing, I suppose you're headed for a world of
hurt. But all I've ever done is page- or site-specific scraping, and
never really considered it a big deal. A few regexps here, a few .scans
there, and you're done...

Pistos

James_Edward_Gray_II · 7 March 2006 20:17

Or you can load RubyfulSoup and call find() a few times. About the same effort, but a *lot* safer, eh?

James Edward Gray II

···

On Mar 7, 2006, at 2:12 PM, Pistos Christou wrote:

A few regexps here, a few .scans there, and you're done...
