Is there a link extractor or similar HTML processing library for Ruby?

Hi,

In cool Perl there are a bunch of libraries that process HTML files and
help you when you need to extract info. I remember hearing of something
similar for Ruby as well. If someone has experience with this, it would
help me if they could point me in the right direction. Basically I need
to extract links and info from HTML pages.

Thanks.

Zeljko Dakic http://www.dakic.com

Maybe try:

  Rubyful Soup: "The brush has got entangled in it!"

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

You meant something like this ? (quite dirty but works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

lopex

# Monkey-patches String with a regexp-based tag scanner.
class String
  # Returns [attributes_hash, inner_text] pairs for every <s> tag,
  # or yields each pair when a block is given.
  def xtag(s)
    result = []
    scan( %r!
              < #{s} (?: \s+ ( [^>]* ) )? / >
              |
              < #{s} (?: \s+ ( [^>]* ) )? >
              ( .*? ) </ #{s} >
          !mix ) \
      { |unpaired, attr, data| h = { }
        ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                   (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }x ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
        block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
      }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
  <a
  href = "alert('Junior broke it!')" >foo bar</a>
  <a
  href = www.foo.bar >foo bar
  </a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
  <a target="_blank" href="/support?hl=en">Help</a> |

gem install mechanize

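require 'mechanize'

# Mechanize 1.0 and later name the class Mechanize;
# the 2006 releases called it WWW::Mechanize.
browser = Mechanize.new
url = "http://www.eineseite.de"
page = browser.get url
page.links.each do |link|
  next unless link.href   # skip anchors that have no href
  # Resolve relative hrefs against the page URL rather than
  # gluing strings together, which breaks absolute links.
  puts page.uri.merge(link.href)
end

Also take a look at html tokenizer from gems.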

On Mar 7, 2006, at 11:38 AM, Marcin Mielżyński wrote:

> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn't, trust me. ;) Toss a simple "\n" in there and you're sunk:

<a
  href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to do it.

James Edward Gray II
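
The newline failure is easy to reproduce without a file. This is only a minimal sketch: the constant name and the two sample strings are made up for illustration, and the pattern is the one-liner quoted above.

```ruby
# The one-line pattern from the thread, unchanged.
LINK_RE = /<a href="?(.+?)"?>/

# Two hypothetical inputs: the same anchor written on one line and on two.
one_line   = '<a href="whatever">text</a>'
multi_line = "<a\n  href=\"whatever\">text</a>"

one_line_hits   = one_line.scan(LINK_RE).flatten
multi_line_hits = multi_line.scan(LINK_RE).flatten

p one_line_hits    # the href is found
p multi_line_hits  # nothing is found: the pattern insists on "<a href" with one literal space
```

The pattern requires exactly one space between "a" and "href", so any other whitespace, attribute order, or capitalisation slips past it.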

Thank you guys. RubyfulSoup looks like what I am after.

Zeljko

Gregor Kopp wrote:

> Also take a look at html tokenizer from gems.

Or do a gem search html ;)

On Mar 7, 2006, at 7:48 PM, William James wrote:

> DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

<a href="if (my_var > 5) { whatever() }">Javascript Link</a>

James Edward Gray II
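
Why that one line is a problem can be seen with a cut-down version of the attribute pattern. This is a sketch under assumptions: NAIVE_TAG is a simplification of the "[^>]*" idea rather than the full xtag regexp, and QUOTE_AWARE_TAG is a hypothetical refinement that is not taken from any post.

```ruby
# Simplified attribute pattern: everything up to the first ">".
NAIVE_TAG = /<a\s+([^>]*)>/i

# Hypothetical refinement: a double-quoted value is consumed as one unit,
# so a ">" inside the quotes no longer ends the tag.
QUOTE_AWARE_TAG = /<a\s+((?:"[^"]*"|[^>"])*)>/i

html = '<a href="if (my_var > 5) { whatever() }">Javascript Link</a>'

naive = html[NAIVE_TAG, 1]        # cut off at the ">" inside the attribute
aware = html[QUOTE_AWARE_TAG, 1]  # the whole attribute string

p naive
p aware
```

Even the refined pattern only covers double quotes. Single quotes, comments and script blocks would each need more cases.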

James Edward Gray II wrote:

>> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)
>
> No, it doesn't, trust me. ;) Toss a simple "\n" in there and you're sunk:
>
> <a
> href="whatever">
>
> Parsing HTML is hard and you don't want to use regular expressions to do it.

Yep, I realized that after seeing the Xerces sources :D

lopex

>
> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

From: "James Edward Gray II" <james@grayproductions.net>

> No, it doesn't, trust me. ;) Toss a simple "\n" in there and you're sunk:
>
> <a
>   href="whatever">
>
> Parsing HTML is hard and you don't want to use regular expressions to do it.

Hi, not trying to be argumentative, just surprised. I thought parsing HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.

Since there are no recursive structures (that I know of) in the syntax for
an opening or closing tag, it seemed reasonably well suited to regexps to me.

. . . . Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago, and it's 19 lines of regexp. Admittedly it's a very clean 19 lines, but still,
lengthier than I remembered.... :)

Regards,

Bill
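
To make the "lexing, not parsing" idea concrete, here is a rough sketch of a regexp-driven tokenizer. It is not Bill's 19-line lexer, which was never posted. The names TAG_RE and lex_html are invented for this illustration, and comments, script blocks and doctype declarations are deliberately ignored.

```ruby
require 'strscan'

# One tag: "<", optional "/", a letter, then any mix of quoted strings and
# other characters up to the closing ">". Quoted values may contain ">".
TAG_RE = /<\/?[A-Za-z](?:"[^"]*"|'[^']*'|[^>"'])*>/

# Hypothetical helper: split HTML into [:tag, string] and [:text, string] tokens.
def lex_html(html)
  tokens  = []
  scanner = StringScanner.new(html)
  until scanner.eos?
    if (tag = scanner.scan(TAG_RE))
      tokens << [:tag, tag]
    elsif (text = scanner.scan(/[^<]+/))
      tokens << [:text, text]
    else
      tokens << [:text, scanner.getch]  # a stray "<" that starts no tag
    end
  end
  tokens
end

tokens = lex_html(%q{<p>See <a href="x > y">hi</a></p>})
tokens.each { |kind, str| puts "#{kind}: #{str}" }
```

This only produces a flat token stream. Matching opening tags to closing tags, and coping with tags that are never closed, is the part that needs a real parser.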

James Gray wrote:

> Parsing HTML is hard and you don't want to use regular expressions to
> do it.

Rubyful Soup looks great! I'm going to give it a whirl. And I've been
doing it the "hard and you don't want to use regexp" way all this time!
:slight_smile: Relatively successfully, mind you, but this looks even better.

Gentoo users: I made some renegade ebuilds for Rubyful Soup:

http://www.ebuildexchange.org/catview.php?sh_cat_f=dev-ruby

Pistos

--
Posted via http://www.ruby-forum.com/.

James Edward Gray II wrote:

>> DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }
>
> <a href="if (my_var > 5) { whatever() }">Javascript Link</a>

class String
  def xtag(str)
    result = []; re =
     %r{ < #{str} (?: \s+ ( (?> [^>"/]* (?> "[^"]*" )? )* ) )? }xi
    scan( %r{ #{re} / > | #{re} > ( .*? ) </ #{str} >
            }mix ) \
      { |unpaired, attr, data| h = { }
        ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                   (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }mx ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
        block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
      }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts "-"*9; p atr['href']; puts txt }

__END__
  <a
  href = "alert('Junior broke it!')" >foo bar</a>
  <a
  href = www.foo.bar >foo bar
  </a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF="./special/a.html">A</A>, with the attribute HREF.
  <a target="_blank" href="/support?hl=en">Help</a> |
<a href="if (my_var > 5) { whatever() }">Javascript Link</a>
<a name = "foo-bar"
  href = "if (foo_bar > 14)
    { fluct() }"
  >Javascript "circumlocutory" Link</a>

On Mar 7, 2006, at 1:36 PM, Bill Kelly wrote:

> Hi, not trying to be argumentative, just surprised. I thought parsing HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.

There's a lot of pretty darn ugly HTML out there, my friend. Here's a semi-paranoid attempt to grab just the start of an anchor tag:

/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

Am I getting close yet? No, the quotes are all wrong. That would fail to match an extremely common link like:

<a href="alert('You broke it!')">

I would try to fix that, but my brain has already melted and leaked out my ear. :) I'm sure I made other mistakes too.

If you want to capture the name of the link too, this gets *much* worse!

James Edward Gray II
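
Running the semi-paranoid pattern against that tag suggests a slightly different failure from the one described. In this sketch ANCHOR_RE is the pattern as posted, while HREF_RE is a hypothetical variant with one extra capture group around the value, added only to show what would be extracted.

```ruby
# The pattern exactly as posted.
ANCHOR_RE = /<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

# Hypothetical variant: same pattern with a capture group around the href value.
HREF_RE = /<\s*a[^>]*?href\s*=\s*(['"]?)([^'"]*)\1?[^>]*>/i

tag = %q{<a href="alert('You broke it!')">}

matched = tag[ANCHOR_RE]   # the trailing [^>]* soaks up the rest of the tag
href    = tag[HREF_RE, 2]  # but the value stops at the first inner quote

p matched
p href
```

So the tag itself is still found, because the trailing [^>]* absorbs everything after the first inner quote. What breaks is the extracted value, which ends at that quote. Either way the quote handling is wrong, as the post says.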

"Bill Kelly" <billk@cts.com> writes:

> Hi, not trying to be argumentative, just surprised. I thought parsing
> HTML with regexps was pretty easy. Well, lexing HTML into tokens, I
> mean.

Lex, yes. Scrape in general, no.

(And those who think that's BS, please have a look at REXML.)

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

James Gray wrote:

> Am I getting close yet? No, the quotes are all wrong. That would
> fail to match an extremely common link like:
>
> If you want to capture the name of the link too, this gets *much* worse!

I see what you're getting at: If you're trying to do
generally-applicable parsing, I suppose you're headed for a world of
hurt. But all I've ever done is page- or site-specific scraping, and
never really considered it a big deal. A few regexps here, a few .scans
there, and you're done...

Pistos
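
For what it's worth, site-specific scraping of this kind might look like the sketch below. The HTML fragment, its layout and the variable names are all invented for illustration. The approach relies entirely on the page keeping this exact markup.

```ruby
# Hypothetical page fragment with a known, regular layout.
html = <<~HTML
  <ul class="downloads">
    <li><a href="/files/a.tgz">a.tgz</a> (12 KB)</li>
    <li><a href="/files/b.tgz">b.tgz</a> (340 KB)</li>
  </ul>
HTML

# One scan tailored to that layout: href, link text, size in KB.
rows = html.scan(%r{<li><a href="([^"]+)">([^<]+)</a> \((\d+) KB\)</li>})

rows.each { |href, name, kb| puts "#{name} (#{kb} KB) at #{href}" }
```

If the site adds an attribute or reflows the markup, the scan silently returns fewer rows, which is the trade-off being discussed here.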

On Mar 7, 2006, at 2:12 PM, Pistos Christou wrote:

> A few regexps here, a few .scans there, and you're done...

Or you can load RubyfulSoup and call find() a few times. About the same effort, but a *lot* safer, eh? ;)

James Edward Gray II