No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:
<a
href="whatever">
Parsing HTML is hard and you don't want to use regular expressions to do it.
James Edward Gray II
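The newline failure is easy to reproduce. A minimal sketch, applying the pattern from the one-liner in this thread to two strings instead of to some.html (the constant and variable names are made up for illustration):

```ruby
# The pattern from the one-liner under discussion.
LINK_RE = /<a href="?(.+?)"?>/

one_line  = '<a href="whatever">'
two_lines = "<a\nhref=\"whatever\">"   # same tag, with a newline after "<a"

p one_line.scan(LINK_RE)    # => [["whatever"]]
p two_lines.scan(LINK_RE)   # => []
```

The pattern insists on a literal space between "<a" and "href", so any other whitespace there is enough to lose the link.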
···
On Mar 7, 2006, at 11:38 AM, Marcin Mielżyński wrote:
Desireco wrote:
Hi,
in cool Perl there are a bunch of libraries that process html files and
help you when you need to extract info. I remember hearing something
for Ruby as well, if someone had experience with this, it would help me
if he could point me in the right direction. Basically I need to extract
links and info from html pages.
Thanks. Zeljko Dakic http://www.dakic.com
> You meant something like this? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)
No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:
<a
href="whatever">
Parsing HTML is hard and you don't want to use regular expressions to do it.
Hi, not trying to be argumentative, just surprised. I thought parsing HTML with regexps was pretty easy. Well, lexing HTML into tokens, I mean.
Since there are no recursive structures (that I know of) in the syntax for
an open or closing tag, it seemed reasonably well suited to regexps to me.
. . . . Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago, and it's 19 lines of regexp. Admittedly it's a very clean 19 lines, but still,
lengthier than I remembered....
Regards,
Bill
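For a rough idea of what lexing with regexps can look like, here is a small sketch. It is not Bill's 19-line lexer, which isn't shown in the thread. TOKEN_RE and lex_html are names invented here, and the pattern only knows about comments, tags and text runs (no doctype, CDATA or script handling), so anything else gets mis-tokenized or skipped.

```ruby
# Hypothetical token pattern: a comment, a tag, or a run of text.
# Quoted attribute values inside a tag are allowed to contain ">".
TOKEN_RE = /
  (?<comment> <!--.*?--> )
| (?<tag>     <\/?[a-zA-Z](?:"[^"]*"|'[^']*'|[^'">])*> )
| (?<text>    [^<]+ )
/mx

# Returns an array of [kind, source_text] pairs.
def lex_html(html)
  tokens = []
  html.scan(TOKEN_RE) do
    m = Regexp.last_match
    kind = %i[comment tag text].find { |k| m[k] }
    tokens << [kind, m[0]]
  end
  tokens
end

lex_html("<p>Hi <a\nhref=\"x>y\">there</a></p>").each { |token| p token }
```

This only splits the input into tokens. Matching open tags to close tags, or coping with markup that breaks these rules, is a separate problem.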
···
From: "James Edward Gray II" <james@grayproductions.net>
Parsing HTML is hard and you don't want to use regular expressions to
do it.
Rubyful Soup looks great! I'm going to give it a whirl. And I've been
doing it the "hard and you don't want to use regexp" way all this time! Relatively successfully, mind you, but this looks even better.
Gentoo users: I made some renegade ebuilds for Rubyful Soup:
There's a lot of pretty darn ugly HTML out there, my friend. Here's a semi-paranoid attempt to grab just the start of an anchor tag:
/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i
Am I getting close yet? No, the quotes are all wrong. That would fail to match an extremely common link like:
<a href="alert('You broke it!')">
I would try to fix that, but my brain has already melted and leaked out my ear. I'm sure I made other mistakes too.
If you want to capture the name of the link too, this gets *much* worse!
James Edward Gray II
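To see where the quoting goes wrong, here is a rough check of that pattern against the tag with nested quotes. VALUE_RE is a hypothetical variant, not part of the original message: it is the same pattern with one extra group around the attribute value. Run this way, the pattern still seems to match the tag as a whole, since the optional \1? and the trailing [^>]* absorb the remainder, but the value it isolates stops at the first inner quote:

```ruby
# The semi-paranoid pattern from the message, unchanged.
ANCHOR_RE = /<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

# Hypothetical variant for illustration: a second group captures the value.
VALUE_RE = /<\s*a[^>]*?href\s*=\s*(['"]?)([^'"]*)\1?[^>]*>/i

plain  = %q{<a href="whatever">}
nested = %q{<a href="alert('You broke it!')">}

p ANCHOR_RE.match?(nested)   # => true
p plain[VALUE_RE, 2]         # => "whatever"
p nested[VALUE_RE, 2]        # => "alert("
```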
···
On Mar 7, 2006, at 1:36 PM, Bill Kelly wrote:
From: "James Edward Gray II" <james@grayproductions.net>
>
> You meant something like this ? (quite dirty but works)
>
> puts open("some.html").read.scan(/<a href="?(.+?)"?>/)
No, it doesn't, trust me. Toss a simple "\n" in there and
you're sunk:
<a
href="whatever">
Parsing HTML is hard and you don't want to use regular expressions
to do it.
Hi, not trying to be argumentative, just surprised. I thought parsing
HTML with regexps was pretty easy. Well, lexing HTML into tokens, I
mean.
Lex, yes. Scrape in general, no.
(And those who think that's BS, please have a look at REXML.)
Am I getting close yet? No, the quotes are all wrong. That would
fail to match an extremely common link like:
If you want to capture the name of the link too, this gets *much* worse!
I see what you're getting at: If you're trying to do
generally-applicable parsing, I suppose you're headed for a world of
hurt. But all I've ever done is page- or site-specific scraping, and
never really considered it a big deal. A few regexps here, a few .scans
there, and you're done...
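A site-specific scrape of the kind described here might look like the following sketch. The page fragment, the "story" class and the URLs are all invented for illustration. The pattern is tied to one page's exact markup, so it stops working as soon as that markup changes.

```ruby
# Hypothetical page fragment; a real scrape would read it from a file or URL.
page = <<~HTML
  <ul>
    <li class="story"><a href="/news/1">First headline</a></li>
    <li class="story"><a href="/news/2">Second headline</a></li>
  </ul>
HTML

# One scan, written against this page's exact markup.
stories = page.scan(%r{<li class="story"><a href="([^"]+)">([^<]+)</a></li>})

stories.each { |href, title| puts "#{title}: #{href}" }
```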