I want to scrape something off a webpage. It has massive content and I
want to find the thing that comes after the first occurrence of "a
href=" after the occurrence of "id=xxx". So:
Assuming that you have the following in the web page "a0.html":
…
<id=xxx>
…
<a href=???>
…
you can run the following script:
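# Read the whole page into one string. (In current Ruby, Array#to_s no
# longer joins the lines, so IO.readlines(...).to_s would not work here.)
my_page = File.read("a0.html")
r1 = /<id=xxx>/              # the marker to search after
r2 = /(?=<a href=[^>]+>)/    # zero-width lookahead: split before each link
r3 = /<a href=([^>]+)>/      # captures the href value
text = my_page.split(r1)
text2 = text[1..-1].join.split(r2)[1]
ref = r3.match(text2)
puts 'the first link was : ' + ref[1]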
I read the entire page into a string, my_page,
split it into an array at regexp r1 (the delimiter is dropped), and join
everything after the first occurrence back into a string.
I then split that string into an array using regexp r2, which keeps the
delimiter (of the form <a href=[^>]+>) at the start of each piece rather than
dropping it as the first split does; that is what the (?= ... ) lookahead syntax is for.
If there is text before the first occurrence of r3,
you'll find it in the first element of the split string,
text[1..-1].join.split(r2)[0],
and the first occurrence of r3 is in the second element,
text2.
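The same steps can be tried on an inline string instead of a file, which makes the effect of the lookahead easier to see. This is only a sketch: the sample HTML and the helper name first_href_after are made up for illustration, and it passes a limit of 2 to split rather than joining the pieces back together.

```ruby
# Hypothetical sample standing in for the contents of a0.html.
SAMPLE = "<p>intro</p><a href=skip.html>x</a><id=xxx><p>text</p>" \
         "<a href=first.html>one</a><a href=second.html>two</a>"

# Returns the href of the first link after the marker, or nil if there is none.
# (first_href_after is an illustrative name, not part of the script above.)
def first_href_after(page, marker = /<id=xxx>/)
  # Limit 2: split only at the first marker; the marker itself is dropped.
  _before, rest = page.split(marker, 2)
  return nil if rest.nil?

  # A lookahead matches a position without consuming text, so every piece
  # that follows a split point still starts with its own "<a href=...>".
  pieces = rest.split(/(?=<a href=[^>]+>)/)
  link = pieces.find { |piece| piece.start_with?("<a href=") }
  link && link[/<a href=([^>]+)>/, 1]
end

puts first_href_after(SAMPLE)   # prints "first.html"
```

Looking for the first piece that starts with "<a href=" avoids relying on a fixed index, which matters when the link comes immediately after the marker.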
If you want more information about Regexps, you might
find this helpful:
I want to scrape something off a webpage. It has massive content and I
want to find the thing that comes after the first occurrence of "a
href=" after the occurrence of "id=xxx". So:
...
<id=xxx>
...
<a href=???>
...
How can I do this?
Thank you!
Jim,
Is the "???" the thing you want to capture? If so, the following should do the trick:
if page.body =~ /<id=xxx>.+?<a href=['"]?([^'"\s>]*)/m
  capture = $1
end
If this seems somewhat perlish, it's because a perlmonger taught me this line.
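For anyone who wants to try the pattern without fetching a page, here is a small self-contained sketch. The BODY string stands in for page.body, and the names LINK_AFTER_ID and extract_link are invented for this example; it uses MatchData instead of $1, which is a matter of taste.

```ruby
# Hypothetical page body; in the snippet above this would be page.body.
BODY = <<~HTML
  <a href="before.html">ignored</a>
  <id=xxx>
  <p>some text</p>
  <a href="target.html">wanted</a>
  <a href='later.html'>ignored</a>
HTML

# /m lets "." match newlines, so ".+?" can span several lines; the lazy "?"
# makes it stop at the first "<a href=" that follows the marker.
LINK_AFTER_ID = /<id=xxx>.+?<a href=['"]?([^'"\s>]*)/m

# extract_link is an illustrative helper name.
# Returns the captured href, or nil when the pattern does not match.
def extract_link(body)
  match = LINK_AFTER_ID.match(body)
  match && match[1]
end

puts extract_link(BODY)   # prints "target.html"
```

Note that ".+?" needs at least one character between the marker and the link, so a link placed directly after "<id=xxx>" with nothing in between would be skipped.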
If you are going to scrape anything more complex than what can be handled by a
few regular expressions, then you might want to take a look at
whytheluckystiff's Hpricot library: http://code.whytheluckystiff.net/hpricot/
On 5/14/07, Jim Kronhamn <jim.kronhamn@gmail.com> wrote:
Hello!
I'm trying to do the following:
I want to scrape something off a webpage. It has massive content and I
want to find the thing that comes after the first occurrence of "a
href=" after the occurrence of "id=xxx". So:
if page.body =~ /<id=xxx>.+?<a href=['"]?([^'"\s>]*)/m
  capture = $1
end
If this seems somewhat perlish, it's because a perlmonger taught me this
line.
I'm a recovering Perlmonger, and that's exactly what I'd do in that situation.
--
Giles Bowkett
I'm running a time management experiment: I'm only checking e-mail
twice per day, at 11am and 5pm. If you need to get in touch quicker
than that, call me on my cell.
It incorporates Hpricot and gives you both a higher-level approach and
a way to drop down to Hpricot if needed. I think it's also going to
incorporate FireWatir in the nearish future, or use it somehow (forgot
details).
On 5/15/07, Christian Theil Have <christiantheilhave@gmail.com> wrote:
If you are going to scrape anything more complex than what can be handled by a
few regular expressions, then you might want to take a look at
whytheluckystiff's Hpricot library: http://code.whytheluckystiff.net/hpricot/
It's excellent for scraping web pages.