Hi,
I was reading through Learning Ruby and was trying to get the example on
page 119 (which scrapes Google) to work, but when I run it nothing happens.
Any help would be appreciated. Thanks.
require 'open-uri'

url = "http://www.google.com/search?q=ruby"
open(url) do |page|
  page_content = page.read
  links = page_content.scan(/<a class=1.*?href=\"(.*?)\"/).flatten
  links.each { |link| puts link }
end
···
--
Posted via http://www.ruby-forum.com/.
Hi Charles,
Charles Pareto wrote:
links = page_content.scan(/<a class=1.*?href=\"(.*?)\"/).flatten
I don't think that regular expression (regexp) works. Maybe Google has changed their markup since the book was written. I think the href now comes before the class attribute.
If you work out the correct regexp, let us know.
Cheers,
Mark
Mark Gallop wrote:
I don't think that regular expression (regexp) works. Maybe Google has
changed their markup since the book was written.
As Mark said, Google changed their code somewhat. If you work out the correct regular expression and it still seems to give erratic results, here is a hint: the naive solution uses ".*?" in a certain place, but that will still match too much. Try [^"]*? instead, because you probably don't want to match across quotes. (I just tried this, and that was exactly the problem I ran into.)
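To see why the hint matters, here is a small self-contained sketch using a made-up HTML snippet (the URLs and class name are placeholders, not Google's real markup): a lazy `.*?` can still overrun into an earlier anchor tag, while `[^"]*` can never cross the closing quote.

```ruby
# Two anchors: only the second has class=l, so only "u2" should match.
html = '<a href="u1">plain link</a> <a href="u2" class=l>result link</a>'

# Lazy .*? still overshoots: it crawls forward until it finds '" class=l',
# swallowing the first anchor's quote and everything after it.
lazy = html.scan(/<a href="(.*?)" class=l/).flatten

# [^"]* stops at the first quote, so the first anchor simply fails to
# match and only the class=l link is captured.
safe = html.scan(/<a href="([^"]*)" class=l/).flatten

puts lazy.inspect  # one bogus capture spanning both anchors
puts safe.inspect  # ["u2"]
```

Running it shows the lazy version capturing one long bogus string that spans both anchors, while the character-class version captures just "u2".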
By the way, a robust regex to match all HTML links looks kind of nasty, but perhaps you should try writing one--it's a good exercise. (Of course, that's not what you want for this--you want to match all links of class=l.)
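As a starting point for that exercise, here is one possible attempt — still far from bulletproof, and the attribute order, quoting styles, and sample URLs below are my own assumptions for illustration:

```ruby
# A more tolerant link regex: handles either quote style, extra
# attributes before href, whitespace around =, and mixed case.
LINK_RE = /<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)')/i

html = %q{<A HREF='http://a.example/'>one</A>} +
       %q{ <a id="x" href = "http://b.example/">two</a>}

# scan returns [double_quoted, single_quoted] pairs; one of the two
# capture groups is nil for each match, so take whichever is set.
links = html.scan(LINK_RE).map { |dq, sq| dq || sq }
links.each { |link| puts link }
```

It still misses unquoted href values and will happily match links inside HTML comments, which is part of why the "robust" version gets so nasty.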
Regards,
Dan
As those guys said, Google probably changed their code since the book was written.
That's not necessarily to prevent web scraping; it's just that web sites are pretty transitory. They change all the time and very easily. This makes sophisticated web scraping a moving target.
Yes, web scraping using just open-uri and regular expressions is
pretty low-level.
Try Hpricot or scRUBYt for higher-level, more flexible scraping.
···
2007/8/27, John Joyce <dangerwillrobinsondanger@gmail.com>:
That's not necessarily to prevent web scraping; it's just that web sites
are pretty transitory. They change all the time and very easily. This
makes sophisticated web scraping a moving target.
--
Jaime Iniesta
http://jaimeiniesta.com - http://railes.net - http://freelancegirona.com