Hi,
I have been trying for some time to crack a problem which I am sure is very
simple! I am learning Ruby and programming in general and having lots of
fun, but some help would be appreciated.
I am writing a program to scrape the body text from a series of web pages so
that it can be collected in a text file.
The format for the URLs of the series of pages I am interested in is:
www.targeturl.com/episode_x?=page1
...
...
www.targeturl.com/episode_x?=page[y]
Basically, "episode_x" has y pages, numbered from 1.
I am using Nokogiri to grab the text from each page and can quite easily get
the text from page1. What I want is to loop through page2, page3, and so on,
grabbing the text each time, until I reach page[y], which is where the text
ends. To Nokogiri, this means there is no more text on the page (i.e.
body_text == nil).
Before attempting to grab the body text and append to a text file, my
strategy is to populate an array of 'valid' urls, based on a test which
involves Nokogiri finding text in the body tag, starting at page1. I want
the loop to finish when the test finds body_text == nil, leaving me with a
collection of URLs which I know definitely contain body text.
After a lot of playing around, I have got this far, but there is no looping
going on. I am getting the page okay and am testing for a certain condition
which results in "END!" being appended to the array (essentially, when
body_text == nil). But I can't work out how to loop.
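require 'nokogiri'
require 'open-uri'

# Fetch one page and return the combined text of its div.body_recap elements.
def get_text(base_url, page_number)
  @target_url = base_url + page_number.to_s
  # URI.open is the open-uri entry point on current Rubies (plain open no longer fetches URLs)
  @noko_doc = Nokogiri::HTML(URI.open(@target_url))
  @text = ''
  @noko_doc.css('div.body_recap').each do |node|
    @text << node.content
  end
  # Return after the loop, and use strip rather than strip!,
  # because strip! returns nil when there is nothing to remove.
  @text.strip
end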
# Tests a single page only; there is still no loop here.
def collect_urls(base_url, page_number)
  @valid_urls = []
  text = get_text(base_url, page_number)
  if text =~ /\A\s*Previous/
    @valid_urls << "END!"
  else
    @valid_urls << @target_url
  end
  @valid_urls
end
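
For reference, here is a rough sketch of the kind of loop described above. It is only one possible shape, not a known-good answer. The method name collect_valid_urls, the max_pages safety cap, and the idea of passing the page fetcher in as a block are all assumptions made for illustration; none of them come from the program above. The stop condition combines the nil/empty check with the "Previous" test used in collect_urls.

```ruby
# Hypothetical helper, not part of the original program.
# Walks page1, page2, ... and stops at the first page with no usable text.
# The block receives a URL and must return that page's text (or nil).
def collect_valid_urls(base_url, max_pages: 500)
  valid_urls = []
  1.upto(max_pages) do |page_number|
    url = "#{base_url}#{page_number}"
    text = yield(url)
    # Stop on nil or blank text, or on the "Previous" marker tested for above.
    break if text.nil? || text.strip.empty? || text =~ /\A\s*Previous/
    valid_urls << url
  end
  valid_urls
end
```

With the real scraper, the block would do the fetching, for example by wrapping the Nokogiri call from get_text so that it takes a full URL and returns the stripped text. Keeping the fetch in a block also means the loop can be tried out against a plain hash of fake pages before it touches the network.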
--
Any help or comments very welcome!
Thanks all.
Matt