Populate an array based on object

Hi,

I have been trying for some time to crack a problem which I am sure if very
simple! I am learning Ruby and programming in general and having lots of
fun. But help would be appreciated...

I am writing a program to scrape body text from a series of web pages so
that they can be presented in a text file.

The format for the URLs of the series of pages I am interested in is:

www.targeturl.com/episode_x?=page1
...
...
www.targeturl.com/episode_x?=page[y]

Basically, "episode_x" has y number of pages, starting at 1.

I am using Nokogiri to grab the text from the page and can quite easily get
the text from page1, but I want to loop through page2, grab its text, page3,
grab its text, etc, until I reach page[y] which is where the text ends, and
to Nokogiri - this means there is no more text on that page (i.e. body_text
== nil).

Before attempting to grab the body text and append to a text file, my
strategy is to populate an array of 'valid' urls, based on a test which
involves Nokogiri finding text in the body tag, starting at page1. I want
the loop to finish when the test finds body_text == nil, leaving me with a
collection of URLs which I know to definitely contain body text.

After a lot of playing around, I have got this far, but there is no looping
going on. I am getting the page okay and am testing for a certain condition
which results in "Empty!" being appended to the array (essentially, when
body_text == nil). But I can't work out how to loop.

def get_text(base_url, page_number)

    @target_url = base_url + page_number.to_s
    @noko_doc = Nokogiri::HTML(open(@target_url))

    @text = ''
    @noko_doc.css('div.body_recap').each do |text|
       @text << text.content
       @text = @text.strip!
       return @text
     end
  end

def collect_urls(base_url, page_number)
  @valid_urls = []
  text = get_text(base_url, page_number)
  if text =~ /\A\s*Previous/
    @valid_urls << "END!"
  else @valid_urls << @target_url
    return @valid_urls
  end
end
end

···

--

Any help or comments very welcome!

Thanks all.

Matt

M R Lemon wrote in post #1023844:

But I can't work out how to loop.

def get_text(base_url, page_number)

    @target_url = base_url + page_number.to_s
    @noko_doc = Nokogiri::HTML(open(@target_url))

    @text = ''
    @noko_doc.css('div.body_recap').each do |text|
       @text << text.content
       @text = @text.strip!
       return @text
     end
  end

Aside: there is no need to use instance variables (e.g. @target_url)
here. Local variables would be fine (e.g. target_url).

Instance variables persist within the object instance even when they
return (e.g. another method could see what was assigned to @target_url
previously). This means that you're holding on to references which will
prevent these temporary objects from being garbage-collected.

def collect_urls(base_url, page_number)
  @valid_urls =
  text = get_text(base_url, page_number)
  if text =~ /\A\s*Previous/
    @valid_urls << "END!"
  else @valid_urls << @target_url
    return @valid_urls
  end
end
end

Well, I see no loop in there. Also, the formatting is a bit odd, which
makes it hard to read. What you've actually written is this:

  if text =~ /\A\s*Previous/
    @valid_urls << "END!"
  else
    @valid_urls << @target_url
    return @valid_urls
  end

So I'm not sure exactly what you're trying to achieve, and which of
these two conditions is supposed to be the loop termination, but you
could try this as a starting point:

def collect_urls(base_url, page_number)
  valid_urls =
  1000.times do # prevent infinite looping
    text = get_text(base_url, page_number)
    if text =~ /\A\s*Previous/
      valid_urls << "END!"
      break
    else
      valid_urls << @target_url
      page_number += 1
    end
  end
  return valid_urls
end

Incidentally, you seem to be relying on the instance variable
@target_url being set after get_text being called. It would be cleaner
to return the url and the text as two values.

    return target_url, text

    ...

    target_url, text = get_text(base_url, page_number)

HTH,

Brian.

···

--
Posted via http://www.ruby-forum.com/\.

Brian,

Many thanks for your helpful suggestions. This is working for me now.
Originally, I was trying to avoid 'hard-coding' the loop such as
'1000.times do' but I see now that this is probably the most elegant
solution. Two really good learning points there for me two on the use of
instance variables and efficiently returning and referring to expressions.
Thanks again!

Matt

···

On 26 September 2011 20:22, Brian Candler <b.candler@pobox.com> wrote:

> def collect_urls(base_url, page_number)
> valid_urls =
> 1000.times do # prevent infinite looping
> text = get_text(base_url, page_number)
> if text =~ /\A\s*Previous/
> valid_urls << "END!"
> break
> else
> valid_urls << @target_url
> page_number += 1
> end
> end
> return valid_urls
> end

Incidentally, you seem to be relying on the instance variable
@target_url being set after get_text being called. It would be cleaner
to return the url and the text as two values.

   return target_url, text

   ...

   target_url, text = get_text(base_url, page_number)