Newbie read.scan (?) question

Hi,

I'm trying to get my feet wet with Ruby by tackling a manageable, but
real, issue I'd like to solve.

I'm an academic, and subscribe to some RSS feeds of journals I read.
However, the feeds are really bad, and only contain lists of authors
and titles (with no markup), and links to the issue urls.

So, I want a script that takes those feeds, goes to the issue pages,
grabs the links for the articles, and then from there extracts author
and title information.

For some reason I don't understand, the fragment below works, except
that the author attribute is always blank. The problem is not with my
regular expression pattern.

Could someone explain what I'm doing wrong?

Bruce

# journals is an array of rss feed urls and titles
journals.each do |journal|
  open(journal[1]) do |http|
    response = http.read
    result = RSS::Parser.parse(response, false)

  # grab first issue url listed from each journal
    issue_url = result.items[0].link

  # regular expression patterns to use below
    article_page = /<a href="(.*?)">Article Description<\/a>/
    title_match = /<span class="article-title">(.*?)<\/span>/
    author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

    articles = open(issue_url)
    # find each article url by screen-scraping
    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
      # screen-scrape for article author and title
        title = article.read.scan(title_match)
      # for whatever reason, author never returns anything
        author = article.read.scan(author_match)
      # create new article object
        list.append(Article.new(title, author, article_url))
      end
    end
  end
end

Bruce D'Arcus wrote:

For some reason I don't understand, the below fragment all works,
except for the author attribute is always blank. The problem is not
with my regular expression pattern.

Could someone explain what I'm doing wrong?

Hi Bruce,

I don't know which libraries you're using, but could it be that you can only read once from article, like reading from a file?

Instead of

      open(article_url) do |article|
      # screen-scrape for article author and title
        title = article.read.scan(title_match)
      # for whatever reason, author never returns anything
        author = article.read.scan(author_match)

try something like

   open(article_url) do |article|
   # screen-scrape for article author and title
     article_text = article.read
     title = article_text.scan(title_match)
     author = article_text.scan(author_match)

HTH

Regards,
Pit

article is a stream, and you are trying to read it twice; that doesn't work the way you expect. The first article.read consumes the whole stream, so I guess the second article.read just returns "", and "".scan(...) returns an empty array.
Try the following:

    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        # read the stream once and reuse the text
        articletxt = article.read
        # screen-scrape for article author and title
        title = articletxt.scan(title_match)
        author = articletxt.scan(author_match)
        # create new article object
        list.append(Article.new(title, author, article_url))
      end
    end
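
The read-once behaviour can be reproduced without a network connection. What follows is only a minimal sketch: it assumes a StringIO as a stand-in for the object that open yields, and the HTML snippet is invented for illustration.

```ruby
require 'stringio'

# Stand-in for the stream yielded by open(article_url); the markup is invented.
article = StringIO.new('<span class="article-title">An Example Title</span>')
title_match = /<span class="article-title">(.*?)<\/span>/

first_read  = article.read   # consumes the whole stream
second_read = article.read   # the stream is already at end of file

p first_read.scan(title_match)
p second_read
p second_read.scan(title_match)

# Reading once into a variable lets every pattern see the full text.
article.rewind
article_text = article.read
p article_text.scan(title_match)
```

Keeping the page text in one variable also avoids depending on whether a particular stream can be rewound at all.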

Dominik

Yes, that solved the problem. I had a feeling it was something pretty
simple.

Thanks!

Bruce

One followup.

Why, if I dump my list of article objects to YAML, do I end up with
this:

- !ruby/object:Article
  author:
    -
      - "Hovorka, Alice J."
  title:
    -
      - "The (Re) Production of Gendered Positionality in Botswana's Commercial Urban Agriculture Sector"
  url: http://journals.ohiolink.edu/cgi-bin/sciserv.pl?collection=journals&journal=00045608

I'm referring to the fact that the author and title content aren't
represented the same way as url (which is what I was expecting).

I have these two classes:

class Article

  include Journals

  attr_reader :title, :author, :description, :url
  def initialize(title, author, url)
    @title = title
    @author = author
    @url = url
  end

  def to_s
    "#@title, #@author"
  end

  def abstract
  #
  end

  def refer
    Journals::const_get(:BASE_URL) + "/" +
    @url + "&form=refer&file=file.txt"
  end

  def pdf
    Journals::const_get(:BASE_URL) + "/" +
    @url + "&form=pdf&file=file.pdf"
  end
end

class Articles
#
  attr_reader :articles

  def initialize
    @articles = Array.new
  end

  def append(article)
    @articles.push(article)
    self
  end

  def [](index)
    @articles[index]
  end
end

.... and then:

list = Articles.new

... and at the end:

File.open("articles.yaml", "w") {|f| YAML.dump(list.articles, f)}

Or is everything fine?

Bruce

Hi,

Bruce D'Arcus wrote:

Why if I dump my list of article objects to YAML, do I end up with
this:

- !ruby/object:Article
  author:
    -
      - "Hovorka, Alice J."
  title:
    -
      - "The (Re) Production of Gendered Positionality in Botswana's Commercial Urban Agriculture Sector"
  url: http://journals.ohiolink.edu/cgi-bin/sciserv.pl?collection=journals&journal=00045608

I'm referring to the fact that the author and title content aren't
represented the same way as url (which is what I was expecting).

Because your author and title probably aren't strings, as you expect them to be, but arrays. Try putting a puts @title.inspect somewhere to see what they really are.
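
To make the array point concrete: String#scan with a capture group returns one inner array per match, so a single hit comes back as an array inside an array. The sketch below uses an invented HTML snippet, and String#[] with a capture index is shown only as one possible way to get a plain string.

```ruby
require 'yaml'

# Invented markup, standing in for a downloaded article page.
html = '<span class="article-title">An Example Title</span>'
title_match = /<span class="article-title">(.*?)<\/span>/

nested = html.scan(title_match)   # one inner array per match, one element per group
plain  = html[title_match, 1]     # first capture group of the first match, or nil

puts nested.inspect
puts plain.inspect

# The nested array is what shows up as the doubly indented list in the dump.
puts YAML.dump(nested)
puts YAML.dump(plain)
```

If a page can contain several authors, keeping scan and calling flatten on the result may be the better fit; that choice depends on the pages being scraped.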

I have these two classes:

class Article

  include Journals

  attr_reader :title, :author, :description, :url
  def initialize(title, author, url)
    @title = title
    @author = author
    @url = url
  end

  def to_s
    "#@title, #@author"
  end

  def abstract
  #
  end

  def refer
    Journals::const_get(:BASE_URL) + "/" +
    @url + "&form=refer&file=file.txt"
  end

  def pdf
    Journals::const_get(:BASE_URL) + "/" +
    @url + "&form=pdf&file=file.pdf"
  end
end

class Articles
#
  attr_reader :articles

  def initialize
    @articles = Array.new
  end

  def append(article)
    @articles.push(article)
    self
  end

  def [](index)
    @articles[index]
  end
end

Why create both an Article class and an Articles class? You could move everything in your Articles class into the Article class, but at the class level instead of the instance level. You would just have to turn your @articles variable into @@articles and define your append and [] methods as self.append and self.[].

Another thing: I don't think you need to use Journals::const_get(:BASE_URL). You could simply use Journals::BASE_URL.
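
A small sketch of the constant lookup, assuming a Journals module that defines BASE_URL (the URL value here is a placeholder, not the real one from the script):

```ruby
module Journals
  BASE_URL = "http://journals.example.org"  # placeholder value
end

class Article
  include Journals

  # Inside a class that includes the module, the bare constant resolves too.
  def base
    BASE_URL
  end
end

via_const_get = Journals.const_get(:BASE_URL)  # dynamic lookup by symbol
via_scope     = Journals::BASE_URL             # ordinary scoped constant

puts via_const_get
puts via_scope
puts Article.new.base
```

const_get is mainly useful when the constant name is only known at run time; for a fixed name the scoped form reads more directly.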

HTH

Ghislain

Ghislain Mary wrote:

Because your author and title probably aren't strings as you expect them
to be but rather arrays.

Ah, right. Using scan returns an array. On this ...

> I have these two classes:
>
> class Article
>
> include Journals
>
> attr_reader :title, :author, :description, :url
> def initialize(title, author, url)
> @title = title
> @author = author
> @url = url
> end
>
> def to_s
> "#@title, #@author"
> end
>
> def abstract
> #
> end
>
> def refer
> Journals::const_get(:BASE_URL) + "/" +
> @url + "&form=refer&file=file.txt"
> end
>
> def pdf
> Journals::const_get(:BASE_URL) + "/" +
> @url + "&form=pdf&file=file.pdf"
> end
> end
>
> class Articles
> #
> attr_reader :articles
>
> def initialize
> @articles = Array.new
> end
>
> def append(article)
> @articles.push(article)
> self
> end
>
> def [](index)
> @articles[index]
> end
> end

Why create an Article class and an Articles class?

Because I'm *real* newbie! My only programming background is with
XSLT. So I'm trying to also understand basic OO design in this
example.

You could make all
the content of your Articles class also content of the Article class but
at the class level instead of the instance level. So you just have to
transform your @articles variable into @@articles and define your append
and [] methods as self.append and self.[].

Can you give me an abbreviated example of how to actually do this?
For example, how do I define @@articles under the Article class, and
how would I then define the append method there.

Another thing: I don't think you need to use
Journals::const_get(:BASE_URL). You could simply use Journals::BASE_URL.

Ah thanks. It took me awhile just to get that far!

Bruce

Bruce D'Arcus wrote:

Why create an Article class and an Articles class?

Because I'm *real* newbie! My only programming background is with
XSLT. So I'm trying to also understand basic OO design in this
example.

So welcome to the Ruby community ;-)
I still consider myself a newbie too, and I don't often reply to posts on this list because I often feel I can't contribute much to the discussions. But I learn a lot by reading what happens here :-)

Can you give me an abbreviated example of how to actually do this?
For example, how do I define @@articles under the Article class, and
how would I then define the append method there.

You could do something like:

class Article

   include Journals

   attr_reader :title, :author, :description, :url

   # Create the Array containing the articles.
   @@articles = Array.new

   def initialize(title, author, url)
     @title, @author, @url = title, author, url

     # Add the new Article to the articles array.
     @@articles << self
   end

   def to_s
     "#@title, #@author"
   end

   def refer
     Journals::BASE_URL + "/" + @url + "&form=refer&file=file.txt"
   end

   def pdf
     Journals::BASE_URL + "/" + @url + "&form=pdf&file=file.pdf"
   end

   # Add a class method to get an Article by its index in the @@articles Array.
   def self.[](index)
     @@articles[index]
   end

   # Add a method to get the number of articles.
   # Call it how you want it to be called.
   def self.count
     @@articles.size
   end

end

Good luck,

Ghislain

Oh... I was forgetting.

You don't even need an append method anymore since when you create a new Article it is automatically pushed into the @@articles Array.
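
For illustration, a cut-down sketch of that self-registering pattern. It drops the Journals module and keeps only a title, so it is a simplified stand-in rather than the full class above:

```ruby
class Article
  attr_reader :title

  # One list shared by the whole class.
  @@articles = []

  def initialize(title)
    @title = title
    @@articles << self   # every new instance registers itself
  end

  # Class-level lookup by position in the shared list.
  def self.[](index)
    @@articles[index]
  end

  def self.count
    @@articles.size
  end
end

Article.new("First article")
Article.new("Second article")

puts Article.count
puts Article[1].title
```

Note that with this design every Article ever created stays in the list, whether or not the caller wanted to keep it.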

Ghislain

I have not followed this thread in depth, but I think it is a good
idea to distinguish between a set of articles and an article. I don't
see how you would benefit from mixing these two. If I understand the
proposal correctly, you would no longer be able to maintain two
independent sets of articles, because the ArticleSet would be part of
the article class.

Anyhow, here is how to define a class variable and class methods.

class Klass
  @@foo = []

  def self.add(bar)
    @@foo << bar
  end

  def self.foo
    @@foo
  end
end

Klass.add(1)
Klass.add(2)
p Klass.foo
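
The concern about independent sets can be sketched as follows. Articles is a trimmed version of the class from earlier in the thread; SharedList and the string "articles" are hypothetical names used only for this comparison.

```ruby
# Instance-level storage: each collection object owns its own array.
class Articles
  attr_reader :articles

  def initialize
    @articles = []
  end

  def append(article)
    @articles.push(article)
    self
  end
end

# Class-level storage: there is exactly one list for the whole program.
class SharedList
  @@items = []

  def self.add(item)
    @@items << item
  end

  def self.items
    @@items
  end
end

geography = Articles.new.append("article A").append("article B")
sociology = Articles.new.append("article C")

SharedList.add("article A")
SharedList.add("article C")

p geography.articles
p sociology.articles
p SharedList.items
```

With the separate collection class, one list per journal (or per feed) is just another call to Articles.new.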

good luck with ruby,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

OK, thanks!

And now how do I then access the @@articles array? If before I had:

list = Articles.new

... what would be the equivalent here?

Bruce

Brian Schröder wrote:

Anyhow, here is how to define a class variable and class methods.

class Klass
  @@foo = []

  def self.add(bar)
    @@foo << bar
  end

  def self.foo
    @@foo
  end
end

Klass.add(1)
Klass.add(2)
p Klass.foo

OK, am struggling with translating this to my example. Here's what
I've done:

    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        article_text = article.read
        title = article_text.scan(title_match).to_s
        author = article_text.scan(author_match).to_s
        puts "loading #{title} ...\n"
        a = Article.new(title, author, article_url)
        a.add
      end

.... and then:

File.open("articles.yaml", "w") {|f| YAML.dump(p Article.articles, f)}

But I get a "undefined method `add'" error. I have that part of the
class defined like so:

class Article
  include Journals

  @@articles = []

  attr_reader :title, :author, :url

  def initialize(title, author, url)
    @title = title
    @author = author
    @url = url
  end

  def self.add(article)
    @@articles << article
  end

  def self.articles
    @@articles
  end

  ...

good luck with ruby,

Thanks!

Bruce

Brian Schröder wrote:

I have not followed this thread in depth, but I think it is a good
idea to distinguish between a set of articles and an article. I don't
see how you would benefit from mixing these two. If I understand the
proposal correctly, you would no longer be able to maintain two
independent sets of articles, because the ArticleSet would be part of
the article class.

And actually, I guess the bigger question is how you would deal with
this then? Are you saying I was on the right track originally with my
Articles class? Or would there be some other approach?

Bruce

Bruce D'Arcus wrote:

OK, thanks!

And now how do I then access the @@articles array? If before I had:

list = Articles.new

... what would be the equivalent here?

You can define the following:

class Article

   def self.articles
     @@articles
   end

end

But in fact, as Brian said, it may not be a good idea to store the articles in the Article class. It depends on whether you want to be able to store several groups of articles or only one. I hadn't thought of that because of the way you asked the question: I understood that you were handling only one group of articles, but maybe that's not the case. Still, it's a good opportunity to learn a little about class variables and class methods ;-)

Ghislain

Brian Schröder wrote:

Anyhow, here is how to define a class variable and class methods.

class Klass
  @@foo = []

  def self.add(bar)
    @@foo << bar
  end

  def self.foo
    @@foo
  end
end

Klass.add(1)
Klass.add(2)
p Klass.foo

OK, am struggling with translating this to my example. Here's what
I've done:

   articles.read.scan(article_page).each do |url|
     article_url = "#{base_url}#{url}"
     open(article_url) do |article|
       article_text = article.read
       title = article_text.scan(title_match).to_s
       author = article_text.scan(author_match).to_s
       puts "loading #{title} ...\n"
       a = Article.new(title, author, article_url)
       a.add
     end

add is a class method (see the definition of self.add, which is the
same as saying Article.add), so you would want to call it like

Article.add a # Need to pass the new article in.
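
A self-contained sketch of the difference, using a cut-down Article with only a title (an assumption made to keep the example short):

```ruby
class Article
  attr_reader :title

  @@articles = []

  def initialize(title)
    @title = title
  end

  # Class method: called on Article itself, not on an instance.
  def self.add(article)
    @@articles << article
  end

  def self.articles
    @@articles
  end
end

a = Article.new("An Example Title")

# Calling add on the instance fails, because instances have no such method.
begin
  a.add
rescue NoMethodError => e
  error_message = e.message
  puts "instance call failed: #{error_message}"
end

# Calling it on the class, and passing the instance in, works.
Article.add(a)
puts Article.articles.size
```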

.... and then:

File.open("articles.yaml", "w") {|f| YAML.dump(p Article.articles, f)}

But I get a "undefined method `add'" error. I have that part of the
class defined like so:

class Article
include Journals

@@articles = []

attr_reader :title, :author, :url

def initialize(title, author, url)
   @title = title
   @author = author
   @url = url
end

def self.add(article)
   @@articles << article
end

def self.articles
   @@articles
end

...

good luck with ruby,

Thanks!

Bruce

E

···

On 6/6/2005, "Bruce D'Arcus" <bdarcus.lists@gmail.com> wrote:

--
template<typename duck>
void quack(duck& d) { d.quack(); }

Yes, I'd say you were on the right track. Even if for now you only use
one set of articles (you called this class Articles), I'd say it is
cleaner to have an explicit class, and it's more extensible than having
the Article class contain all its instances.

regards,

Brian
