Newbie: how to find & extract a string from a file

Esmail_Bonakdarian · 29 September 2006 22:45

Hi,

Just starting out to explore Ruby (I like it) and I have
a question.

I have an HTML file that contains several references to jpg files.

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

It appears each .jpg reference is on its own line.

Thanks!

Esmail

Christopher_Aldridge · 30 September 2006 00:11

I'm sure someone will have a better way of doing this.. but..
Assuming it has <img src="/something.jpg">

##
imgs = []
IO.readlines("c:/somefile.html").each do |line|
  # Take the text between <img src=" and the next double quote.
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.include?('<img src="')
end
puts imgs.join("\n")
##

Jordan_Callicoat · 30 September 2006 00:50

Esmail Bonakdarian wrote:

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

Hi there,

You could use raw regexps and do it yourself, but you should probably
use an HTML parser to extract HTML data.
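
For the raw regexp route, a minimal sketch might look like the following. It assumes every reference is a quoted attribute value ending in .jpg or .jpeg; the method name jpg_names and the sample markup are made up for illustration, and a regexp like this will miss unquoted or unusually formatted attributes.

```ruby
# Hypothetical helper: collect the file names of quoted .jpg/.jpeg references.
# Assumes each reference is a quoted attribute value, e.g. src="/pics/cat.jpg".
def jpg_names(html)
  # Capture the text between two quotes when it ends in .jpg or .jpeg
  # (case-insensitive), then keep only the file name part of each path.
  html.scan(/["']([^"']+\.jpe?g)["']/i).flatten.map { |path| File.basename(path) }
end

# Made-up sample markup, one reference per line as in the original question.
sample = <<~HTML
  <p><img src="/pics/cat.jpg"></p>
  <p><a href='photos/Dog.JPEG'>dog</a></p>
  <p><img src="logo.png"></p>
HTML

puts jpg_names(sample)
```

For a file on disk, something like jpg_names(File.read("somefile.html")) should work the same way.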

A nice HTML parser is Hpricot [1], but it requires an extension (you
can get it very easily via gems, see the link below). It is very easy
to use, and it's fast.

Using Hpricot, you can do something like this:

require 'hpricot'
require 'open-uri'
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Hpricot(soc)
soc.close
doc.search('//a').each { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}

Note that you can also use the built-in REXML parser [2], and do
something like:

require 'rexml/document'
require 'open-uri'
include REXML
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Document.new(soc)
soc.close
doc.elements.each('//a') { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}
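
Since the original question was about image references in a local file, here is a rough variation on the REXML approach that looks at img tags in a string rather than a elements fetched from a URL. The method name jpg_sources and the sample markup are invented for this sketch, and it assumes the markup is well-formed (XHTML-style): REXML is an XML parser and is likely to raise a parse error on HTML that is not well-formed.

```ruby
require 'rexml/document'

# Hypothetical helper: list the src of every <img> whose extension is
# .jpg or .jpeg. Assumes the markup is well-formed XML.
def jpg_sources(markup)
  doc = REXML::Document.new(markup)
  sources = []
  doc.elements.each('//img') do |img|
    src = img.attributes['src']
    next if src.nil? # skip <img> tags that have no src attribute
    sources << src if ['.jpg', '.jpeg'].include?(File.extname(src).downcase)
  end
  sources
end

# Made-up, well-formed sample markup.
sample = <<~XHTML
  <html><body>
    <p><img src="/pics/cat.jpg" alt="cat"/></p>
    <p><img src="logo.png" alt="logo"/></p>
    <p><img alt="no source"/></p>
  </body></html>
XHTML

puts jpg_sources(sample)
```

To get just the file name instead of the full path, File.basename(src) can be applied to each result.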

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan

Esmail_Bonakdarian · 30 September 2006 02:40

x1 wrote:

I'm sure someone will have a better way of doing this.. but..
Assuming it has <img src="/something.jpg">

##
imgs = []
IO.readlines("c:/somefile.html").each do |line|
  # Take the text between <img src=" and the next double quote.
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.include?('<img src="')
end
puts imgs.join("\n")
##

Hi,

thanks .. this will get me started. I feel like I could do this
using various unix tools (grep/awk), but I'm trying to learn
Ruby ...

Esmail

Esmail_Bonakdarian · 30 September 2006 02:45

Hi,

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

I appreciate you taking the time to post this and the references.
If you have any other ideas/approaches, I'm game.

Thanks again,

Esmail

Jordan_Callicoat · 30 September 2006 05:20

Esmail Bonakdarian wrote:

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

Glad to help. And yes, REXML is a pure-Ruby parser (it uses regexps under
the hood) and has been included in the Ruby stdlib since 1.8.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

REXML is more portable, albeit not as fast as Hpricot, which is
implemented as a compiled C extension for Ruby.

Have fun learning ruby! It's a nice language.

Regards,
Jordan
