Q: download html file and parse it


#1

Hi,

I want to do the following:
1) connect to an HTTP server and get an HTML file
2) parse the HTML file to retrieve information from it

I want to automate my routine work.

Could you recommend a good library or solution?

regards
kwatch


(Philip Mak) #2

…1) connect http server and get a html file

You can use Net::HTTP. Some documentation for it can be found here:

http://www.rubycentral.com/book/lib_network.html

Search for “Net::HTTP” in that page (it’s about halfway down).
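
For readers on a current Ruby, here is a minimal sketch of the fetch step using Net::HTTP from the standard library. The helper name fetch_html and the example URL in the comment are illustrative assumptions, not something from this thread, and the sketch does not follow redirects.

```ruby
require 'net/http'
require 'uri'

# Fetch a page over HTTP and return its body as a String.
# Raises if the server does not answer with a 2xx status.
def fetch_html(url)
  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)
  unless response.is_a?(Net::HTTPSuccess)
    raise "GET #{url} failed: #{response.code} #{response.message}"
  end
  response.body
end

# Usage (not run here, the URL is only an example):
#   html = fetch_html('http://www.example.com/index.htm')
#   puts html
```

Returning the body as a plain String keeps the download step independent of whichever HTML parser is used afterwards.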

…2) parse a html file to retrieve information from it

Here’s a Ruby module that parses HTML. There may be others (look in
the Ruby Application Archive):

http://www.ruby-lang.org/en/raa-list.rhtml?name=html-parser

I want to automate my routine work.

Assuming you’re on a UNIX system, make a cron job that periodically
runs your ruby script.

···

On Tue, Jun 25, 2002 at 04:32:50PM +0900, kwatch wrote:


(Park Heesob) #3

“kwatch” kwatch@lycos.jp wrote in message
news:cf674456.0206242322.638840b5@posting.google.com

I want to do …
.1) connect http server and get a html file

I like PHP’s “URL fopen wrapper” feature.
Here is my Ruby draft implementation:


class RFile
  require 'net/ftp'
  require 'net/http'
  require 'uri'
  require 'tempfile'

  def initialize(uri, net)
    @uri = uri
    @net = net
  end

  # Open a local path, an http:// URL, or an ftp:// URL.
  def RFile.open(fileName, aMode="r", aPerm=nil, &block)
    uri = URI.parse(fileName)
    case uri.scheme
    when nil
      # No scheme: treat the name as a local file.
      aFile = aPerm ? File.new(fileName, aMode, aPerm) :
                      File.new(fileName, aMode)
      if block_given?
        yield aFile
        aFile.close
        aFile = nil
      end
      aFile
    when 'http'
      Net::HTTP.version_1_1
      new(uri, Net::HTTP.new(uri.host, uri.port))
    when 'ftp'
      user, pass = uri.userinfo.split(':') if uri.userinfo
      new(uri, Net::FTP.new(uri.host, user, pass))
    end
  end

  def read
    case @uri.scheme
    when 'http'
      # With the 1.1 interface, get returns [response, body].
      @net.get(@uri.path)[1]
    when 'ftp'
      dir, file = File.split(@uri.path)
      @net.chdir(dir[1..-1]) if dir != '/'
      data = ''
      @net.retrbinary("RETR #{file}", 1024) {|d| data += d}
      data
    end
  end

  def each(aSepString=$/, &block)
    read.split(aSepString).each(&block)
  end

  def write(data)
    case @uri.scheme
    when 'ftp'
      dir, file = File.split(@uri.path)
      @net.chdir(dir[1..-1]) if dir != '/'
      # Stage the data in a temporary file, then upload it.
      f = Tempfile.new("tmp")
      f.write(data)
      f.close
      @net.putbinaryfile(f.path, file, 1024)
      f.close(true)
      data.length
    end
  end

  def close
    @net.close if @uri.scheme == 'ftp'
  end

end

usage:

RFile.open("/home/path/file.txt", "r").each {|x| puts x}
RFile.open("http://www.example.com/index.htm", "r").each {|x| puts x}
RFile.open("ftp://user:password@example.com/file", "r").read
RFile.open("ftp://user:password@example.com/file", "w").write("test")
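
In current Ruby versions, open-uri from the standard library gives a similar "open a URL like a file" feel for reading over HTTP. A small sketch, assuming Ruby 2.5 or later (for URI.open); the helper name read_anywhere is hypothetical, and it only covers reading, not the FTP write shown above.

```ruby
require 'open-uri'

# Read a whole resource, whether it is a local path or an http(s) URL.
def read_anywhere(location)
  if location =~ %r{\Ahttps?://}i
    # open-uri yields an IO-like object holding the response body.
    URI.open(location) { |io| io.read }
  else
    File.read(location)
  end
end

# Usage (not run here, the path and URL are only examples):
#   read_anywhere("/home/path/file.txt").each_line {|x| puts x}
#   read_anywhere("http://www.example.com/index.htm").each_line {|x| puts x}
```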

Park Heesob.


(Shashank Date) #4

I had to modify the html-parser a little to get it to work.
For example:


#! /usr/local/bin/ruby
require "net/http"
require "html-parser"
require "formatter"

def htmltest(data)
  w = DumbWriter.new
  f = AbstractFormatter.new(w)
  p = HTMLParser.new(f)
  p.feed(data)
  p.close
end

domain = 'www.rubycentral.com'
file = '/book/rubyworld.html'

h = Net::HTTP.new(domain, 80)
resp, data = h.get(file, nil)
puts domain + file if $DEBUG

htmltest(data)

This program generated the following error:

c:/ruby/lib/ruby/site_ruby/html-parser.rb:409:in `Integer': invalid value for Integer: ""1"" (ArgumentError)
        from c:/ruby/lib/ruby/site_ruby/html-parser.rb:409:in `do_img'
        from c:/ruby/lib/ruby/site_ruby/html-parser.rb:395:in `each'
        from c:/ruby/lib/ruby/site_ruby/html-parser.rb:395:in `do_img'
        from c:/ruby/lib/ruby/site_ruby/sgml-parser.rb:281:in `send'
        from c:/ruby/lib/ruby/site_ruby/sgml-parser.rb:281:in `handle_starttag'
        from c:/ruby/lib/ruby/site_ruby/sgml-parser.rb:233:in `finish_starttag'
        from c:/ruby/lib/ruby/site_ruby/sgml-parser.rb:208:in `parse_starttag'
        from c:/ruby/lib/ruby/site_ruby/sgml-parser.rb:89:in `goahead'
        from c:/ruby/lib/ruby/site_ruby/sgml-parser.rb:58:in `feed'
        from htmltest00.rb:10:in `htmltest'
        from htmltest00.rb:21

To fix it, I modified the “do_img” method in html-parser.rb (at line 409),
where it was:

  if attrname == 'width'
    width = Integer(value)
  end
  if attrname == 'height'
    height = Integer(value)
  end

changed to

  if attrname == 'width'
    width = Integer(value.gsub(/['"]/, ''))  # replace all double quotes " and single quotes ' with nothing
  end
  if attrname == 'height'
    height = Integer(value.gsub(/['"]/, ''))
  end

And then it worked.
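
The failure and the workaround can be reproduced outside the parser. A minimal sketch; the helper name to_dimension is made up for illustration and is not part of html-parser.rb.

```ruby
# Integer() is strict: a value that still carries literal quote
# characters, as in the error above, is rejected with ArgumentError.
def to_dimension(value)
  Integer(value.gsub(/['"]/, ''))  # drop stray single and double quotes first
end

raw = '"1"'  # three characters: quote, 1, quote

begin
  Integer(raw)
rescue ArgumentError => e
  puts "without stripping: #{e.class}"
end

puts to_dimension(raw)       # prints 1
puts to_dimension("'240'")   # prints 240
```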

I am not sure if this is the best way to do it ;-) but I thought I should
share it with you.

Also, here is a change I made to sgml-parser.rb at line 57:

  def feed(data)
    @rawdata << data
    goahead(false)
  end

changed to:

  def feed(data)
    @rawdata << data if data  # make sure that data is not nil
    goahead(false)
  end
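
Why the guard matters: String#<< raises a TypeError when it is given nil, so feeding the parser a nil body fails before any parsing starts. A stripped-down sketch, with a hypothetical TinyFeeder class standing in for the real SGML parser:

```ruby
# TinyFeeder is an illustrative stand-in, not the real SGMLParser.
class TinyFeeder
  attr_reader :rawdata

  def initialize
    @rawdata = ''.dup  # unfrozen buffer that can be appended to
  end

  def feed(data)
    @rawdata << data if data  # skip nil instead of raising TypeError
    self
  end
end

feeder = TinyFeeder.new
feeder.feed('<p>hello</p>').feed(nil)
puts feeder.rawdata
```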

HTH,
-- Shanko

“Philip Mak” pmak@animeglobe.com wrote in message
news:20020625073957.GR9237@trapezoid.interserver.net

On Tue, Jun 25, 2002 at 04:32:50PM +0900, kwatch wrote:



(Aidan) #5

I suggest Ned Konz’s html parser, available from RAA. It can return a
REXML tree object which lets you treat the page as if it had been an
XML document.
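
To give a feel for what working with a REXML tree looks like, here is a small sketch. It feeds well-formed markup straight to REXML rather than going through Ned Konz's parser (whose API is not shown in this thread), so the sample markup and the helper name link_targets are assumptions for illustration. On recent Ruby versions REXML ships as the rexml gem.

```ruby
require 'rexml/document'

# Collect the href of every <a> element in a well-formed document.
def link_targets(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }.compact
end

page = <<~HTML
  <html>
    <body>
      <a href="http://www.ruby-lang.org/">Ruby</a>
      <p>See also <a href="/book/rubyworld.html">the book</a>.</p>
    </body>
  </html>
HTML

puts link_targets(page)
```

Real-world HTML is often not well-formed, which is the gap an HTML parser that builds the tree for you is meant to fill.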

Aidan

···

On Tue, Jun 25, 2002 at 04:32:50PM +0900, kwatch wrote:

…2) parse a html file to retrieve information from it