Thread and HTTP troubles

I'm trying to write a threaded program that runs through a list of
web sites and downloads/processes a set number of them at a
time (maintaining a pool of threads that handle the page
downloads/processing). I have something simple working, but I'm
unsure how to approach the thread "pool" idea. Is that even the
right way to process multiple pages simultaneously? Is there a
better way?

Also, how can I deal with a "socket read timeout" error? I have the
HTTP get call wrapped in a begin...rescue...end block, but it doesn't
seem to be catching it. Here is the code in question:

def getHTTP(site)
  siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
  begin
    masterSite = Net::HTTP.new(siteHost,80)
    siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
    resp, data = masterSite.get2(siteURL, nil)
    return data
  rescue
    return "-999"
  end
end

Sorry about the two for one question :stuck_out_tongue:

Thanks!

"Keegan Dunn" <theweeg@gmail.com> wrote in message news:65e6c89204121310527b234a7b@mail.gmail.com...

> I'm trying to write a threaded program that will run through a list of
> web sites and download/process a set number of them at a
> time (maintaining a pool of threads that can process page
> downloads/processing). I have something simple working, but I am
> unsure how to approach the "pool" of threads idea. Is that even the
> way to go about processing multiple pages simultaneously? Is there a
> better way?

It's probably the most efficient way. You need these ingredients:

- a thread safe queue
- a pool of processors
- a main thread that does the distribution of work

You'll also want a class or method that deals with the details of fetching the data and analysing/storing it, to keep the thread body blocks small.

# untested but you'll get the picture
require 'thread'

THREADS = 10
TERM = Object.new
queue = Queue.new
threads = []

THREADS.times do
  threads << Thread.new( queue ) do |q|
    until ( TERM == ( url = q.deq ) )
      begin
        # get data from url
      rescue
        # in case of timeout try again by putting
        # it back
      end
    end
  end
end

# now read urls and distribute work
while ( line = gets )
  line.chomp!
  queue.enq line
end

# write terminators
THREADS.times { queue.enq TERM }

# ... and wait for threads to terminate properly
threads.each {|t| t.join}

# exiting
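To make the retry branch concrete, here is a fleshed-out version of the sketch above. The fetch lambda, the URL names, and MAX_TRIES are my own inventions for illustration (the lambda just simulates a flaky server), and I retry in place rather than re-enqueueing, which avoids racing a re-enqueued job against the terminators:

```ruby
require 'timeout'   # for Timeout::Error; Queue and Mutex are built in

THREADS   = 4
MAX_TRIES = 3
TERM      = Object.new

queue   = Queue.new
results = Queue.new

# Hypothetical stand-in for the real Net::HTTP fetch: the "flaky" URL
# raises Timeout::Error twice before succeeding, to exercise the retry.
attempts = Hash.new(0)
lock     = Mutex.new
fetch = lambda do |url|
  n = lock.synchronize { attempts[url] += 1 }
  raise Timeout::Error, "read timeout" if url.include?("flaky") && n < 3
  "body of #{url}"
end

threads = []
THREADS.times do
  threads << Thread.new do
    until TERM == (url = queue.deq)
      tries = 0
      begin
        results.enq([url, fetch.call(url)])
      rescue Timeout::Error
        tries += 1
        retry if tries < MAX_TRIES   # give up after MAX_TRIES attempts
      end
    end
  end
end

%w[http://ok.example http://flaky.example].each { |u| queue.enq(u) }
THREADS.times { queue.enq(TERM) }
threads.each { |t| t.join }

pages = {}
pages[results.deq.first] = true until results.empty?
```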

> Also, how can I deal with a "socket read timeout" error? I have the
> http get call wrapped in a begin...rescue...end block, but it doesn't
> seem to be catching it. Here is the code in question:
>
> def getHTTP(site)
>   siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
>   begin
>     masterSite = Net::HTTP.new(siteHost,80)
>     siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
>     resp, data = masterSite.get2(siteURL, nil)
>     return data
>   rescue
>     return "-999"
>   end
> end

You'll likely need to catch another exception. Try "rescue Exception => e" and then print e's class.
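As an aside, the stdlib's URI class can do the host/path splitting that the two gsub calls are doing by hand (example.com is just a placeholder URL here):

```ruby
require 'uri'

# URI.parse pulls host, port and path out of the URL directly.
uri  = URI.parse("http://example.com/some/page.html")
host = uri.host   # "example.com"
port = uri.port   # 80 (the http default)
path = uri.path   # "/some/page.html"
```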

> Sorry about the two for one question :stuck_out_tongue:

You get one answer for free. :slight_smile:

Kind regards

    robert

You'll also want to require 'resolv-replace'. Otherwise all of your
threads will block whenever any thread does a name lookup. Hopefully
this won't be needed once Rite gets here...

Leslie Hensley

···

On Tue, 14 Dec 2004 04:57:20 +0900, Robert Klemme <bob.news@gmx.net> wrote:


Robert Klemme said:

You'll likely need to catch another exception. Try "rescue Exception =>
e" and then print e's class.

The error in question is Timeout::Error which inherits from Interrupt
which in turn inherits from SignalException. Since a plain vanilla rescue
clause will only rescue exceptions deriving from StandardError (and
SignalException is not derived from StandardError), it won't pick up this
exception.

If you use

  begin
    # stuff
  rescue Timeout::Error => ex
    # handle timeout
  end

you should be ok.
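The rule that a bare rescue only catches StandardError descendants is easy to demonstrate with a made-up exception class (OutsideStandard is invented for the example):

```ruby
# An exception deliberately outside the StandardError hierarchy.
class OutsideStandard < Exception; end

def bare_rescue
  yield
  "no error"
rescue                     # same as `rescue StandardError`
  "caught by bare rescue"
end

a = bare_rescue { raise "boom" }   # RuntimeError < StandardError, so it's caught
b = begin
      bare_rescue { raise OutsideStandard, "slips through" }
    rescue OutsideStandard
      "needs an explicit rescue"
    end
```

(On current Rubies, Timeout::Error has since been reparented under StandardError, but rescuing it explicitly works everywhere.)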

···

--
-- Jim Weirich jim@weirichhouse.org http://onestepback.org
-----------------------------------------------------------------
"Beware of bugs in the above code; I have only proved it correct,
not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)

I noticed the threads were doing that. I meant to ask about that as
well. Thank you for the help, Leslie and Robert.

···

On Tue, 14 Dec 2004 05:27:19 +0900, Leslie Hensley <hensleyl@gmail.com> wrote:


Thank you for the elaboration.

···

On Tue, 14 Dec 2004 07:21:06 +0900, Jim Weirich <jim@weirichhouse.org> wrote:
