Ruby from command line timing out?

I'm running a script from the command line that's going to take a couple of hours to complete. Between 15 and 20 minutes into its run, the script throws an execution expired (Timeout::Error). Is there an environment variable that I should be looking at modifying? The error message in its entirety is:

/usr/local/lib/ruby/1.8/timeout.rb:42:in `new': execution expired (Timeout::Error)
         from ./spider.rb:6334:in `join'
         from ./spider.rb:6334
         from ./spider.rb:6334:in `each'
         from ./spider.rb:6334

···

--
Jason N Perkins
<http://sneer.org/>

Is it safe to guess, based on the name of the script, that it spiders web pages? If that's the case, Timeout::Errors are going to happen quite frequently, whenever a particular web page loads too slowly.

···

On Jan 8, 2005, at 8:08 PM, Jason N.Perkins wrote:

I'm running a script from the command line that's going to take a couple of hours to complete. Between 15 and 20 minutes into its run, the script throws an execution expired (Timeout::Error). Is there an environment variable that I should be looking at modifying? The error message in its entirety is:

/usr/local/lib/ruby/1.8/timeout.rb:42:in `new': execution expired (Timeout::Error)
        from ./spider.rb:6334:in `join'
        from ./spider.rb:6334
        from ./spider.rb:6334:in `each'
        from ./spider.rb:6334

--
Jason N Perkins
<http://sneer.org/>

Francis Hwang

I'm catching those errors with no problem with a 'rescue'. This seems to be specific to the script itself.

···

On Jan 8, 2005, at 7:14 PM, Francis Hwang wrote:

Is it safe to guess, based on the name of the script, that it spiders web pages? If that's the case, Timeout::Errors are going to happen quite frequently, whenever a particular web page loads too slowly.

--
Jason N Perkins
<http://sneer.org/>

Can you post the code?

Bill

···

On Sun, 9 Jan 2005 10:19:39 +0900, Jason N. Perkins <jperkins@sneer.org> wrote:

On Jan 8, 2005, at 7:14 PM, Francis Hwang wrote:

> Is it safe to guess, based on the name of the script, that it spiders
> web pages? If that's the case, Timeout::Errors are going to happen
> quite frequently, whenever a particular web page loads too slowly.

I'm catching those errors with no problem with a 'rescue'. This seems
to be specific to the script itself.

--
Jason N Perkins
<http://sneer.org/>

--
$stdout.sync = true
"Just another Ruby hacker.".each_byte do |b|
  ('a'..'z').step do|c|print c+"\b";sleep 0.007 end;print b.chr
end; print "\n"

Sure. The blogs variable is an array of the urls of blogs - I intend to eventually have these urls stored in MySQL, but for now an array works. I emptied that array so that those sites that I have in it aren't getting hit by too many people trying to help out. The threading is derived from a sample in "Programming Ruby." I'd love any additional feedback outside of dealing with the timeout issue.

#! /usr/local/bin/ruby -w

require 'open-uri'
require 'thread'

blogs = []

buffer=Queue.new

# load the blogs into the queue
blogs.each do |blog|
   buffer.enq( blog )
end

consumers = (1..150).map do |i|
   Thread.new("consumer #{i}") do |name|
     begin
       blog = buffer.deq
       open( blog ) do |content|
         begin
           metas = content.read.scan( /<meta([^(>]*)>/m ).uniq
           metas.each do |current_meta|
             current_meta = current_meta.to_s

             if current_meta =~ /\s+name\s*=\s*[\"']([^\"']+)[\"']/
               name = $1
               current_meta =~ /\s+content\s*=\s*[\"']([^\"']+)[\"']/
               content = $1

               case name
               when "geo.position"
                 print "#{blog} \t #{content} \n"

               when "ICBM"
                 print "#{blog} \t #{content} \n"
               end
             end
           end
         rescue Exception
           p "#{blog}: $! \n"
         end
       end
     end until buffer == :END_OF_WORK
   end
end

begin
   consumers.size.times{ buffer.enq(:END_OF_WORK) }
   consumers.each{|th| th.join}
rescue Exception
   print $!
end

···

On Jan 8, 2005, at 7:21 PM, Bill Atkins wrote:

Can you post the code?

--
Jason N Perkins
<http://sneer.org/>

Jason,

Is the line 6334 that shows up in the traceback this line:

  consumers.each{|th| th.join}

And one tip, which may not have anything to do with this problem but might make your code easier to understand and/or debug: Since threading is so bloody difficult, I try to make it affect as little of the program as possible. In a case like your code, for example, I would've let the threaded part simply handle the loading of the web pages, but let the parsing happen afterward when all the threads have been joined again. This is how FeedBlender (http://feedblender.rubyforge.org/) does it, so that way if there's a bug I can figure out if it's because of the threading or not.
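
A minimal sketch of that split, assuming canned page bodies in place of real downloads so that it runs offline. The `pages` hash, the `fetch` lambda, and the example URLs are all made up for illustration, and this is not FeedBlender's actual code. The threads only fetch; the meta-tag parsing runs in the main thread once every thread has finished.

```ruby
# Hypothetical canned responses standing in for open(url).read.
pages = {
  "http://a.example/" => '<html><meta name="ICBM" content="40.7, -74.0"></html>',
  "http://b.example/" => '<html><meta name="author" content="nobody"></html>'
}
fetch = lambda { |url| pages.fetch(url) }

# Threaded part: download only. Thread#value joins the thread and
# returns the result of its block.
threads = pages.keys.map do |url|
  Thread.new(url) { |u| [u, fetch.call(u)] }
end
bodies = threads.map { |th| th.value }

# Single-threaded part: parse after every thread has been joined.
positions = {}
bodies.each do |url, html|
  html.scan(/<meta([^>]*)>/m).flatten.each do |attrs|
    next unless attrs =~ /\sname\s*=\s*["']([^"']+)["']/
    next unless ["geo.position", "ICBM"].include?($1)
    positions[url] = $1 if attrs =~ /\scontent\s*=\s*["']([^"']+)["']/
  end
end

positions.each { |url, pos| puts "#{url}\t#{pos}" }
```

If something goes wrong in the second half, threading can be ruled out, because by then only one thread is running.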

···

On Jan 8, 2005, at 8:29 PM, Jason N.Perkins wrote:

On Jan 8, 2005, at 7:21 PM, Bill Atkins wrote:

Can you post the code?

--
Jason N Perkins
<http://sneer.org/>

Francis Hwang

["Jason N.Perkins" <jperkins@sneer.org>, 2005-01-09 02.29 CET]

> begin
>   consumers.size.times{ buffer.enq(:END_OF_WORK) }
>   consumers.each{|th| th.join}
> rescue Exception
>   print $!
> end

I think that when the thread being joined raises a Timeout::Error, join
re-raises it in the main thread, so the program finishes and the other
threads are never joined. Maybe you should put the begin...rescue around
the join (inside the each).

Hope this helps. Good luck.
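
A minimal sketch of that suggestion, assuming three stand-in threads (one of which fails the way a slow fetch would) instead of the real consumers; the `failures` array is made up for illustration. Thread#join re-raises the exception that killed the thread, so rescuing inside the each lets the loop go on to join the rest.

```ruby
require 'timeout'

# Keep the interpreter from also printing a report for the dying thread
# (Ruby 2.4+; drop this line on older versions).
Thread.report_on_exception = false

# Hypothetical stand-ins for the consumer threads.
consumers = [
  Thread.new { :ok },
  Thread.new { raise Timeout::Error, "execution expired" },
  Thread.new { :ok }
]

failures = []
consumers.each do |th|
  begin
    th.join                      # re-raises whatever killed the thread
  rescue Timeout::Error => e
    failures << e                # note it and keep joining the others
  end
end

puts "joined #{consumers.size} threads, #{failures.size} timed out"
```

With the rescue outside the each, as in the original script, the first failed join would end the loop and leave the remaining threads unjoined.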

> Jason,
>
> Is the line 6334 that shows up in the traceback this line:
>
>   consumers.each{|th| th.join}

Yeah, that's the line that's timing out, which is why I was wondering if there's a global timeout value for the script that I can either raise or turn off completely.

> And one tip, which may not have anything to do with this problem but might make your code easier to understand and/or debug: Since threading is so bloody difficult, I try to make it affect as little of the program as possible. In a case like your code, for example, I would've let the threaded part simply handle the loading of the web pages, but let the parsing happen afterward when all the threads have been joined again. This is how FeedBlender (http://feedblender.rubyforge.org/) does it, so that way if there's a bug I can figure out if it's because of the threading or not.

OK, I'll give that a try. Thanks, Francis!

···

On Jan 9, 2005, at 9:33 AM, Francis Hwang wrote:

--
Jason N Perkins
<http://sneer.org/>

Timeout::Error comes from timeout.rb.

Your Timeout::Error probably comes out of HTTP (net/http); open-uri itself doesn't require timeout and has no timeout blocks.

Try Thread.abort_on_exception = true at the top of your script, and remove the begin/end block inside the thread.
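
A minimal sketch of what that setting changes, assuming a stand-in thread that times out on a sleep instead of on a real HTTP request; the `seen` variable is made up for illustration. Timeout.timeout, from timeout.rb, is what raises Timeout::Error, and with Thread.abort_on_exception = true the exception is re-raised in the main thread as soon as the worker dies rather than waiting for a join.

```ruby
require 'timeout'

Thread.abort_on_exception = true

seen = nil
begin
  # Stand-in for a slow fetch: the block outlives its 0.1 second budget,
  # so Timeout.timeout raises Timeout::Error inside the thread.
  Thread.new { Timeout.timeout(0.1) { sleep 1 } }
  sleep 2   # the main thread is interrupted here, well before 2 seconds
rescue Timeout::Error => e
  seen = e
end

puts "main thread saw #{seen.inspect}"
```

In the real script there would be no rescue around it, so the run would stop at the first failing fetch with a backtrace pointing into the thread, which shows where the timeout is actually raised.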

···

On 09 Jan 2005, at 09:42, Jason N.Perkins wrote:

On Jan 9, 2005, at 9:33 AM, Francis Hwang wrote:

Jason,

Is the line 6334 that shows up in the traceback this line:

  consumers.each{|th| th.join}

Yeah, that's the line that's timing out and why I was wondering if there's a global timeout value for the script that I can either modify up or turn off completely.

--
Eric Hodel - drbrain@segment7.net - http://segment7.net
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04