Memory leak

Your solution is good and simple, but there is one problem... I'll have 1-2 million links in my database to crawl. With only one process it will take months.
Besides, URLs in my list are sometimes very slow, and response times may take up to 10-20-30 seconds.
I know the best practice would be multiple threads with asynchronous sockets, but this is too complicated for me right now, maybe in the next versions.

Might want to look at EventMachine: http://rubyeventmachine.com/

example (from lib/em/protocols/httpclient2.rb)

    # === Usage

    #
    # EM.run{
    #   conn = EM::Protocols::HttpClient2.connect 'google.com', 80
    #
    #   req = conn.get('/')
    #   req.callback{ |response|
    #     p(response.status)
    #     p(response.headers)
    #     p(response.content)
    #   }
    # }

You could create multiple such concurrent connections. And if your OS
supports it, EventMachine will make use of kqueue or epoll behind the
scenes for efficient handling of large numbers of I/O handles.

This is a simplistic example, but should work in principle:

    require 'eventmachine'   # provides EM::Protocols::HttpClient2

    sites = %w(aaa.com bbb.com ccc.com ddd.com eee.com fff.com) # etc...

    EM.run do
      conns = sites.map { |dn| EM::Protocols::HttpClient2.connect(dn, 80) }
      conns.each do |conn|
        req = conn.get('/')
        req.callback do |response|
          p(response.status)
          p(response.headers)
          p(response.content)
        end
      end
    end

This all takes place in a single thread.

Regards,

Bill

I'm sure he didn't mean (only) 1 process but several processes each
running a subset of the input space. It could also have the benefit
of utilizing multiple processing cores on one machine or across
multiple machines:

http://raa.ruby-lang.org/project/rq/
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/298739
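
Something along those lines would split the input across a handful of forked workers. Just a sketch: the WORKERS count and the plain open-uri fetch are placeholders, and bases/base.txt is the list file used later in the thread:

    require 'open-uri'
    require 'timeout'

    WORKERS = 4   # assumption: roughly one worker per core
    urls = File.readlines('bases/base.txt').map { |l| l.chomp }

    # Fork one worker per slice; child i takes every WORKERS-th URL.
    pids = (0...WORKERS).map do |i|
      fork do
        urls.each_with_index do |url, idx|
          next unless idx % WORKERS == i
          begin
            open(url) { |f| puts "#{url}: #{f.read.length} bytes" }
          rescue StandardError, Timeout::Error => e
            warn "#{url}: #{e.message}"
          end
        end
      end
    end

    pids.each { |pid| Process.wait(pid) }   # reap all the children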

···

On Thu, Oct 22, 2009 at 1:12 PM, Rob Doug <broken.m@gmail.com> wrote:

John W Higgins wrote:
> Evening Rob,
>
> May I ask why you need threads and that level of complication? Are you
> really so sensitive to speed that it matters whether you simply forked
> processes that died after 100 downloads and just started another worker?
> It would seem easier to use a nice, simple, independent message queue
> and just fire up workers that grab messages and download away.
> I know it's not the sexy option, but what exactly are you gaining by
> using threads here? It's much easier to get a nice single-threaded
> worker tuned up and tight than what you appear to be going through here.
Your solution is good and simple, but there is one problem... I'll have
1-2 million links in my database to crawl. With only one process it will
take months.
Besides, URLs in my list are sometimes very slow, and response times may
take up to 10-20-30 seconds.
I know the best practice would be multiple threads with asynchronous
sockets, but this is too complicated for me right now, maybe in the next
versions.

Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.

ruby 1.8.6 (2008-08-11 patchlevel 287) - it's strange because I
updated it a week ago :)

Right now I'm testing on ruby 1.8.7 ... the issue still exists :(
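
If you want to see whether that Array#shift behaviour is in play, it can be checked in isolation with something like this (a sketch; rss_kb is a made-up helper that shells out to ps, so Unix-only, and the element count is arbitrary):

    def rss_kb
      `ps -o rss= -p #{Process.pid}`.to_i   # resident set size in KB
    end

    a = []
    500_000.times { |i| a << "item #{i}" }
    puts "after push:  #{rss_kb} KB"

    a.shift until a.empty?
    GC.start
    puts "after shift: #{rss_kb} KB"   # reportedly stays high on affected 1.8.6 builds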

···


Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.

ruby 1.8.6 (2008-08-11 patchlevel 287) - it's strange because I
updated it a week ago :)

Wow, when I used ruby 1.8.6, the maximum amount of memory for the program
was 500-600 MB... with ruby 1.8.7 it can easily get to more than 1 GB.

···


You create threads and fork a process for every single item to process. This has some consequences:

- your threads will eat all the entries in the queue very quickly
- you will get a large number of processes immediately

In this setup you need neither threads nor a queue. Basically you just need to iterate the input list and fork off a process for every item you meet. However, then you have no control over concurrency and your CPU will suffer. With the setup you presented, you should at least have each thread wait for its process to return, so a single thread does not fork off more than one process at a time.
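
A rough sketch of that per-thread wait, with the Mechanize part stubbed out as process_link (a placeholder, not the original code):

    require 'thread'

    THREADS = 50
    q = SizedQueue.new(THREADS * 2)

    def process_link(link)
      puts link   # placeholder for the real Mechanize fetch
    end

    threads = (1..THREADS).map do
      Thread.new(q) do |qq|
        until qq.equal?(my_link = qq.deq)
          pid = fork { process_link(my_link) }
          Process.wait(pid)   # each thread waits for its own child,
        end                   # so at most THREADS children run at once
      end
    end

    File.foreach("bases/base.txt") { |line| q.enq(line.chomp) }
    threads.size.times { q.enq(q) }   # the queue itself is the stop sentinel
    threads.each { |t| t.join }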

Kind regards

  robert

···

On 10/23/2009 11:28 PM, Rob Doug wrote:

Well, it seems I found a solution...
I tried to run some tests in Python as well. A simple script like the one previously posted eats memory in Python too... and the only way out I found was to use forks. I checked out forkoff, but it produced some strange bugs. This is the working code:

threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(myLink = qq.deq)
      mutex.synchronize do
        puts ($n += 1).to_s # + " : " + print_class_counts.to_s
      end
      fork do # <----- you need to fork here; when the child exits, its memory is released
        begin
          agent = WWW::Mechanize.new { |a|
            a.history.max_size = 1
            a.open_timeout = 20
            a.read_timeout = 40
            a.user_agent_alias = 'Windows IE 7'
            a.keep_alive = false
          }
          page = agent.get(myLink)
          puts myLink
          puts page.forms.length

          page.forms.each do |form|
            # forms would be processed here
          end
        rescue
          # ignore fetch errors and move on to the next link
        end
      end
    end
  end
end

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Well, it seems the error is not in my code. I made a simple version of the
crawler; there just couldn't be a mistake of mine in it. But the program
still has a memory leak :( ... you can check yourself, here is the code
(after 2000 random URLs, memory usage is >200 MB):

require 'rubygems'
require 'mechanize'
require 'thread'

mutex = Mutex.new

threads = []
$n = 0

THREADS = 50
q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(myLink = qq.deq)
      mutex.synchronize do
        puts ($n += 1).to_s # + " : " + print_class_counts.to_s
      end
      begin
        agent = WWW::Mechanize.new { |a|
          a.history.max_size = 1
          a.open_timeout = 20
          a.read_timeout = 40
          a.user_agent_alias = 'Windows IE 7'
          a.keep_alive = false
        }
        page = agent.get(myLink)
        puts myLink
        puts page.forms.length

        page.forms.each do |form|
        end
      rescue
      end
    end
  end
end

File.foreach("bases/base.txt") do |line|
  line.chomp!
  q.enq(line)
end

threads.size.times { q.enq q }   # enqueue the queue itself as a stop sentinel for each thread
sleep(120)

threads.each { |t| t.join }
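
For reference, a print_class_counts along the lines of the commented-out call above can show which classes are accumulating between requests (a guess at that helper, using ObjectSpace; the real one may differ):

    # Tally live objects per class and report the biggest offenders.
    def print_class_counts(top = 10)
      counts = Hash.new(0)
      ObjectSpace.each_object { |obj| counts[obj.class] += 1 }
      counts.sort_by { |_, n| -n }.first(top).map { |k, n| "#{k}=#{n}" }.join(' ')
    end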

It's very sad, because I like Ruby very much, but it seems it does not fit
my projects :(

···


You create threads and fork a process for every single item to process.
This has some consequences:

- your threads will eat all the entries in the queue very quickly
- you will get a large number of processes immediately

In this setup you need neither threads nor a queue. Basically you just
need to iterate the input list and fork off a process for every item
you meet. However, then you have no control over concurrency and your
CPU will suffer. With the setup you presented, you should at least have
each thread wait for its process to return, so a single thread does not
fork off more than one process at a time.

Sure, you're right, I forgot to write it here, but in my own code there
is a Process.wait after each fork :)

···
