Memory leak

Hi all,
I'm writing a web crawler with thread support, and after working through some number of links the memory usage keeps increasing. When the program has just started it uses about 20 MB of memory. After crawling 150-200 links, usage is around 100 MB. After 1000 links the program may use up to 1 GB. Please help me figure out why.

require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'yaml'
require 'net/http'
require 'uri'
require 'modules/common'

Thread.abort_on_exception = true
$config = YAML.load_file "config.yml"
links = IO.read("bases/3+4.txt").split("\n")
threads = []

links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    puts "total threads: " + Thread.list.size.to_s
    redo
  end
end

threads.each { |t| t.join() }

What in "Common" module:
1. Crawler (net/http or mechanize - I tried both, results the same)
2. HTML parser (Hpricot or Nokogir - I tried both again, with same bad
results)
so I extract some data from page and save it to the file. Nothing
special as you see.

When I start this program without threads I getting the same results :frowning:

Please help, is this my fault or something wrong in the libraries ?

···

--
Posted via http://www.ruby-forum.com/.

Rob Doug wrote:


One culprit could be mechanize. Mechanize keeps a history of the pages
you've visited, and if you don't limit that length it will grow infinitely.

Try setting the max history:

  http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000173
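
For example, a minimal sketch of capping the history, assuming the pre-1.0 WWW::Mechanize API used elsewhere in this thread:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.history.max_size = 10  # keep at most 10 pages in the history
# or, via the accessor documented at the link above:
# agent.max_history = 10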

···

--
Aaron Patterson
http://tenderlovemaking.com/

links = IO.read("bases/3+4.txt").split("\n")
threads = []

links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    puts "total threads: " + Thread.list.size.to_s
    redo
  end
end

threads.each { |t| t.join() }

I can't speak for your supposed "leaks" (pure ruby almost never has leaks, but it does have many sneaky ways to make valid object references that you can overlook), since I can't figure out what your code does or how it is supposed to work. threads is an array of thread objects that you only ever push on and you never clear or shift off of. Also, why oh why oh why are you doing a redo??? Overly complicated code may not be the source of your leak (and in this case it may very well be), but it certainly isn't doing you any favors.

I almost always structure my threads as a pool of workers pulling from a queue of tasks:

···

On Oct 19, 2009, at 18:10 , Rob Doug wrote:

links = Queue.new
IO.read("bases/3+4.txt").split("\n").each do |l|
  links << l
end
threads = []

50.times do
  threads << Thread.new do
    until links.empty? do
      Common.post_it links.shift
    end
  end
end

threads.each { |t| t.join() }

As far as I can see you're never cleaning the threads Array. You are just appending to it, so even terminated threads will stay in there, because you also do not reuse threads.

Design criticism: usually you create threads up front and pass work to them via a Queue; then you don't need the sleep crutch and can rely on blocking instead. Also, you can make your code handle large volumes of URLs better by using a more streamed approach along the lines Ryan presented, limiting all resources (not only the threads but also the size of the queue).

require 'thread'
...

THREADS = 50
q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    # the queue object itself is used as the end-of-work marker
    until qq.equal?(url = qq.deq)
      Common....
    end
  end
end

File.foreach("bases/3+4.txt") do |line|
  line.chomp!
  q.enq(URI.parse(line))
end

# enqueue one end-of-work marker (the queue itself) per thread
threads.size.times { q.enq q }

Kind regards

  robert

···

On 20.10.2009 03:10, Rob Doug wrote:


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

In a .NET Windows application a form might have resource leaks even though it is running with managed code. You can use the procedure below to check whether a form has a resource leak.
Open Windows Task Manager.
Click on the Processes tab.
Select "View" in the menu and then select the "Select Columns" menu item.
Check the USER Objects and GDI Objects check boxes to make them appear in the process list header.
In the code in your project that launches the WinForm, please ensure you have called the Dispose method. Forms implement the IDisposable interface, so their Dispose method must be called the moment they are no longer needed (Test 1). You can call Dispose explicitly or, even better, instantiate the form with a using clause (Test 2).
For troubleshooting purposes you can call GC.Collect() after the using statement or after the call to Dispose.
Now launch the form and note the USER Objects and GDI Objects values in Task Manager. Close the form after some time and, once it is closed, note the values again. The values should decrease; if they increase, there is a leak in the form.
Fix the resource leak and remove the call to GC.Collect(). It is generally unnecessary to call GC.Collect() explicitly.

···

--
Posted via http://www.ruby-forum.com/.

One culprit could be mechanize. Mechanize keeps a history of the pages
you've visited, and if you don't limit that length it will grow
infinitely.

Try setting the max history:

  http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000173

Thanks, but I've already tried this... the memory leak happens even when I use "net/http" with Hpricot.

···

--
Posted via http://www.ruby-forum.com/.

Ryan Davis wrote:


I almost always structure my threads as a pool of workers pulling from
a queue of tasks:

actually I had this kind of code at first:

links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink,doorway)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    threads = threads.delete_if {|t| t.status == false}
    GC.start
    redo
  end
end

but it did not help, so I reduced the code.

···

On Oct 19, 2009, at 18:10 , Rob Doug wrote:

--
Posted via http://www.ruby-forum.com/.


Thank you very much, your solution is great. The code has become much clearer and looks good. :)

But this does not solve the memory leak :(

Any ideas or suggestions?

···

--
Posted via http://www.ruby-forum.com/.

Rob Doug wrote:

Thanks, but I've already tried this... the memory leak happens even when I use "net/http" with Hpricot.

You'll have to share more of your implementation then. You deduced that
the cause wasn't threads, but you've only shared the threaded part of
your code.

Also, have you run your program with valgrind to verify that it is in
fact a memory leak? It could just be objects sitting around waiting for GC.
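
A rough Ruby-level check of that last point (only a sketch, not a substitute for valgrind): force a GC run and compare live object counts before and after a batch of links.

GC.start
before = ObjectSpace.each_object(Object) { }  # returns the number of objects walked
# ... crawl a batch of links here ...
GC.start
after = ObjectSpace.each_object(Object) { }
puts "live objects: #{before} -> #{after}"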

···

--
Aaron Patterson
http://tenderlovemaking.com/

yeah. I don't care. it is still really obfuscated. Switch to a simple pool of workers pulling tasks out of a queue. Don't do redo for no reason. Don't do crazy stuff.

···

On Oct 20, 2009, at 07:29 , Rob Doug wrote:

I almost always structure my threads as a pool of workers pulling from
a queue of tasks:

actually I had this kind of code at first:

You could print out object statistics to get an idea about the source of the leak. Something along the lines of:

require 'pp'

def print_class_counts
   c = Hash.new 0
   ObjectSpace.each_object(Object) do |o|
     c[o.class] += 1
   end
   pp c
end

Cheers

  robert

···

On 21.10.2009 00:47, Rob Doug wrote:


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

You'll have to share more of your implementation then. You deduced that
the cause wasn't threads, but you've only shared the threaded part of
your code.

Also, have you run your program with valgrind to verify that it is in
fact a memory leak? It could just be objects sitting around waiting for
GC.

Here is the rest of the program. I didn't think there could be any leaks here, which is why I posted the code without this part.

module Common
  def Common.post_it(link)
    begin
      puts link
      agent = Net::HTTP
      html = agent.get(URI.parse(link))
      doc = Hpricot(html)
      forms = doc.search "//form"
      forms.each do |form|
            puts form.search("input").length
      end
    rescue => err
      puts err
      file_write_safe("logs/errorlog.txt", "#{err.to_s}|#{link}\n")
      return false
    rescue Timeout::Error
      file_write_safe("logs/errorlog.txt", "Request Timeout|#{link}\n")
    end
  end

  private

  def self.file_write_safe(file,message)
    File.open(file, 'a') { |f|
      f.flock(File::LOCK_EX)
      f.write(message)
      f.flock(File::LOCK_UN)
    }
  end
end

···

--
Posted via http://www.ruby-forum.com/.


I retested my script on a 10k+ database and it seems it stops growing once it reaches about 550 MB... I'm running 50 threads, so roughly 10 MB per thread. As I understand it, this is still not good?

Robert, I did what you proposed and here is some data. I print all classes that have more than 200 instances.

Here object after 100 URLs:
"Class: 1039"
"Thread: 395"
"Gem::Version::Part: 343"
"Regexp: 1463"
"MatchData: 301"
"Hpricot::Text: 31848"
"Gem::Version: 424"
"Array: 24977"
"Nokogiri::XML::Element: 1636"
"WWW::Mechanize::Chain: 203"
"Gem::Requirement: 380"
"WWW::Mechanize::Form::Field: 627"
"String: 199615"
"Hpricot::Elem: 21211"
"Nokogiri::XML::SyntaxError: 1562"
"Hash: 15884"
"Proc: 949"
"WWW::Mechanize::Form::Option: 458"

Here after 400 URLs:
"Class: 890"
"Gem::Version::Part: 343"
"Regexp: 1089"
"MatchData: 264"
"Hpricot::BogusETag: 1573"
"Hpricot::Text: 82593"
"Gem::Version: 424"
"Array: 53259"
"Nokogiri::XML::Element: 2982"
"WWW::Mechanize::Form::Button: 259"
"WWW::Mechanize::Chain: 242"
"Gem::Requirement: 380"
"WWW::Mechanize::Form::Field: 1179"
"String: 426729"
"WWW::Mechanize::Form::RadioButton: 351"
"Nokogiri::XML::SyntaxError: 6464"
"Hash: 37025"
"Hpricot::Elem: 55426"
"Proc: 774"
"WWW::Mechanize::Form: 338"
"WWW::Mechanize::Form::Option: 601"

And here after 500 URLs:
"Class: 991"
"Thread: 354"
"Gem::Version::Part: 343"
"Regexp: 847"
"MatchData: 409"
"Hpricot::BogusETag: 915"
"Hpricot::Text: 31707"
"Gem::Version: 424"
"Array: 24143"
"Nokogiri::XML::Element: 2076"
"WWW::Mechanize::Form::Button: 217"
"WWW::Mechanize::Chain: 218"
"Gem::Requirement: 380"
"WWW::Mechanize::Form::Field: 874"
"String: 197836"
"Nokogiri::XML::SyntaxError: 2608"
"Hash: 14719"
"Hpricot::Elem: 21530"
"Proc: 925"
"WWW::Mechanize::Form: 266"
"WWW::Mechanize::Form::Option: 416"

As you can see, there is no class whose instance count increases steadily.

What more can I do?

···

--
Posted via http://www.ruby-forum.com/.

Evening Rob,

May I ask why you need threads and that level of complication? Are you really that sensitive to speed that it matters whether you simply fork processes that die after 100 downloads and then start another worker? It would seem easier to use a nice, simple, independent message queue and just fire up workers that grab messages and download away.

I know it's not the sexy option, but what exactly are you gaining by using threads here? It's much easier to get a nice single-threaded worker tuned up and tight than what you appear to be going through here. In fact - I might argue that you would be better off with a downloader and separate parser processes that worked independently via messages.

Just a thought....

John
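
A rough sketch of what John is describing, reusing Common.post_it and the link file from the original post (the batch size and the one-child-at-a-time loop are just illustrative assumptions):

require 'enumerator'  # for each_slice on Ruby 1.8

BATCH_SIZE = 100
links = IO.read("bases/3+4.txt").split("\n")

links.each_slice(BATCH_SIZE) do |batch|
  pid = fork do
    batch.each { |link| Common.post_it(link) }
    exit!  # skip at_exit handlers inherited from the parent
  end
  Process.wait(pid)  # one worker at a time; collect pids instead for parallelism
end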

Your number of threads is suspiciously high compared to the 50 threads that you intend to run. That may be an indication that your code still suffers from the issue of stuffing Thread instances into the Array and never removing them.
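
A quick way to check that (just a sketch, assuming the threads array from the earlier code): compare the number of live threads with the number of Thread objects you are holding on to, and prune the finished ones.

puts "alive: #{Thread.list.size}, held in array: #{threads.size}"
threads.delete_if { |t| !t.alive? }  # drop finished threads so they can be GCed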

Other than that I cannot see anything obvious at first sight. Maybe it makes more sense to extend the memory dumping to report only positive deltas. An alternative is to decrease the dumping interval (e.g. dump right after fetching a single URL).
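
A sketch of such a positive-deltas dump, written as a variant of the print_class_counts method above (the global variable holding the previous snapshot is just an illustrative choice):

$prev_counts = Hash.new(0)

def print_class_deltas
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |o| counts[o.class] += 1 }
  counts.each do |klass, n|
    delta = n - $prev_counts[klass]
    puts "#{klass}: +#{delta} (now #{n})" if delta > 0
  end
  $prev_counts = counts
end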

If you want to throw more at this, you could use set_trace_func to capture invocations of methods named "new" along with the stack trace, so you get the allocation sites.
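
A sketch of that idea (note that set_trace_func slows the program down considerably; reporting the top 20 sites at exit is an arbitrary choice):

allocation_sites = Hash.new(0)

set_trace_func proc { |event, file, line, id, binding, klass|
  if (event == 'call' || event == 'c-call') && id == :new
    allocation_sites["#{file}:#{line}"] += 1
  end
}

at_exit do
  allocation_sites.sort_by { |_, n| -n }.first(20).each do |site, n|
    puts "#{n}\t#{site}"
  end
end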

Btw, what version of Ruby is this? IIRC there was a bug with Array#shift up to 1.8.6 which could also cause these effects.

Kind regards

  robert

···

On 10/22/2009 04:52 AM, Rob Doug wrote:


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

···

On Wed, Oct 21, 2009 at 10:52 PM, Rob Doug <broken.m@gmail.com> wrote:

I retest my script again on 10k+ database and seems it stops to grow
when reach 550mb size... I'm running 50 threads, so 10mb per thread. As
I understand this is still not good ?

With DRb you can even work with those multiple processes as if they were a single process.
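
A rough sketch of that idea (the port, the worker count, and the :stop marker are illustrative assumptions, reusing Common.post_it from earlier): the parent serves a work queue over dRuby, and forked worker processes pull links from it as if it were a local object.

require 'drb/drb'
require 'thread'

DRB_URI = 'druby://localhost:8787'
WORKERS = 4

queue = Queue.new
IO.read("bases/3+4.txt").split("\n").each { |l| queue << l }
WORKERS.times { queue << :stop }  # one end-of-work marker per worker

DRb.start_service(DRB_URI, queue)

WORKERS.times do
  fork do
    q = DRbObject.new_with_uri(DRB_URI)
    while (link = q.pop) != :stop
      Common.post_it(link)
    end
  end
end
Process.waitall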

Cheers

  robert

···

On 10/22/2009 05:33 AM, John W Higgins wrote:


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

John W Higgins wrote:


Your solution is good and simple, but there is one problem... I'll have 1-2 million links in my database to crawl. With only one process it will take months.
Besides, the URLs in my list are sometimes very slow, and response times can be 10-30 seconds.
I know the best practice would be multiple threads with asynchronous sockets, but that is too complicated for me right now; maybe in the next versions.

In fact - I might argue that you would be better off with a downloader and separate parser processes that worked independently via messages.

I'm using Mechanize, and as far as I know it parses the page with Nokogiri right after downloading it, so separating the downloading and parsing would make sense once I switch to net/http or sockets. Unfortunately, right now I have to use Mechanize.

···

--
Posted via http://www.ruby-forum.com/.

Your number of threads is suspiciously high compared to the 50 threads
that you intend to run. That may be an indication that your code still
suffers the issue that you stuff Thread instances into the Array and
never remove them.

Yeah, the number of threads confuses me too, but I did as you proposed:

THREADS = 50
q = SizedQueue.new(THREADS * 2)
threads = []
threads = (1..THREADS).map do
   Thread.new q do |qq|
     until qq.equal?(myLink = qq.deq)
       mutex.synchronize do
         puts $n +=1 # show number of link
         if ($n % 100) == 0 then
           print_class_counts # show classes information
         end
       end
       Common2.post_it(myLink)
     end
   end
end

So I just don't understand how the number of threads can be more than 50.

Other than that I cannot see something obvious on first sight. Maybe it
makes more sense to extend the memory dumping to dump only positive
deltas. An alternative is to decrease the dumping interval (e.g. just
after fetching just one URL).

If you want to throw more on this you could use set_trace_func to
capture invocations of methods named "new" along with the stack trace so
you get the allocation sites.

Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.

ruby 1.8.6 (2008-08-11 patchlevel 287) - which is strange, because I updated it a week ago :)

···

--
Posted via http://www.ruby-forum.com/.

Well, it seems I found a solution...
I tried to run some tests in Python as well. A simple script like the one posted previously eats memory in Python too... and the only fix I found was to use forks. I checked out forkoff, but it produced some strange bugs. This is the working code:

threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(myLink = qq.deq)
      mutex.synchronize do
        puts ($n += 1).to_s # + " : " + print_class_counts.to_s
      end
      # Fork a child process for the actual crawling; when the child exits,
      # all of the memory it allocated is released back to the OS.
      pid = fork do
        begin
          agent = WWW::Mechanize.new { |agent|
            agent.history.max_size = 1
            agent.open_timeout = 20
            agent.read_timeout = 40
            agent.user_agent_alias = 'Windows IE 7'
            agent.keep_alive = false
          }
          page = agent.get(myLink)
          puts myLink
          puts page.forms.length

          page.forms.each do |form|
          end
        rescue
        end
      end
      Process.wait(pid) # reap the child so zombie processes don't pile up
    end
  end
end

···

--
Posted via http://www.ruby-forum.com/.