Hi all,
I'm writing a web crawler with thread support, and after working through some number of links the memory usage increases more and more. When the program has just started it uses about 20 MB of memory. After crawling 150-200 links, usage is around 100 MB. When 1000 links have been crawled the program may use up to 1 GB. Please help me find out why.
links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    puts "total threads: " + Thread.list.size.to_s
    redo
  end
end

threads.each { |t| t.join() }
What in "Common" module:
1. Crawler (net/http or mechanize - I tried both, results the same)
2. HTML parser (Hpricot or Nokogir - I tried both again, with same bad
results)
so I extract some data from page and save it to the file. Nothing
special as you see.
When I start this program without threads I getting the same results
Please help, is this my fault or something wrong in the libraries ?
One culprit could be mechanize. Mechanize keeps a history of the pages
you've visited, and if you don't limit that length it will grow infinitely.
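For example, capping the history on the agent (a minimal sketch; history.max_size is the setting Rob ends up using later in this thread):

require 'mechanize'

agent = WWW::Mechanize.new
agent.history.max_size = 10   # keep at most 10 pages; by default the history grows forever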
I can't speak for your supposed "leaks" (pure ruby almost never has leaks, but it does have many sneaky ways to make valid object references that you can overlook), since I can't figure out what your code does or how it is supposed to work. threads is an array of thread objects that you only ever push on and you never clear or shift off of. Also, why oh why oh why are you doing a redo??? Overly complicated code may not be the source of your leak (and in this case it may very well be), but it certainly isn't doing you any favors.
I almost always structure my threads as a pool of workers pulling from a queue of tasks:
links = Queue.new
IO.read("bases/3+4.txt").split("\n").each do |l|
  links << l
end

threads = []
50.times do
  threads << Thread.new do
    until links.empty? do
      Common.post_it links.shift
    end
  end
end
As far as I can see you're never cleaning the threads Array. You are just appending to it, so even terminated threads stay in there, because you also do not reuse threads.
Design criticism: usually you create the threads up front and pass work to them via a Queue; then you don't need the sleep crutch and can rely on blocking instead. Also, you can make your code handle large volumes of URLs better by using a more streamed approach along the lines Ryan presented, by limiting all resources (not only the number of threads but also the size of the queue).
require 'thread'
...
THREADS = 50

q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(url = qq.deq)
      Common....
    end
  end
end

File.foreach("bases/3+4.txt") do |line|
  line.chomp!
  q.enq(URI.parse(line))
end

# enqueue the queue itself once per thread as the termination marker
threads.size.times { q.enq q }
Kind regards
robert
In a .NET Windows application a form can have resource leaks even though it is running managed code. We can use the procedure below to check whether a form has a resource leak.
Open Windows Task Manager.
Click on the Processes tab.
Select "View" in the menu and then the "Select Columns" menu item.
Check the USER Objects and GDI Objects check boxes so that they appear as columns in the process list.
In the code in your project that launches the WinForm, please ensure you have called the Dispose method. Forms implement the IDisposable interface, so their Dispose method must be called the moment they are no longer needed (Test 1). We can call Dispose explicitly, or better, have it called implicitly with the help of a using clause (Test 2).
We can call GC.Collect() after the using statement, or after the call to Dispose, for troubleshooting purposes.
Now launch the form and note the USER Objects and GDI Objects values in Task Manager. Close the form after some time and, once it is closed, note the values again. The values should have decreased; if they have increased, there is a leak in the form.
Fix the resource leak and remove the call to GC.Collect(); it is generally unnecessary to make an explicit call to GC.Collect().
I can't speak for your supposed "leaks" (pure ruby almost never has
leaks, but it does have many sneaky ways to make valid object
references that you can overlook), since I can't figure out what your
code does or how it is supposed to work. threads is an array of thread
objects that you only ever push on and you never clear or shift off
of. Also, why oh why oh why are you doing a redo??? Overly complicated
code may not be the source of your leak (and in this case it may very
well be), but it certainly isn't doing you any favors.
I almost always structure my threads as a pool of workers pulling from
a queue of tasks:
Actually, I had this kind of code at first:
links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink, doorway)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    threads = threads.delete_if { |t| t.status == false }
    GC.start
    redo
  end
end
Thank you very much, your solution is great. The code has become much clearer and looks good.
Thanks, but I've already tried this... the memory leak happens even when I use "net/http" with Hpricot.
You'll have to share more of your implementation then. You deduced that
the cause wasn't threads, but you've only shared the threaded part of
your code.
Also, have you run your program with valgrind to verify that it is in
fact a memory leak? It could just be objects sitting around waiting for GC.
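If valgrind is inconvenient, a rough in-process check is also possible. A sketch (assuming Linux and its /proc filesystem; report_rss is just a made-up helper name):

# Call this periodically; if the resident set size keeps growing even
# though you force a GC each time, live references are being retained.
def report_rss
  GC.start
  rss_kb = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s*(\d+)/, 1]
  puts "RSS after GC: #{rss_kb} kB"
end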
yeah. I don't care. it is still really obfuscated. Switch to a simple pool of workers pulling tasks out of a queue. Don't do redo for no reason. Don't do crazy stuff.
You could print out object statistics to get an idea about the source of the leak. Something along these lines:
require 'pp'

def print_class_counts
  c = Hash.new 0
  ObjectSpace.each_object(Object) do |o|
    c[o.class] += 1
  end
  pp c
end
Cheers
robert
Here is the rest of the program. I did not think there could be any leaks here, which is why I posted the code without this part.
require 'net/http'
require 'uri'
require 'hpricot'

module Common
  def Common.post_it(link)
    begin
      puts link
      agent = Net::HTTP
      html = agent.get(URI.parse(link))   # returns the response body as a String
      doc = Hpricot(html)
      forms = doc.search "//form"
      forms.each do |form|
        puts form.search("input").length
      end
    rescue => err
      puts err
      file_write_safe("logs/errorlog.txt", "#{err.to_s}|#{link}\n")
      return false
    rescue Timeout::Error
      # Timeout::Error is not a StandardError on Ruby 1.8, so it is not caught by the clause above
      file_write_safe("logs/errorlog.txt", "Request Timeout|#{link}\n")
    end
  end

  private

  def self.file_write_safe(file, message)
    File.open(file, 'a') { |f|
      f.flock(File::LOCK_EX)
      f.write(message)
      f.flock(File::LOCK_UN)
    }
  end
end
I retested my script on a 10k+ link database and it seems it stops growing when it reaches about 550 MB... I'm running 50 threads, so roughly 10 MB per thread. As I understand it, this is still not good?
Robert, I did what you proposed and here is some data. I printed all classes that have more than 200 instances.
May I ask why you need threads and that level of complication? Are you really so sensitive to speed that it matters if you simply forked processes that died after 100 downloads and then started another worker? It would seem easier to use a nice simple independent message queue and just fire up workers that grab messages and download away.
I know it's not the sexy option, but what exactly are you gaining by using threads here? It's much easier to get a nice single-threaded worker tuned up and tight than what you appear to be going through here. In fact, I might argue that you would be better off with a downloader and separate parser processes that worked independently via messages.
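A minimal sketch of that forked-worker idea (it reuses Common.post_it and the link file from the original code; the chunk size is arbitrary and fork assumes a Unix platform):

links = IO.read("bases/3+4.txt").split("\n")

until links.empty?
  chunk = links.slice!(0, 100)   # take the next 100 links
  pid = fork do
    # short-lived worker: crawls its chunk and exits,
    # returning all of its memory to the operating system
    chunk.each { |link| Common.post_it(link) }
  end
  Process.wait(pid)   # sequential here; start several children at once for parallelism
end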
Your number of threads is suspiciously high compared to the 50 threads that you intend to run. That may be an indication that your code still suffers from the issue of stuffing Thread instances into the Array and never removing them.
Other than that I cannot see anything obvious at first sight. Maybe it makes more sense to extend the memory dumping to print only positive deltas (see the sketch below). An alternative is to decrease the dumping interval (e.g. dump right after fetching a single URL).
If you want to throw more at this you could use set_trace_func to capture invocations of methods named "new" along with the stack trace, so you get the allocation sites.
Btw, what version of Ruby is this? IIRC there was a bug with Array#shift up to 1.8.6 which could also cause these effects.
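A sketch of the positive-delta idea, along the lines of the print_class_counts helper from the earlier post (the $last_counts global is only for illustration):

$last_counts = Hash.new(0)

# Print only classes whose instance count has grown since the last snapshot.
def print_class_deltas
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |o| counts[o.class] += 1 }
  counts.each do |klass, n|
    delta = n - $last_counts[klass]
    puts "#{klass}: +#{delta} (now #{n})" if delta > 0
  end
  $last_counts = counts
end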
Kind regards
robert
With DRb you can even work with those multiple processes as if they were in a single process.
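For illustration, a minimal DRb sketch (the URI, port and file name are just placeholders): the parent process exposes a Queue of links over DRb, and each worker process pulls from it.

require 'drb'
require 'thread'

# parent process: share a queue of links over DRb
queue = Queue.new
File.foreach("bases/3+4.txt") { |line| queue << line.chomp }
DRb.start_service("druby://localhost:9000", queue)
DRb.thread.join   # keep the server alive while workers drain the queue

# worker process (a separate script, started as many times as you like):
#   require 'drb'
#   DRb.start_service
#   links = DRbObject.new_with_uri("druby://localhost:9000")
#   loop { Common.post_it(links.pop) }   # pop blocks until a link is available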
Cheers
robert
···
On 10/22/2009 05:33 AM, John W Higgins wrote:
May I ask why you need threads and that level of complication? Are you really so sensitive to speed that it matters if you simply forked processes that died after 100 downloads and then started another worker? It would seem easier to use a nice simple independent message queue and just fire up workers that grab messages and download away.
I know it's not the sexy option, but what exactly are you gaining by using threads here? It's much easier to get a nice single-threaded worker tuned up and tight than what you appear to be going through here.
Your solution is good and simple, but there is one problem... I'll have 1-2 million links in my database to crawl. With only one process it would take months. Besides, the URLs in my list are sometimes very slow, and the response time can be up to 10-30 seconds.
I know the best practice would be multiple threads with asynchronous sockets, but that is too complicated for me right now, maybe in the next versions.
In fact, I might argue that you would be better off with a downloader and separate parser processes that worked independently via messages.
I'm using mechanize, and as far as I know it parses the page with Nokogiri right after downloading it, so separating downloading and parsing would only make sense once I start to use net/http or sockets. Unfortunately right now I have to use mechanize.
Your number of threads is suspiciously high compared to the 50 threads
that you intend to run. That may be an indication that your code still
suffers the issue that you stuff Thread instances into the Array and
never remove them.
Yeah, the number of threads confuses me too, but I did as you proposed:
THREADS = 50

q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(myLink = qq.deq)
      mutex.synchronize do
        puts $n += 1                # show the number of the link
        if ($n % 100) == 0 then
          print_class_counts        # show class statistics
        end
      end
      Common2.post_it(myLink)
    end
  end
end
So I just don't understand how the number of threads can be more than 50.
Other than that I cannot see something obvious on first sight. Maybe it
makes more sense to extend the memory dumping to dump only positive
deltas. An alternative is to decrease the dumping interval (e.g. just
after fetching just one URL).
If you want to throw more on this you could use set_trace_func to
capture invocations of methods named "new" along with the stack trace so
you get the allocation sites.
Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.
ruby 1.8.6 (2008-08-11 patchlevel 287) - which is strange, because I updated it a week ago.
Well, it seems I found a solution...
I tried to run some tests in Python as well. A simple script like the one posted previously eats memory in Python too... and the only way out I found was to use forks. I checked out forkoff, but it produced some strange bugs. This is the working code:
threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(myLink = qq.deq)
      mutex.synchronize do
        puts ($n += 1).to_s # + " : " + print_class_counts.to_s
      end
      fork do # <----- you need to fork it; when the child exits, its memory is released
        begin
          agent = WWW::Mechanize.new { |agent|
            agent.history.max_size = 1
            agent.open_timeout = 20
            agent.read_timeout = 40
            agent.user_agent_alias = 'Windows IE 7'
            agent.keep_alive = false
          }
          page = agent.get(myLink)
          puts myLink
          puts page.forms.length
          page.forms.each do |form|
          end
        rescue
        end
      end
      Process.wait # reap the child here, otherwise zombie processes accumulate
    end
  end
end