Your solution is good and simple, but there is one problem... I'll have 1-2 million links in my database to crawl. With only one process it will take months.
Besides, some URLs in my list are very slow, and response time can be as much as 10, 20, or 30 seconds.
I know the best practice would be multiple threads with asynchronous sockets, but that is too complicated for me right now; maybe in the next versions.
You could create multiple such concurrent connections. And if your OS
supports it, EventMachine will make use of kqueue or epoll behind the
scenes for efficient handling of large numbers of I/O handles.
This is a simplistic example, but should work in principle:
require 'eventmachine'

sites = %w(aaa.com bbb.com ccc.com ddd.com eee.com fff.com) # etc...

# all EventMachine I/O has to happen inside the reactor loop
EM.run do
  conns = sites.map { |dn| EM::Protocols::HttpClient2.connect(dn, 80) }

  conns.each do |conn|
    req = conn.get('/')
    req.callback do |response|
      p(response.status)
      p(response.headers)
      p(response.content)
    end
  end
end
I'm sure he didn't mean (only) one process but several processes, each
running a subset of the input space. That would also have the benefit
of utilizing multiple processing cores on one machine or even across
multiple machines:
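Something along these lines, for example (just a rough sketch; the file name links.txt, the worker count and the bare open-uri fetch are placeholders, not taken from the posted code):

require 'open-uri'

WORKERS = 4                                          # assumed worker count
links = File.readlines('links.txt').map(&:strip)     # assumed input: full URLs, one per line

slice_size = [(links.size / WORKERS.to_f).ceil, 1].max
pids = links.each_slice(slice_size).map do |subset|
  fork do
    subset.each do |url|
      begin
        open(url).read   # fetch and discard; real code would parse the page
      rescue StandardError
        # skip URLs that time out or fail
      end
    end
  end
end

pids.each { |pid| Process.waitpid(pid) }             # wait for every worker to finish

Each child works through its own slice of the list independently, so the slices can just as well be handed to separate machines.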
On Thu, Oct 22, 2009 at 1:12 PM, Rob Doug <broken.m@gmail.com> wrote:
John W Higgins wrote:
> Evening Rob,
>
> May I ask why you need threads and that level of complication? Are you
> really that sensitive towards speed that it really matters if you simply
> forked processes that died after 100 downloads and just started another
> worker? It would seem to be easier to use a nice simple independent
> message queue and just fire up workers that grab messages and download
> away. I know it's not the sexy option, but what exactly are you gaining
> by using threads here? Much easier to get a nice single-threaded worker
> tuned up and tight than what you appear to be going through here.
Your solution is good and simple, but there is one problem... I'll have
1-2 million links in my database to crawl. With only one process it will
take months.
Besides, some URLs in my list are very slow, and response time can be as
much as 10, 20, or 30 seconds.
I know the best practice would be multiple threads with asynchronous
sockets, but that is too complicated for me right now; maybe in the next
versions.
You create threads and fork a process for every single item to process. This has some consequences:
- your threads will eat all the entries in the queue very quickly
- you will get a large number of processes immediately
In this setup you need neither threads nor a queue. Basically you just need to iterate the input list and fork off a process for every item you meet. However, then you have no control over concurrency and your CPU will suffer. With the setup you presented you should at least have the threads wait for their processes to return, so a single thread does not fork off more than one process at a time.
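For illustration, that fork-then-wait pattern could look roughly like this (a sketch only; process_link, THREADS and the links variable are placeholders, not from the posted code):

require 'thread'

THREADS = 10                        # placeholder concurrency level
queue   = Queue.new
links.each { |l| queue << l }       # assumes `links` holds the URLs
THREADS.times { queue << :done }    # one stop marker per thread

threads = Array.new(THREADS) do
  Thread.new do
    while (link = queue.pop) != :done
      pid = fork { process_link(link) }  # hypothetical helper: child fetches one URL, then exits
      Process.wait(pid)                  # thread blocks until its child is gone,
                                         # so at most THREADS children exist at any time
    end
  end
end

threads.each(&:join)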
Kind regards
robert
On 10/23/2009 11:28 PM, Rob Doug wrote:
Well, it seems I found a solution...
I tried to run some tests in Python as well. The simple script posted previously eats memory in Python too... and the only way around it I found is to use forks. I checked out forkoff, but it produced some strange bugs. This is the working code:
threads = (1..THREADS).map do
  Thread.new(q) do |qq|
    until qq.equal?(myLink = qq.deq)   # the queue itself is pushed as the stop marker
      mutex.synchronize do
        puts ($n += 1).to_s # + " : " + print_class_counts.to_s
      end
      fork do # <----- you need to fork; when the child exits, its memory is released
        begin
          agent = WWW::Mechanize.new do |a|
            a.history.max_size   = 1
            a.open_timeout       = 20
            a.read_timeout       = 40
            a.user_agent_alias   = 'Windows IE 7'
            a.keep_alive         = false
          end
          page = agent.get(myLink)
          puts myLink
          puts page.forms.length
          page.forms.each do |form|
            # process each form here
          end
        rescue
          # ignore fetch/parse errors for this link
        end
      end
    end
  end
end
Well, it seems the error is not in my code. I made a simple version of
the crawler, so there just cannot be a mistake of mine in it. But the
program still has a memory leak... you can check yourself, here is the
code (after 2000 random URLs memory usage is >200 MB):
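Something along these lines (a minimal reconstruction just to show the shape of such a single-threaded Mechanize loop; the input file and the output are placeholders, not the exact code that was posted):

require 'mechanize'   # 2009-era gem, exposing the WWW::Mechanize class

urls = File.readlines('random_urls.txt').map(&:strip)  # assumed input file

agent = WWW::Mechanize.new do |a|
  a.history.max_size   = 1
  a.open_timeout       = 20
  a.read_timeout       = 40
  a.user_agent_alias   = 'Windows IE 7'
  a.keep_alive         = false
end

urls.each_with_index do |url, i|
  begin
    page = agent.get(url)
    puts "#{i}: #{url} (#{page.forms.length} forms)"
  rescue StandardError
    # skip URLs that time out or fail to parse
  end
end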
You create threads and fork a process for every single item to process.
This has some consequences:
- your threads will eat all the entries in the queue very quickly
- you will get a large number of processes immediately
In this setup you need neither threads nor a queue. Basically you just
need to iterate the input list and fork off a process for every item you
meet. However, then you have no control over concurrency and your CPU
will suffer. With the setup you presented you should at least have the
threads wait for their processes to return, so a single thread does not
fork off more than one process at a time.
Sure, you're right. I forgot to write it here, but in my own code there
is a Process.wait after each fork.