I am implementing a very simple script to ping web servers or services (to monitor how our environment is functioning).
Some production URLs run on more than one host. Therefore, I start a thread for each separate URL.
The function run in my threads is:
def doPing(uri_string, probe)
  s = uri_string
  while true
    begin
      timeout(@seconds_before_timeout) do |timeout_length|
        start = Time.new
        begin
          open(s) do |result|
            if result.status[0] != "200"
              probe.addToLogFile([s,'ERR',0,result.status[1]])
            else
              probe.addToLogFile([s,'OK',Time.new - start,''])
            end
          end
        rescue Exception
          probe.addToLogFile([s,'ERR',0,$!])
        end
      end
    rescue Timeout::Error
      probe.addToLogFile([s,'ERR',0,'timeout'])
    end
    sleep(@seconds_between_ping)
  end
end
However, there is a problem: I also want to measure the round-trip time of each ping (these are only simple pings for the time being). As you can see, I take the local time before and after the call. But this doesn't work with threads: since the process is shared among the threads, the measured time will depend on the number of threads I am running and will not reflect the actual time the ping takes.
I can't make the code between the two timestamps a critical section reserved for one thread, because if open-uri blocks, it will prevent the other threads from pinging their URLs in the meantime.
Any ideas? Maybe use processes instead of threads?
Thanks for any hints!
--
Alexander Lamb
Service d'Informatique Médicale
Hôpitaux Universitaires de Genève
Alexander.J.Lamb@sim.hcuge.ch
+41 22 372 88 62
+41 79 420 79 73
On Oct 21, 2005, at 3:01 PM, Robert Klemme wrote:
Maybe you can exploit one of the response headers. Chances are that there
is a timestamp somewhere. Then you *only* need to synchronize the clocks on
your machine and on the servers...
Or you switch to a single-threaded solution. I don't know your ping
interval, but if you don't need to ping too often and don't have too many
servers, that should be OK. You can create a simple scheduler that always
picks the URL whose next ping is due soonest...
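A minimal sketch of such a single-threaded scheduler, assuming Ruby as in the original post. The names PingSchedule and run_schedule are hypothetical (they are not from the thread), and the block passed to run_schedule stands in for the actual timed ping:

```ruby
# Hypothetical helper: keeps one "next due" time per URL.
class PingSchedule
  def initialize(urls, interval, now = Time.now)
    @interval = interval
    @due = {}
    urls.each { |url| @due[url] = now } # every URL is due immediately
  end

  # Returns [url, due_time] for the URL whose next ping is due soonest.
  def next_due
    @due.min_by { |_url, time| time }
  end

  # Records that url was pinged at `now` and schedules its next ping.
  def reschedule(url, now = Time.now)
    @due[url] = now + @interval
  end
end

# Single-threaded loop: wait until the next URL is due, hand it to the
# block (which would perform and time the ping), then reschedule it.
def run_schedule(schedule, rounds)
  rounds.times do
    url, due = schedule.next_due
    wait = due - Time.now
    sleep(wait) if wait > 0
    yield url
    schedule.reschedule(url)
  end
end
```

Because only one ping runs at a time, each measurement is not disturbed by the other pings, at the cost that a hanging URL delays the ones queued behind it.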
Alexander Lamb <Alexander.J.Lamb@sim.hcuge.ch> writes:
However, this is a problem. Indeed, I want also to measure the time
(round-trip) it takes for the ping (these are only simple pings for
the time being). As you can see I get the local time before and after
the call.
But this doesn't work with threads. Indeed, since the process is
shared among threads, the time will be dependent on the number of
threads I am running and not a correct view of the actual time it
takes to ping.
Using processes instead of threads would have the same problem. You
still can't guarantee that your execution path is not suspended
between the start time and the end time; the CPU is still shared among
processes.
You'd have to use an OS that can give you this guarantee. The catch
is, there is no port of Ruby to such an OS yet.
But you probably don't need a guarantee; 'good enough' is good
enough. In that case, there is little benefit in using
processes, and personally, I wouldn't bother to do so for a simple
monitoring program.
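To illustrate why 'good enough' may be reachable with plain threads: a thread that is blocked waiting uses almost no CPU, so several waits can overlap without inflating each other much. A rough sketch, assuming a Ruby version that provides Process.clock_gettime (2.1 or later); elapsed_seconds and concurrent_timings are hypothetical helper names, and sleep merely stands in for the blocking network call:

```ruby
# Returns the wall-clock seconds the block took, measured with the
# monotonic clock so that system clock adjustments do not matter.
def elapsed_seconds
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Times the same simulated wait in several threads at once and
# returns one measurement per thread.
def concurrent_timings(thread_count, wait)
  threads = Array.new(thread_count) do
    Thread.new { elapsed_seconds { sleep(wait) } } # sleep = stand-in for a ping
  end
  threads.map(&:value)
end
```

On an otherwise idle machine, each of the timings from concurrent_timings(3, 0.05) would be expected to be close to 0.05 seconds rather than 0.15, though nothing guarantees it.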
I can't define the piece of code between the two times as critical
Even if you define it, you can't count on the Ruby process not being
suspended by some external factor.
The fundamental problem is simple: limited resources. As long as there
are multiple executions wanting access to the same resources (CPU
time, network time), there is bound to be some contention.
Any idea? Maybe use processes instead of threads?
For critical monitoring services, one uses an OS that can guarantee
some amount of CPU time within some duration to a process. Examples of
this are found in many places, such as your car's ABS controller and
your local neighbourhood's nuclear power station.
I could go single-threaded (since indeed I am doing roughly one ping every 30 seconds). However, if the first URL I try hangs (until a timeout, for example), I am pushing back the time at which I will ping the second URL. Logically, I would need to do something like "ping each URL one after another unless one of them seems to take longer, and then detach a thread to wait for the answer".
For the time being I will test forking a process.
Alex
Robert Klemme wrote:
You could also have a controller thread that watches your single testing
thread. If the testing takes longer than n seconds (where n << timeout),
it sets a flag for the current testing thread (with a thread-local
variable, for example) and starts a new tester thread.
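One way this controller idea could look, as a sketch only. start_tester and run_with_controller are hypothetical names, the jobs are plain callables standing in for pings, and `patience` plays the role of n:

```ruby
require 'thread'

# Starts a tester thread that runs jobs from the queue one at a time.
# It stops when the queue is empty or once the controller has flagged it.
def start_tester(jobs, results)
  Thread.new do
    until Thread.current[:detached]
      begin
        job = jobs.pop(true) # non-blocking; raises ThreadError when empty
      rescue ThreadError
        break
      end
      Thread.current[:started_at] = Time.now
      results << job.call
      Thread.current[:started_at] = nil
    end
  end
end

# Controller: polls the newest tester. If its current job has been running
# longer than `patience` seconds, flag it (it finishes that job and then
# stops) and start a fresh tester for the remaining jobs.
def run_with_controller(jobs, patience, poll = 0.01)
  results = Queue.new
  testers = [start_tester(jobs, results)]
  while testers.last.alive?
    started = testers.last[:started_at]
    if started && Time.now - started > patience
      testers.last[:detached] = true
      testers << start_tester(jobs, results)
    end
    sleep(poll)
  end
  testers.each(&:join)
  Array.new(results.size) { results.pop }
end
```

In the normal case only one tester runs, so timings are undisturbed; a second thread appears only while one job is hanging.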
On Oct 21, 2005, at 3:26 PM, Robert Klemme wrote:
Yet another idea: you make the testing semi-critical. When a thread
starts testing, it stores a timestamp somewhere. Every other thread checks
whether the timestamp is set and is at most n seconds old. If it's
older, the thread replaces the timestamp with its own and goes ahead. If
we're still within the n-second range, it goes on sleeping.
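A small sketch of this semi-critical section, under the assumption that a Mutex guards only the shared timestamp and is never held during the ping itself. The class name SoftLock and its methods are hypothetical:

```ruby
# Hypothetical helper: a claim that expires on its own after max_hold seconds.
class SoftLock
  def initialize(max_hold)
    @max_hold = max_hold # the "n seconds" after which a claim is stale
    @claimed_at = nil
    @mutex = Mutex.new   # protects @claimed_at only
  end

  # Returns the claim timestamp if the caller may start testing now,
  # or nil if another thread's claim is still fresh.
  def try_claim(now = Time.now)
    @mutex.synchronize do
      if @claimed_at.nil? || now - @claimed_at > @max_hold
        @claimed_at = now
      else
        nil
      end
    end
  end

  # Clears the timestamp, but only if it is still the caller's own claim,
  # so a stale owner cannot release a claim that was taken over.
  def release(claimed_at)
    @mutex.synchronize { @claimed_at = nil if @claimed_at == claimed_at }
  end
end
```

A ping thread would sleep briefly and retry until try_claim returns a timestamp, run its ping, and then call release with that timestamp. A hanging ping blocks the others for at most n seconds.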
Well, as you said, "good enough" is OK since I need to test a real-life situation. For example, I have several apps calling some web services on some given servers. Obviously, if many apps call at the same time, the timing will be different, as it is for my probe in Ruby. However, I had the feeling that when I start my two or three threads at exactly the same time to ping some servers, the timing result is not really correct. With processes, of course, several processes will fight for resources, but I am closer to a real-life situation.
However, reading your replies, I think a good approximation is to offset each ping slightly, by 3-4 seconds (i.e. not starting all the threads at the same time). Then there is a very high probability, even after several hours of pings, that only one ping thread is running at a time, which gives me a good approximation of the time taken.
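A sketch of the staggered start, with hypothetical helper names (start_offsets, start_staggered); the block stands in for the per-URL ping loop. Note that if each loop sleeps a fixed interval after its ping completes, the offsets can drift over time, so this only makes overlaps unlikely, not impossible:

```ruby
# Returns the start delay in seconds for each URL: 0, offset, 2 * offset, ...
def start_offsets(urls, offset)
  urls.each_with_index.map { |url, index| [url, index * offset] }.to_h
end

# Starts one thread per URL. Each thread waits for its own offset and then
# hands the URL to the block, which would run the ping loop.
def start_staggered(urls, offset, &ping_loop)
  start_offsets(urls, offset).map do |url, delay|
    Thread.new do
      sleep(delay)
      ping_loop.call(url)
    end
  end
end
```

With three URLs and an offset of 3 seconds, the threads would start at 0, 3 and 6 seconds.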
Slightly off topic: what I am trying to do is monitor the way our systems work. We are very distributed and need to set up alarms if some service goes down. A little bit like products such as BigBrother, but more application-oriented. I saw on the agenda of Euruko05:
Using Ruby to monitor enterprise software from Sven C. Koehler.
There are no slides or description, but it could be something similar to what I am trying to do (which we actually already did in a previous version in Java, but I wanted to simplify it and make it more customizable). Can someone give me pointers, or maybe even Mr. Koehler if he is on this list?
Thanks,
Alexander Lamb
This thread seems to be thrashing around a bit.
By what criteria do you establish how your environment is functioning? Once you know that, you can look into how to monitor those criteria alone and no others.
A 'ping' is really testing the network and web server responsiveness (assuming the ping task is simple). You can't separate those. Establish a baseline and compare to that. I'd think that if the network or server is stalling for any reason, you'd like to know, and when comparing to a baseline you have a chance of detecting that.
The application responsiveness is probably best measured on the server by the application itself, possibly by recording the time from the first touch on the app through to the close or flush of the socket. You'd have to ask the application to report on this.
Yes, you are right. This ping is really only the first building block (but you can't imagine how many times we had problems because of a simple Apache server which didn't restart gracefully at midnight :-) while the apps themselves were running fine... just nobody could get to them, which is rather annoying in a hospital).
That's why we have a monitoring system which monitors not only the Apache servers but some key web services as well. To monitor web services, we either do some kind of dummy search (e.g. give me the list of patients in this unit) or implement a specific service which tests a few things and replies with some information about the speed of the transaction and other things.
What you saw in my post is the first block (a simple HTTP ping probe) of the new system I wish to develop in Ruby. I will then have a sort of master process which will consolidate the states of the various probes and display the situation (how is still to be defined). If some situation seems critical, then some kind of alert will be escalated.
This brings me to another question (sorry, I am a beginner in Ruby): is there a rule engine available in Ruby where you could express rules a bit like with Jess in Java?
I most certainly can imagine it... well, actually, I can rely on memory
Okay, still, I'd *strongly* suggest the baseline thing (given my experience) and establishing service levels too. I've written several monitoring systems, some quite large, sometimes while on a team that built the hardware to run the monitoring system. Comparison to expectations or historical records, while tricky to get working at first, works very well in the end.
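A rough sketch of what comparing against a historical baseline could look like. The class name Baseline, the window size and the factor of 3 are all assumptions chosen for illustration, not values from this thread:

```ruby
# Hypothetical helper: keeps the last `size` response times per URL and
# flags a new sample that is more than `factor` times the historical mean.
class Baseline
  def initialize(size = 20, factor = 3.0)
    @size = size
    @factor = factor
    @samples = Hash.new { |hash, url| hash[url] = [] }
  end

  # Mean of the stored samples for url, or nil if there is no history yet.
  def mean(url)
    samples = @samples[url]
    return nil if samples.empty?
    samples.sum / samples.size.to_f
  end

  # Returns true when `seconds` is unusually slow compared to the baseline,
  # then adds the sample to the history (dropping the oldest if full).
  def record(url, seconds)
    baseline = mean(url)
    slow = !baseline.nil? && seconds > baseline * @factor
    history = @samples[url]
    history << seconds
    history.shift if history.size > @size
    slow
  end
end
```

The probe would call record with each measured round-trip time and raise an alert only when it returns true, so the alarm adapts to what is normal for each URL.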
I don't know personally. But your rules might be overly 'crisp' and so not very stable (think chaos, and tipping points). I'd have a look around for a simple fuzzy reasoning system (an unfortunate phrase, but *way* more stable).
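To make the 'crisp' versus fuzzy distinction concrete, here is one possible reading as a sketch: instead of a rule that flips at a single threshold, a response time gets a degree of "slowness" between 0 and 1. The function names and the thresholds are hypothetical:

```ruby
# Degree (0.0..1.0) to which a response time counts as "slow":
# 0 at or below `ok`, 1 at or above `bad`, linear in between.
def slowness(seconds, ok, bad)
  return 0.0 if seconds <= ok
  return 1.0 if seconds >= bad
  (seconds - ok) / (bad - ok).to_f
end

# Combines several degrees into one alert level. Taking the maximum is a
# common choice for a fuzzy OR; an empty list gives 0.0.
def alert_level(degrees)
  degrees.max || 0.0
end
```

A response time hovering around a threshold then moves the alert level gradually instead of toggling an alarm on and off.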
Cheers,
Bob
···
On Oct 21, 2005, at 10:09 AM, Alexander Lamb wrote: