Great. Run it for us and let us know how we do.
James Edward Gray II
···
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Anyway -- i'd like to see a 100000 lookups comparison *hehe*
Ok, here's the result of mine:
mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s

Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
  f = ip.split(/\./).join "/"
  puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not?
LOL! Nice...
Pretty clever.
I bet with the right prep, this could even be a pretty viable approach. Instead of building a file for each address you could create a directory structure for the hexadecimal representations of each piece of the address. The final layer could be handled as you have here or with a search through a much smaller file.
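That directory-per-hex-component idea could be sketched roughly like this (purely illustrative: store, lookup, and DB_ROOT are names I made up, and a real build step would expand the CSV ranges instead of storing single addresses):

```ruby
require "fileutils"

DB_ROOT = "ipdb"  # hypothetical root of the on-disk index

# Write one leaf file per address: three hex directory levels, one leaf.
def store(ip, country)
  a, b, c, d = ip.split(".").map { |o| "%02x" % o.to_i }
  dir = File.join(DB_ROOT, a, b, c)
  FileUtils.mkdir_p(dir)
  File.write(File.join(dir, d), country)
end

# Look up an address by walking the same path; a missing file means unknown.
def lookup(ip)
  a, b, c, d = ip.split(".").map { |o| "%02x" % o.to_i }
  path = File.join(DB_ROOT, a, b, c, d)
  File.exist?(path) ? File.read(path) : "Unknown"
end

store("195.135.211.255", "UK")
puts lookup("195.135.211.255")  # UK
puts lookup("5.1.1.1")          # Unknown
```

The final directory level could instead hold one small per-/24 file to be searched linearly, as suggested above.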
Indeed, I was thinking last night about preprocessing the data
into an on-disk hash table. I was thinking about a flat file
at the time, but one could use a subdirectory technique like the
above... using hex components of the hash value to index through
a couple subdirs to reach a leaf file containing one or more
records.
Another thing I'd like to try but won't have time for, is to
use ruby-mmap somehow. (Preprocess the data into a flat binary
file, then use ruby-mmap when performing the binary search.)
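A rough sketch of that flat-binary-file approach, with a plain IO#seek binary search standing in for ruby-mmap (the 10-byte record layout and the build/lookup names are my own assumptions, not Bill's actual code):

```ruby
# Convert dotted-quad notation to a 32-bit integer.
def ip_to_i(s)
  s.split(".").inject(0) { |acc, octet| acc * 256 + octet.to_i }
end

RECORD = 10  # 4 bytes "from" + 4 bytes "to" + 2 bytes country code

# Preprocess: write sorted fixed-width records to a flat binary file.
def build(ranges, path)
  File.open(path, "wb") do |f|
    ranges.sort_by { |from, _, _| from }.each do |from, to, cc|
      f.write([from, to].pack("NN") + cc)
    end
  end
end

# Lookup: binary search by seeking directly to record boundaries.
def lookup(path, ip)
  n = File.size(path) / RECORD
  File.open(path, "rb") do |f|
    lo, hi = 0, n - 1
    while lo <= hi
      mid = (lo + hi) / 2
      f.seek(mid * RECORD)
      from, to = f.read(8).unpack("NN")
      return f.read(2) if ip >= from && ip <= to
      if ip < from then hi = mid - 1 else lo = mid + 1 end
    end
  end
  "Unknown"
end

build([[ip_to_i("68.97.89.0"), ip_to_i("68.97.89.255"), "US"],
       [ip_to_i("84.191.4.0"), ip_to_i("84.191.4.255"), "DE"]], "ip.bin")
puts lookup("ip.bin", ip_to_i("68.97.89.187"))  # US
```

With mmap the seek/read pair would become plain indexing into the mapped string, but the search logic stays the same.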
Anyway thanks for the fun quiz.
Regards,
Bill
From: "James Edward Gray II" <james@grayproductions.net>
On Sep 19, 2007, at 10:15 AM, Mark Thomas wrote:
Wow, I'm impressed. Can't wait to see that code!
James Edward Gray II
On Sep 14, 2007, at 2:35 PM, Simon Kröger wrote:
James Edward Gray II wrote:
On Sep 14, 2007, at 2:20 PM, Simon Kröger wrote:
Ruby Quiz wrote:
[...]
$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s

Is an 'initialisation run' allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

My script does need an initialization run, yes. I don't see any harm
in paying a one-time penalty to set things up right.

Is it motivating or a spoiler to post timings?
Motivating, definitely.
James Edward Gray II
Ok, my script does not need any initialization; it uses the file
IpToCountry.csv exactly as downloaded.

----------------------------------------------------------------
$ ruby -v
ruby 1.8.4 (2005-12-24) [i386-cygwin]

$ time ruby quiz139.rb 68.97.89.187
US

real 0m0.047s
user 0m0.030s
sys 0m0.030s

$ time ruby quiz139.rb 84.191.4.10
DE

real 0m0.046s
user 0m0.046s
sys 0m0.015s
----------------------------------------------------------------
I think the timings of the scripts are not a good index; it all depends on what hardware/OS you are running them on.
If we want to use speed as an index we should probably have J.E. compare them all on the same machine.
Maybe we could also write a ruby script that runs all the entry scripts and time them, and that could be another ruby quiz which will also be voted on speed and then we could write a ruby script to time those entries an then we could .... Just ignore this paragraph
Diego Scataglini
On Sep 14, 2007, at 3:35 PM, Simon Kröger <SimonKroeger@gmx.de> wrote:
----------------------------------------------------------------

This is on a Pentium M 2.13GHz laptop with 2GB RAM and a rather slow HD.
cheers
Simon
Ok, my script does not need any initialization, it uses the file
IpToCountry.csv exactly as downloaded.
We probably did something similar. Mine also works on the
unmodified IpToCountry.csv file.
$ time ruby 139_ip_to_country.rb 67.19.248.74 70.87.101.66 205.234.109.18 217.146.186.221 62.75.166.87
US
GB
DE
real 0m0.122s
user 0m0.015s
sys 0m0.000s
(ruby 1.8.4 (2005-12-24) [i386-mswin32], timed from cygwin bash shell,
2GHz athlon64, winxp.)
I don't think the timings are very accurate on this system. It
didn't change much whether I looked up one IP or five.
. . . Looking up 80 IPs on one command line resulted in:
real 0m0.242s
user 0m0.015s
sys 0m0.016s
Regards,
Bill
From: "Simon Kröger" <SimonKroeger@gmx.de>
"Bill Kelly" <billk@cts.com> wrote in message news:00b301c7f897$a8c9dc20$6442a8c0@musicbox...
From: "Eugene Kalenkovich" <rubify@softover.com>
BTW, all solutions already submitted will lie for subnets 1, 2 and 5.
Most (but not all) will break on out-of-bounds submissions (256.256.256.256 or 0.0.0.-1, the latter if comments are stripped out).

Hi, could you clarify what is meant by lying about subnets
1, 2, and 5?

Check what country 5.1.1.1 is. If you get any valid answer, this answer is a lie.
Ah, OK. I get:
ruby 139_ip_to_country.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
ZZ
(1.1.1.1 not found)
(2.1.1.1 not found)
US
(5.1.1.1 not found)
Regards,
Bill
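The out-of-range inputs Eugene mentions (256.256.256.256, 0.0.0.-1) can be screened out with a small guard before doing any lookup; valid_ip? is just an illustrative name, not part of any posted solution:

```ruby
# Accept only four dot-separated decimal octets in the range 0..255.
def valid_ip?(s)
  parts = s.split(".", -1)
  parts.size == 4 && parts.all? { |p| p =~ /\A\d{1,3}\z/ && p.to_i <= 255 }
end

puts valid_ip?("68.97.89.187")     # true
puts valid_ip?("256.256.256.256")  # false
puts valid_ip?("0.0.0.-1")         # false
```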
From: "Eugene Kalenkovich" <rubify@softover.com>
$ ruby quiz139.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
0.1.1.1 ZZ
1.1.1.1 ??
2.1.1.1 ??
3.1.1.1 US
4.1.1.1 US
5.1.1.1 ??
:}
gegroet,
Erik V.
James Edward Gray II schrieb:
Anyway -- i'd like to see a 100000 lookups comparison *hehe*
Great. Run it for us and let us know how we do.
Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory.
First Table: Compilation (Table Packing)
real user sys
Adam[*] 0.005 0.002 0.003
Luis 0.655 0.648 0.007
James[**] 21.089 18.142 0.051
Jesse 1.314 1.295 0.020
Matthias 0.718 0.711 0.008
[*]: Adam does not perform real compression; he only computes two boundary offsets to search within the original .csv, which he subsequently uses.
[**]: Upon rebuild, James fetches the .csv source from the web, which makes his solution look slow. This timing depends heavily on your--actually my--ISP speed.
Second Table: Run (100_000 Addresses)
real user sys
Adam 24.943 22.993 1.951
Bill 35.080 33.029 2.051
Luis 16.149 13.706 2.444
Eugene[*] 52.307 48.689 3.620
Eugene 65.790 61.984 3.805
James 14.803 12.449 2.356
Jesse 14.016 12.343 1.673
Jesus_a[**]
Jesus_b[**]
Kevin[***]
Matt_file 6.192 5.332 0.859
Matt_str 3.704 3.699 0.005
Simon 69.417 64.679 4.706
Justin 56.639 53.292 3.345
steve 63.659 54.355 9.294
[*]: Eugene already implements a random generator, but to make things fair, I changed his implementation to read the same values from $stdin as all the other implementations. The starred version uses his own random generator and runs outside the competition; the starless version is my modified one.
[**]: O Jesus :), I can't make your FasterCSV version (a) run, and the direct parsing in the later version you sent breaks when it comes to detecting the commented lines in the first part of the file. I couldn't manage to make it run, sorry.
[***]: Although I managed to write the missing SQL insertion script and even added separate indexes for the address limits, Kevin's SQLite3 version simply took too long; I estimated a run time of over an hour. I am willing to replay the test if someone tells me how to speed things up with SQLite3 to make it competitive.
Note that I slightly changed all implementations to loop over $stdin.each instead of reading ARGV or just ARGV[0]. Each script was run only once and was supplied all addresses in that single run. The test set consisted of 100_000 freshly generated random IP addresses written to a file and supplied using the following syntax:
$ (time ruby IpToCountry.rb <IP100k > /dev/null) 2>100k.time
I didn't check the output of the scripts beyond verifying one address up front, mainly because the scripts all use different output formats; my tests were just for measuring performance.
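For anyone who wants to reproduce the run, an input file of that shape can be generated in a few lines (the IP100k name matches the command above; the generator itself is my own sketch, not the one used for these tests):

```ruby
# Write 100,000 random dotted-quad addresses, one per line.
File.open("IP100k", "w") do |f|
  100_000.times { f.puts Array.new(4) { rand(256) }.join(".") }
end
```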
Just for Info:
$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
$ cat /etc/sabayon-release
Sabayon Linux x86-64 3.4
- Matthias
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Wow, I'm impressed. Can't wait to see that code!
Thanks.
Because startup time started to dominate the benchmark:
----------------------------------------------------------------
$ time ruby quiz139.rb 68.97.89.187 84.191.4.10 80.79.64.128 210.185.128.123
202.10.4.222 192.189.119.1
US
DE
RU
JP
AU
EU
real 0m0.078s
user 0m0.046s
sys 0m0.031s
----------------------------------------------------------------
and by the way: thanks for telling me such a database exists!
cheers
Simon
I think the timings of the scripts are not a good index; it all depends on what hardware/OS you are running them on.
If we want to use speed as an index we should probably have J.E. compare them all on the same machine.
The way I see it, we are getting a rough idea of speeds, not exact numbers. Both scripts timed so far seem able to answer the question in under a second on semi-current hardware. Good enough for me.
Maybe we could also write a ruby script that runs all the entry scripts and time them, and that could be another ruby quiz which will also be voted on speed and then we could write a ruby script to time those entries an then we could .... Just ignore this paragraph
Thank you for volunteering...
James Edward Gray II
On Sep 14, 2007, at 4:58 PM, diego scataglini wrote:
Well, we weren't all broken:
$ ruby ip_to_country.rb 5.1.1.1
Unknown
James Edward Gray II
On Sep 16, 2007, at 7:21 PM, Bill Kelly wrote:
James Edward Gray II schrieb:
Anyway -- i'd like to see a 100000 lookups comparison *hehe*
Great. Run it for us and let us know how we do.
Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory
Thanks for putting that together! Fun to see the different
times.
If I've understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?
I wasn't expecting that, at all...
I would have bet Simon's would be faster. Strange!
Regards,
Bill
From: "Matthias Wächter" <matthias@waechter.wiz.at>
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Thank you very much for putting this together, and wow, your code is lightning quick.
James Edward Gray II
On Sep 18, 2007, at 6:23 PM, Matthias Wächter wrote:
James Edward Gray II schrieb:
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Anyway -- i'd like to see a 100000 lookups comparison *hehe*
Great. Run it for us and let us know how we do.
Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory
Hi,
Yes, I only tested with my version of the file, from which I manually
removed the comments. I don't think I'll have time to fix that, at
least this week. Anyway, I find it strange that the FasterCSV version
doesn't work, because it delegates the parsing of the file to that gem,
and the rest is pretty simple. On the other hand, I don't expect the first
version to perform anywhere near the other solutions, so it's not so
important :-).
Jesus.
On 9/19/07, Matthias Wächter <matthias@waechter.wiz.at> wrote:
James Edward Gray II schrieb:
> On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
>> Anyway -- i'd like to see a 100000 lookups comparison *hehe*
> Great. Run it for us and let us know how we do.

[**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
the later version you sent your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn't
manage to make it run, sorry.
"James Edward Gray II" <james@grayproductions.net> wrote in message news:4A414BD9-8B56-4897-A13E-
Well, we weren't all broken:
I've sent my comment and refreshed new headers - and yes, your solution came in.
--EK
>
> Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory.

If I've understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?
to be the fastest (so far) of those that operate on the
unmodified .csv file?
It depends what you mean by unmodified - my algorithm runs off the
original file, the only "modification" I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like I could have done that every time my script
was run and only added 5 ms.
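That boundary computation might look something like the following (first_record_offset is an illustrative name; the sketch finds only the leading boundary, since the trailing one can simply be taken as File.size):

```ruby
# Return the byte offset of the first line that is neither a comment
# nor blank, so later runs can seek straight past the header block.
def first_record_offset(path)
  offset = 0
  File.foreach(path) do |line|
    return offset unless line.start_with?("#") || line.strip.empty?
    offset += line.bytesize
  end
  offset
end

# Tiny demonstration with a made-up file:
File.write("sample.csv", "# comment lines\n\"16777216\",\"16777471\",\"US\"\n")
puts first_record_offset("sample.csv")  # 16
```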
I would have bet Simon's would be faster. Strange!
I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it's the regexp that slowed it
down.
-Adam
On 9/18/07, Bill Kelly <billk@cts.com> wrote:
>> On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Comments are not a part of the CSV specification, so FasterCSV doesn't address them. I would like to add ignore patterns at some point though.
James Edward Gray II
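In the meantime, a simple workaround is to strip comment lines before handing the data to the parser. A sketch using the csv standard library (which FasterCSV became in Ruby 1.9); the sample rows here are made up:

```ruby
require "csv"

raw = <<~DATA
  # IP-TO-COUNTRY database (comment line, not valid CSV)
  "1141517312","1141520383","US"
  "1141520384","1141522431","DE"
DATA

# Drop comment lines, then parse the rest normally.
rows = CSV.parse(raw.lines.reject { |l| l.start_with?("#") }.join)
p rows.first  # ["1141517312", "1141520383", "US"]
```

Newer versions of the csv gem also accept a skip_lines: pattern that does this filtering during parsing.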
On Sep 19, 2007, at 2:27 AM, Jesús Gabriel y Galán wrote:
On 9/19/07, Matthias Wächter <matthias@waechter.wiz.at> wrote:
James Edward Gray II schrieb:
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Anyway -- i'd like to see a 100000 lookups comparison *hehe*
Great. Run it for us and let us know how we do.
[**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
the later version you sent your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn't
manage to make it run, sorry.

Hi,
Yes, I only tested with my version of the file, from which I manually
removed the comments. I don't think I'll have time to fix that, at
least this week. Anyway, I find strange the FasterCSV version doesn't
work, because it delegates the parsing of the file to that gem, and
the rest is pretty simple.
>
> If I've understood correctly, it looks like my solution seems
> to be the fastest (so far) of those that operate on the
> unmodified .csv file?

It depends what you mean by unmodified - my algorithm runs off the
original file, the only "modification" I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like I could have done that every time my script
was run and only added 5 ms.
Ah, I see. Cool.
Incidentally, since the file format description indicates comment
lines may appear anywhere in the file, I allowed for that. However,
I doubt adding a loop to your gets/split logic to keep going until a
valid record was found would affect your time much at all.
Nice job
Regards,
Bill
From: "Adam Shelly" <adam.shelly@gmail.com>
On 9/18/07, Bill Kelly <billk@cts.com> wrote:
Adam Shelly wrote:
I would have bet Simon's would be faster. Strange!
I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it's the regexp that slowed it
down.
Without looking at the other solutions in detail, I think one of the problems
may be that my solution opens the file for each lookup - that's of course easy
to fix. I don't know if that's the problem or the overhead of creating ten
thousand IPAddr objects - I refuse to analyse this in depth because I don't
have a use case for looking up that many locations in a single run.
(On the other hand, I do understand how much fun it can be to optimize such a
problem to death - so go on if you like, I don't have the motivation - this time.)
cheers
Simon