[QUIZ] IP to Country (#139)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:

  http://software77.net/cgi-bin/ip-country/geo-ip.pl

To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.

  $ time ruby ip_to_country.rb 68.97.89.187
  US
  
  real 0m0.314s
  user 0m0.259s
  sys 0m0.053s

Ruby Quiz wrote:

[...]

  $ time ruby ip_to_country.rb 68.97.89.187
  US
  
  real 0m0.314s
  user 0m0.259s
  sys 0m0.053s

Is an 'initialisation run' allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

Is it motivating or a spoiler to post timings?

cheers

Simon

Ruby Quiz wrote:

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in.

I assume that it's not ok for folk to use my Ruby GeoIP gem?
It has a reasonable compromise between speed and memory usage.
:slight_smile:

Clifford Heath.

Hi,

This is my second submission to Ruby Quiz, any tip will be greatly
appreciated. First I downloaded the file and removed all comments,
then tried a straightforward approach. Here it is:

require 'faster_csv'
ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

FasterCSV.foreach(file) do |line|
  if (line[0].to_i <= ip && line[1].to_i >= ip)
    puts line[4]
    exit
  end
end

It allows to pass the IP as the first parameter, then transforms it to
the format in the file (the "base" 256). The file can be passed as the
second argument, by default I used my modified file without the
comments.

The method uses FasterCSV to parse the file line by line, checking ip
against the range. The result for the IP in the example was:

time ruby quiz139a.rb 68.97.89.187
US

real 0m0.355s
user 0m0.328s
sys 0m0.024s

So, it was a little bit more than the one presented with the problem.
Unfortunately, things don't go so good when you search for an IP that
matches entries at the end of the file:

time ruby quiz139a.rb 255.0.0.1
ZZ

real 0m5.337s
user 0m5.036s
sys 0m0.292s

More than 5 seconds :frowning:

So, I went on to build a second version. This time the approach I
took, as the file is ordered is a binary search directly on the file.
Here it is:

require 'ftools'

ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

File.open(file) do |f|
  low = 0
  high = f.stat.size
  f.seek(high / 2)
  while true
    while ((a = f.getc) != 10)
      f.seek(-2, IO::SEEK_CUR)
    end
    pos = f.pos
    line = f.readline.split(",")
    low_range = line[0][1..-2].to_i
    high_range = line[1][1..-2].to_i
    if (low_range > ip)
      high = pos
      offset = (f.pos - pos) + ((high - low) / 2)
      f.seek(-offset, IO::SEEK_CUR)
    elsif (high_range < ip)
      low = f.pos
      f.seek((high-low) / 2, IO::SEEK_CUR)
    else
      puts line[4][1..-2]
      exit
    end
  end
end

What the method does is a binary search in the file. It starts a low
and high variables to manage the partition, then positions the cursor
in the file to the middle. Every time, the method reads back until it
finds a 10 (\n), then reads that line to check the IP range. If the
lower range is higher than the IP, the method moves the cursor in the
file, down to half of the current low-high range. If the higher range
is lower than the IP, then it moves up half of the range. The timings
here are much better:

time ruby quiz139b.rb 68.97.89.187
US

real 0m0.077s
user 0m0.048s
sys 0m0.004s

time ruby quiz139b.rb 255.0.0.1
ZZ

real 0m0.069s
user 0m0.060s
sys 0m0.008s

Apart from the general strategy of the binary search, I haven't had
the time to give too much thought to the details, like how to extract
the values after the readline. Any tips regarding that would be
greatly appreciated. Also I don't know if there's a better way of
moving back in the file to the previous line break (the while (a =
f.getc) != 10 part). Tips?

Thanks for this quiz. I hope I had time to work on every quiz.

Regards,

Jesus.

···

On 9/14/07, Ruby Quiz <james@grayproductions.net> wrote:

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:

        software77.net

To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.

Ruby Quiz wrote:

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:

  http://software77.net/cgi-bin/ip-country/geo-ip.pl

To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.

  $ time ruby ip_to_country.rb 68.97.89.187
  US
  
  real 0m0.314s
  user 0m0.259s
  sys 0m0.053s

This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation.

Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups.

First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line. On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline").

I use the script like this:

$ ./rand_ip.rb 100000 byline > ip-test-list

Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries.

The packer doesn't accept any parameters, just call

$ ./packer.rb

Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory.

The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all.

Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip >= addr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided.

I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line:

$ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null

For more addresses I use one of the following -- the last usage is to make successive runs with equal input:

$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null

My results for 100000 lookups are:

$ time ./packer.rb

real 0m1.634s
user 0m1.576s
sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.797s
user 0m0.740s
sys 0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null

real 0m11.091s
user 0m9.673s
sys 0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null

real 0m7.131s
user 0m6.960s
sys 0m0.168s

btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$ time ruby /dev/null

real 0m0.129s
user 0m0.120s
sys 0m0.012s
$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

Cheers,
- Matthias

rand_ip.rb (140 Bytes)

packer.rb (784 Bytes)

search_file.rb (1.32 KB)

search_str.rb (1.14 KB)

Hi,

one of the (without a doubt) many binary search solutions. This one reads a 1K
chunk of the file around the current position and uses a regexp to extract
valid lines (and values) from this chunk.

The reasoning behind that is that data from disk is read in blocks anyway (in
most cases even larger than 1K) and the last 2 or 3 iterations of the algorithm
can be avoided. (max depth encountered so far was 14)

Using the regexp frees us from 'manually' looking for newlines (it will find
about 10 valid sets of data in each iteration).

I tried a lot of stuff like biasing the median towards the side where the
solution will be most likely found, using 'static' buffers for the File#read
calls or unrolling the recursion but nothing proofed realy usefull so i decided
to go with the short and fast enough(tm) solution.

···

------------------------------------------------------------------------------
require 'ipaddr'

class IPAddr
  def country_code()
    open('IpToCountry.csv'){|file| code(file, to_i, 0, file.stat.size)}
  end

  private
  def code(file, ip, min, max)
    median = (min + max) / 2
    file.seek(median - 512)
    ary = file.read(1024).scan(/^"(\d*)","(\d*)","(?:\w*)","(?:\d*)","(\w*)"/)
    return code(file, ip, median, max) if ary.empty? || ary.last[1].to_i < ip
    return code(file, ip, min, median) if ary.first[0].to_i > ip
    ary.find{|l| l[1].to_i >= ip}[2]
  end
end

ARGV.each{|arg| puts IPAddr.new(arg).country_code} if $0 == __FILE__
------------------------------------------------------------------------------

timings:

$ time ruby quiz139.rb 0.0.0.0 68.97.89.187 84.191.4.10 80.79.64.128
210.185.128.123 202.10.4.222 192.189.119.1 255.255.255.255
ZZ
US
DE
RU
JP
AU
EU
ZZ

real 0m0.062s
user 0m0.046s
sys 0m0.030s

cheers

Simon

Hello,

Had some fun with this one :). Basically, my approach was to download the
.csv database file and use that file to convert from IP => Country code. The
program reads this file each time it processes an individual lookup, with no
precomputations required.

Initially I coded up a linear search (SLOW!) and a binary search that read
the entire file into an array (somewhat faster). But the final "binary seek"
approach is more efficient - it uses IO.seek to only read the specific lines
that it needs, thus providing a good balance of CPU/Memory/IO. Binary seek
does this by taking advantage of the fact that IP ranges in the file are in
numerical order. I was planning to also code up a file that uses a web
service to do the conversion, but I doubt that would win any speed contests
:).

Anyway, the algorithm is a simple modification of a standard binary search:

def binary_seek(ip_addr_num)
   low = 0
   high = File.size(Filename)
   fp = File.open(Filename)

   # Find first line of real data and set "low" placeholder accordingly
   line = fp.gets
   while line.strip.size == 0 or line[0].chr == "#"
    line = fp.gets
    low = low + line.size
   end

   # Then find the corresponding the IP Range, if any
   while (low <= high)
       mid = (low + high) / 2
       fp.seek(mid, IO::SEEK_SET)

       line = fp.gets # read once to find the next line
       line = fp.gets # read again to get the next full line
       from, to, country_code = parse_line(line)

       if (from == nil or from > ip_addr_num) # Safety check
           high = mid - 1
       elsif (to < ip_addr_num)
           low = mid + 1
       else
           fp.close
           return "Binary Seek Search: #{country_code}"
       end
   end

   fp.close
   return "No Country Found"
end

A hard-coded constant points to the look up file:

Filename = "IpToCountry.csv" # IP to Country Code database file

And utility methods are provided to make the core algorithm easier to
follow:

# Convert an IP (V4) address (as string) to a decimal number
# Example from file:
# 1.2.3.4 = 4 + (3 * 256) + (2 * 256 * 256) + (1 * 256 * 256 * 256)
# is 4 + 768 + 13,1072 + 16,777,216 = 16,909,060
def convert_ip_str_to_num(ip_addr_str)
  ip_addr = ip_addr_str.split(".")
  ip_addr_num = ip_addr[0].to_i * 256 ** 3 +
                ip_addr[1].to_i * 256 ** 2 +
                ip_addr[2].to_i * 256 ** 1 +
                ip_addr[3].to_i
  return ip_addr_num
end

# Parse an IP range field of the IP/Country DB file
def parse_line(line)
  line_parsed = line.split(',')

  # Validate to handle comments and unexpected data
  if line_parsed != nil and line_parsed.size >= 5
    from = line_parsed[0].tr('"', '').to_i
    to = line_parsed[1].tr('"', '').to_i
    country_code = line_parsed[4]
  end

  return from, to, country_code
end

Finally, the program simply iterates over a list of input IP addresses:

for ip in ARGV
  puts binary_seek(convert_ip_str_to_num(ip))
end

Here is a test run on a Pentium 4 2.4 Ghz Desktop with 512 MB RAM, using one
of the test cases from the initial email chain:

justin@marketgarden:~/Desktop/139_IP_to_Country$ time ruby IP_to_Country.rb
68.97.89.187 84.191.4.10 80.79.64.128 210.185.128.123 202.10.4.222
192.189.119.1
Binary Seek Search: "US"
Binary Seek Search: "DE"
Binary Seek Search: "RU"
Binary Seek Search: "JP"
Binary Seek Search: "AU"
Binary Seek Search: "EU"

real 0m0.020s
user 0m0.016s
sys 0m0.004s

A pastie is available here: http://pastie.caboo.se/97562
This file contains Binary Seek as well as linear and binary search
implementations.
Thanks,

Justin

···

On 9/14/07, Ruby Quiz <james@grayproductions.net > wrote:

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby Talk follow the discussion. Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is to write a simple utility. Your program should
accept
an IP address as a command-line argument and print out the two letter code
for
the country that IP is assigned in. You can find a database for the
matching
at:

         software77.net

To keep the problem interesting though, let's write our programs with a
focus on
speed and memory efficiency.

        $ time ruby ip_to_country.rb 68.97.89.187
        US

        real 0m0.314s
        user 0m0.259s
        sys 0m0.053s

My solution suffers from over-generality (is supposed to work for any sorted
string file by any sorting criteria), and from over-limiting memory usage -
I like Simon's approach, reading in chunks makes everything easier and
faster.

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions (256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)

(yes, I know, just trying to find an excuse for not very elegant solution -
edge cases kill elegancy :slight_smile:

-----------------------------------------------------------------------------

def bin_find(file,search)
  if block_given?
    compare = lambda { |a, b| yield(a, b) }
  else
    compare = lambda { |a, b| a <=> b }
  end
  open(file) do |f|
    return bin_search(f,search,f.lstat.size,&compare)
  end
end

def bin_search(f,search,size)
  start,finish=0,size
  while start<finish
    dist=(finish-start)/2
    f.seek(start+dist)
    f.readline unless start+dist==0
    case (l1=f.readline.chomp rescue false) ? yield(search,l1) : -1
    when -1
      next if (finish=start+dist)>0
      break
    when 0
      return l1
    else
      case (l2=f.readline.chomp rescue false) ? yield(search,l2) : -1
      when -1
        return l1
      when 0
        return l2
      else
        start+=dist; next
      end
    end
  end
  nil
end

nums=
out=true
if ARGV[0]=='test'
  n=ARGV[1].to_i
  n.times{nums << rand(4294967296)}
  out=false
else
  ARGV.each do |argv|
    nums << ((($1.to_i*256)+$2.to_i)*256+$3.to_i)*256+$4.to_i if
argv=~/(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})/
  end
end
if nums.empty?
  puts "Please enter valid ip(s) (or use '#{$0} test NNN' for testing)"
  exit
end

nums.each do |num|
  ctry='Unknown'
  res=bin_find('IpToCountry.csv',num) { |search, str|
    str.empty? || str[0,1]!='"' ? 1 : search <=>
str.gsub('"','').split(',')[0].to_i
  }.gsub('"','').split(',')
  ctry=res[4] if (res[0].to_i..res[1].to_i)===num
  puts ctry if out
end

Greetings!

Thanks for the fun quiz.

Looks like my solution is a little more verbose than many
submitted so far.

Like many others, I'm performing a binary search on the
original data file. For better or for worse, I imposed the
following restrictions on my design, just out of curiosity
for how it would turn out:

  - non-recursive binary search

  - only using seek() and gets() to access the database
    (using gets means no backward scanning for the beginning
    of a record)

Regards,

Bill

====== Begin file: 139_ip_to_country.rb =====

# IP FROM IP TO REGISTRY ASSIGNED CTRY CNTRY COUNTRY
# "1346797568","1346801663","ripencc","20010601","IL","ISR","ISRAEL"

IpRec = Struct.new(:from, :to, :registry, :assigned, :ctry, :cntry, :country) do
  def contains?(ip_value)
    ip_value >= self.from && ip_value <= self.to
  end
end

class IpToCountry
  DATABASE_FNAME = "IptoCountry.csv"

  def initialize(db_fname=DATABASE_FNAME)
    @db_size = File.stat(db_fname).size
    @db = File.open(db_fname, "r")
    @db_pos = 0
  end

  def close
    @db.close
    @db = nil
  end

  # Lookup IpRec containing ip. Exception is raised if not found.
  def search(ip)
    ip_value = self.class.ip_to_int(ip)
    find_rec(ip_value)
  end

  # Binary search through sorted IP database.
  # The search uses non-record-aligned byte offsets into the
  # file, but always scans forward for the next parsable
  # record from a given offset. It keeps track of the
  # byte offset past the end of the parsed record in @db_pos.
  def find_rec(ip_value)
    lower, upper = 0, @db_size - 1
    prev_rec = nil
    loop do
      ofst = lower + ((upper - lower) / 2)
      rec = next_rec_from_ofst(ofst)
      if rec == prev_rec
        # We have narrowed the search to where we're hitting
        # the same record each time. Can't get any narrower.
        # But these are variable-length records, so there may
        # be one or more at our lower bound we haven't seen.
        # Do a linear scan from our lower bound to examine the
        # few records remaining.
        ofst = lower
        while (rec = next_rec_from_ofst(ofst)) != prev_rec
          return rec if rec.contains? ip_value
          ofst = @db_pos
        end
        raise("no record found for ip_value #{ip_value}")
      end
      return rec if rec.contains? ip_value
      if ip_value < rec.from
        upper = ofst
      else
        lower = @db_pos
      end
      prev_rec = rec
    end end

  def next_rec_from_ofst(ofst)
    @db.seek(ofst)
    @db_pos = ofst
    while line = @db.gets
      @db_pos += line.length
      break if rec = self.class.parse_rec(line)
    end
    rec || raise("no record found after ofst #{ofst}")
  end

  def self.ip_to_int(ip_str)
    ip_str.split(".").map{|s|s.to_i}.pack("c4").unpack("N").first
  end

  # NOTE: Using a strict regexp instead of a simpler split operation,
  # because it's important we find a valid record, not one embedded
  # in a comment or such.
  def self.parse_rec(line)
    if line =~ %r{\A \s*"(\d+)"\s*,
                     \s*"(\d+)"\s*,
                     \s*"(\w+)"\s*,
                     \s*"(\d+)"\s*,
                     \s*"(\w+)"\s*,
                     \s*"([^"]+)"\s* \z
                }x
      rec = IpRec.new($1.to_i, $2.to_i, $3, $4.to_i, $5, $6, $7)
    end
  end
end

if $0 == __FILE__

  # Accepts zero-or-more IP addresses on the command line.

  ip2c = IpToCountry.new
    ARGV.each do |ip|
    rec = ip2c.search(ip) rescue nil
    puts( rec ? rec.ctry : "(#{ip} not found)" )
  end

end

====== End file: 139_ip_to_country.rb =====

====== Begin file: test_139_ip_to_country.rb ======

require '139_ip_to_country'

require 'test/unit'

class TestIpToCountry < Test::Unit::TestCase

  # NOTE: the following are the first two and last two records in
  # my database file:
  REC_0 = IpToCountry.parse_rec(%{"0","16777215","IANA","410227200","ZZ","ZZZ","RESERVED"})
  REC_1 = IpToCountry.parse_rec(%{"50331648","67108863","ARIN","572572800","US","USA","UNITED STATES"})
  REC_NEG_1 = IpToCountry.parse_rec(%{"4261412864","4278190079","IANA","410227200","ZZ","ZZZ","RESERVED"})
  REC_LAST = IpToCountry.parse_rec(%{"4278190080","4294967295","IANA","410227200","ZZ","ZZZ","RESERVED"})

  def test_find_rec
    ip2c = IpToCountry.new
    assert_equal( REC_0, ip2c.find_rec(REC_0.from) )
    assert_equal( REC_0, ip2c.find_rec(REC_0.to) )
    assert_equal( REC_1, ip2c.find_rec(REC_1.from) )
    assert_equal( REC_1, ip2c.find_rec(REC_1.to) )
    assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.from) )
    assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.to) )
    assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.from) )
    assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.to) )
    ip2c.close
  end

  def test_search
    ip2c = IpToCountry.new
    rec = ip2c.search("67.19.248.74")
    assert_not_nil( rec )
    assert_equal( "ARIN", rec.registry )
    assert_equal( "US", rec.ctry )
  end

end

====== End file: test_139_ip_to_country.rb ======

I decided to make one extra, reusable (?), abstraction layer:
Class OrderedLinesFile implements method find_line, which is
given a block. This block is called with a line of the file.
The block has to parse this line and returns -1, 0 or 1 for "go
back", "bingo!" and "go forward", respectively. Method
find_line then repositions the pointer in the file and calls
block again, if necessary.

Please shoot...

gegroet,
Erik V. - http://www.erikveen.dds.nl/

BTW, lines 17899 and 17900 of the downloaded file both contain
"2624585728", which is strange... (Downloaded: 2007-09-17 08:34
(UTC).)

···

----------------------------------------------------------------

$ time ruby quiz139a.rb 11.22.33.44
11.22.33.44 US

real 0m0.011s
user 0m0.006s
sys 0m0.004s

----------------------------------------------------------------

$ wc all_ip_blocks
82358 82358 905115 all_ip_blocks

$ time ruby quiz139b.rb all_ip_blocks

real 0m45.681s # That's 1800 IPs per second.
user 0m40.351s
sys 0m5.300s

----------------------------------------------------------------

class OrderedLinesFile
   def initialize(file_name)
     @file_name = File.expand_path(file_name)
   end

   def find_line(&block)
     line = nil

     File.open(@file_name) do |io|
       position = 0
       delta = File.size(@file_name)/2
       line = io.gets # The first line of the file.
       line = io.gets while /^\s*#/ =~ line or /^\s*$/ =~
line

       while delta > 0 and line and (space_ship = block.call(line)) !=
0
         position += space_ship < 0 ? -delta : +delta
         delta /= 2

         if position > 0
           io.pos = position
           broken_line = io.gets # Skip the current (broken)
line.
           line = io.gets # The current line of the
file.
           line = io.gets while /^\s*#/ =~ line or /^\s*
$/ =~ line
         else
           delta = 0 # Stop.
         end
       end

       line = nil if delta == 0 # Nothing found.
     end

     line
   end
end

class IpToCountry
   FILE = "IpToCountry.csv"

   def country_of(ip_addr)
     ip_addr = ip_addr.split(/\./).collect{|n| n.to_i}.inject{|n,
m> n*256+m} # "1.2.3.4" --> 16909060
     olf = OrderedLinesFile.new(FILE)
     res = nil

     olf.find_line do |line|
       ip_from, ip_to, registry, assigned, ctry, cntry, country =
line.gsub(/"/, "").split(/,/, 7)

       if ip_addr < ip_from.to_i
         -1 # Go back in the file.
       elsif ip_addr > ip_to.to_i
         +1 # Go forward in the file.
       else
         res = ctry
         0 # Bingo!
       end
     end

     res
   end
end

itc = IpToCountry.new

ARGV.each do |ip_addr|
   puts "%-16s %s" % [ip_addr, itc.country_of(ip_addr)||"??"]
end

----------------------------------------------------------------

Ok, here's the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s

Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
  f = ip.split(/\./).join "/"
  puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :slight_smile:

- Mark.

Ruby Quiz wrote:

[...]

  $ time ruby ip_to_country.rb 68.97.89.187
  US
  
  real 0m0.314s
  user 0m0.259s
  sys 0m0.053s

Is an 'initialisation run' allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

My script does need and initialization run, yes. I don't see any harm in paying a one time penalty to set things up right.

Is it motivating or a spoiler to post timings?

Motivating, definitely. :slight_smile:

James Edward Gray II

···

On Sep 14, 2007, at 2:20 PM, Simon Kröger wrote:

Please do. I would love to see how it stacks up against the custom solutions we will no doubt see.

James Edward Gray II

···

On Sep 15, 2007, at 2:45 AM, Clifford Heath wrote:

Ruby Quiz wrote:

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in.

I assume that it's not ok for folk to use my Ruby GeoIP gem?
It has a reasonable compromise between speed and memory usage.
:slight_smile:

This is the code I used to generate the quiz example. To give credit where credit is due though, it was heavily inspired from some code Allan Odgaard shared with me:

#!/usr/bin/env ruby -wKU

require "open-uri"
require "zlib"

begin
   require "rubygems"
rescue LoadError
   # load without gems
end

begin
   require "faster_csv"
   FCSV.build_csv_interface
rescue LoadError
   require "csv"
end

class IP
   def initialize(address)
     @address_chunks = address.split(".").map { |n| Integer(n) }
     raise AgumentError, "Malformed IP" unless @address_chunks.size == 4
   end

   def to_i
     @address_chunks.inject { |result, chunk| result * 256 + chunk }
   end

   STRING_SIZE = new("255.255.255.255").to_i.to_s.size

   def to_s
     "%#{STRING_SIZE}s" % to_i
   end
end

class IPToCountryDB
   REMOTE = "http://software77.net/cgi-bin/ip-country/geo-ip.pl?action=download"
   LOCAL = "ip_to_country.txt"

   RECORD_SIZE = IP::STRING_SIZE * 2 + 5

   def self.build(path = LOCAL)
     open(path, "w") do |db|
       open(REMOTE) do |url|
         csv = Zlib::GzipReader.new(url)

         last_range = Array.new
         csv.each do |line|
           next if line =~ /\A\s*(?:#|\z)/
           from, to, country = CSV.parse_line(line).values_at(0..1, 4).
                                   map { |f| Integer(f) rescue f }
           if last_range[2] == country and last_range[1] + 1 == from
             last_range[1] = to
           else
             build_line(db, last_range)
             last_range = [from, to, country]
           end
         end
         build_line(db, last_range)
       end
     end
   end

   def self.build_line(db, fields)
     return if fields.empty?
     db.printf("%#{IP::STRING_SIZE}s\t%#{IP::STRING_SIZE}s\t%s\n", *fields)
   end
   private_class_method :build_line

   def initialize(path = LOCAL)
     begin
       @db = open(path)
     rescue Errno::ENOENT
       self.class.build(path)
       retry
     end
   end

   def search(address)
     binary_search(IP.new(address).to_i)
   end

   private

   def binary_search(ip, min = 0, max = @db.stat.size / RECORD_SIZE)
     return "Unknown" if min == max

     middle = (min + max) / 2
     @db.seek(RECORD_SIZE * middle, IO::SEEK_SET)

     if @db.read(RECORD_SIZE) =~ /\A *(\d+)\t *(\d+)\t([A-Z]{2})\n\z/
       if ip < $1.to_i then binary_search(ip, min, middle)
       elsif ip > $2.to_i then binary_search(ip, middle + 1, max)
       else $3
       end
     else
       raise "Malformed database at offset #{RECORD_SIZE * middle}"
     end
   end
end

if __FILE__ == $PROGRAM_NAME
   require "optparse"

   options = {:db => IPToCountryDB::LOCAL, :rebuild => false}

   ARGV.options do |opts|
     opts.banner = "Usage:\n" +
                   " #{File.basename($PROGRAM_NAME)} [-d PATH] IP\n" +
                   " #{File.basename($PROGRAM_NAME)} [-d PATH] -r"

     opts.separator ""
     opts.separator "Specific Options:"

     opts.on("-d", "--db PATH", String, "The path to database file") do |path|
       options[:db] = path
     end
     opts.on("-r", "--rebuild", "Rebuild the database") do
       options[:rebuild] = true
     end

     opts.separator "Common Options:"

     opts.on("-h", "--help", "Show this message.") do
       puts opts
       exit
     end

     begin
       opts.parse!
       raise "No IP address given" unless options[:rebuild] or ARGV.size == 1
     rescue
       puts opts
       exit
     end
   end

   if options[:rebuild]
     IPToCountryDB.build(options[:db])
   else
     puts IPToCountryDB.new(options[:db]).search(ARGV.shift)
   end
end

__END__

James Edward Gray II

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions (256.256.256.256 or 0.0.0.-1, latter if comments are stripped out)

Hi, could you clarify what is meant by lying about subnets
1, 2, and 5?

Thanks,

Bill

···

From: "Eugene Kalenkovich" <rubify@softover.com>

After this:

Here's my submission. It's almost exactly the same as Jesus'.

[...]

while low <= high

[...]

and this:

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions (256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)

I made a couple of modifications to my code. Here is the latest version:

require 'ftools'

ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

File.open(file) do |f|
  low = 0
  high = f.stat.size
  f.seek(high / 2)
  while low < high
    while (((a = f.getc) != 10) && (f.pos > 2))
        f.seek(-2, IO::SEEK_CUR)
    end
    pos = f.pos
    line = f.readline.split(",")
    low_range = line[0][1..-2].to_i
    high_range = line[1][1..-2].to_i
    if (low_range > ip)
      high = pos
      offset = (f.pos - pos) + ((high - low) / 2)
      f.seek(-offset, IO::SEEK_CUR)
    elsif (high_range < ip)
      low = f.pos
      f.seek((high-low) / 2, IO::SEEK_CUR)
    else
      puts line[4][1..-2]
      exit
    end
  end
  puts "No country found"
end

I changed to while low < high, and also the walking backwards failed
when reading the first line (btw, Steve, won't your solution also fail
for that case?).

Now:

time ruby quiz139b.rb 2.1.1.1
No country found

real 0m0.030s
user 0m0.016s
sys 0m0.004s

time ruby quiz139b.rb 5.1.1.1
No country found

real 0m0.032s
user 0m0.008s
sys 0m0.008s

time ruby quiz139b.rb 1.1.1.1
No country found

real 0m0.033s
user 0m0.016s
sys 0m0.000s

time ruby quiz139b.rb 256.256.256.256
No country found

real 0m0.096s
user 0m0.008s
sys 0m0.000s

Thanks,

Jesus.

···

On 9/16/07, steve d <oksteev@yahoo.com> wrote:
On 9/16/07, Eugene Kalenkovich <rubify@softover.com> wrote:

I just performed the tests another time on my desktop computer with a freshly installed Sabayon Linux. I don't know why this one performs that much better -- it's gentoo-based, it's the same file system, but it has no SCSI controller - maybe that makes it faster? Or is it the dual-core ... anyway.

It's about twice as fast and requires a fraction of the "sys" times. Also the "empty" ruby run is only 5 ms. Mysteriously.

The packer is down at 0.8s, and 100000 lookups are at 6.2 s (file-based aka memory efficient) and 3.7 s (everything in-memory).

Anyway -- i'd like to see a 100000 lookups comparison :slight_smile: *hehe*

Matthias Wächter schrieb:

My results for 100000 lookups are:

$ time ./packer.rb

real 0m1.634s
user 0m1.576s
sys 0m0.056s

$ time ./packer.rb

real 0m0.716s
user 0m0.708s
sys 0m0.008s

$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.797s
user 0m0.740s
sys 0m0.056s

$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.322s
user 0m0.317s
sys 0m0.005s

$ time ./search_file.rb < ip-test-list > /dev/null

real 0m11.091s
user 0m9.673s
sys 0m1.420s

$ time ./search_file.rb <ip-test-list >/dev/null

real 0m6.201s
user 0m5.316s
sys 0m0.885s

$ time ./search_str.rb < ip-test-list > /dev/null

real 0m7.131s
user 0m6.960s
sys 0m0.168s

$ time ./search_str.rb <ip-test-list >/dev/null

real 0m3.714s
user 0m3.707s
sys 0m0.006s

btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$ time ruby /dev/null

real 0m0.129s
user 0m0.120s
sys 0m0.012s

$ time ruby /dev/null

real 0m0.005s
user 0m0.004s
sys 0m0.001s

$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux

$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007 x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

- Matthias

Pretty clever.

I bet with the right prep, this could even be a pretty viable approach. Instead of building a file for each address you could create a directory structure for the hexadecimal representations of each piece of the address. The final layer could be handled as you have here or with a search through a much smaller file.

James Edward Gray II

···

On Sep 19, 2007, at 10:15 AM, Mark Thomas wrote:

Ok, here's the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s

Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
  f = ip.split(/\./).join "/"
  puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :slight_smile:

James Edward Gray II wrote:

···

On Sep 14, 2007, at 2:20 PM, Simon Kröger wrote:

Ruby Quiz wrote:

[...]

    $ time ruby ip_to_country.rb 68.97.89.187
    US
    
    real 0m0.314s
    user 0m0.259s
    sys 0m0.053s

Is an 'initialisation run' allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

My script does need and initialization run, yes. I don't see any harm
in paying a one time penalty to set things up right.

Is it motivating or a spoiler to post timings?

Motivating, definitely. :slight_smile:

James Edward Gray II

Ok, my script does not need any initialization, it uses the file
IpToCountry.csv exactly as downloaded.

----------------------------------------------------------------
$ ruby -v
ruby 1.8.4 (2005-12-24) [i386-cygwin]

$ time ruby quiz139.rb 68.97.89.187
US

real 0m0.047s
user 0m0.030s
sys 0m0.030s

$ time ruby quiz139.rb 84.191.4.10
DE

real 0m0.046s
user 0m0.046s
sys 0m0.015s
----------------------------------------------------------------

This is on a Pentium M 2.13GHz Laptop with 2GB RAM and rather slow HD.

cheers

Simon

"Bill Kelly" <billk@cts.com> wrote in message
news:00b301c7f897$a8c9dc20$6442a8c0@musicbox...

From: "Eugene Kalenkovich" <rubify@softover.com>

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions
(256.256.256.256 or 0.0.0.-1, latter if comments are stripped out)

Hi, could you clarify what is meant by lying about subnets
1, 2, and 5?

Check what ccountry is 5.1.1.1. If you get any valid answer - this answer
is a lie :slight_smile: