Database-like Algorithmic Problem

hi all,

ok here's the scoop. i have a collection of hashes all with an arbitrary
number of symbolic keys pointing to arbitrary objects. i have a template
hash also containing symbol keys pointing to 'matching objects', i.e.
objects to be matched with the objects in the aforementioned collection,
and i want to pull out the first match. an example:

# please don't psychoanalyze the below :slight_smile:
collection = [
聽聽{ :foo => 'python', :bar => 'php', :x => 69 },
聽聽{ :foo => 'scheme', :mama => 'ruby', :papa => 'C' },
聽聽{ :x => 386 },
聽聽{ :mama => 'ruby', :papa => 'simula', :lil_bro => 'fortran' }
]

template1 = { :mama => 'ruby', :papa => 'C' }
template2 = { :x => Numeric }

match_first(template1, collection)
# should return: { :foo => 'scheme', :mama => 'ruby', :papa => 'C' }
match_all(template2, collection)
# should return: [{:x => 386 }, {:foo => 'python', :bar => 'php', :x =>
69}]

so obviously === is being used internally as is evidenced by Numeric in
template2, but what kind of datastructures/algorithms should i be
looking at for the most efficient (time-wise) implementation, keeping in
mind, there may be many hashes in the collection and those hashes are
generally not very large, i.e. contain few key/value pairs. i'm not too
savy on the taxonomy of algorithms so any pointers to what i should be
looking at to read up on would be appreciated.

my potentially naive solution is something like the following (in full
color http://pastie.caboo.se/130033):

require 'set'

@collection = {}

def add hash
聽聽hash.each do |key,value|
聽聽聽聽(@collection[key] ||= Set.new) << hash
聽聽end
end

def candidates template
聽聽sets = []
聽聽template.keys.each do |key|
聽聽聽聽sets << @collection[key] if @collection.include? key
聽聽end
聽聽if sets.empty?
聽聽聽聽return sets
聽聽else
聽聽聽聽return sets.inject(sets.pop) do |intersection, set|
聽聽聽聽聽聽intersection & set
聽聽聽聽end
聽聽end
end

def match template, hash
聽聽template.each do |key, match|
聽聽聽聽return false unless match === hash[key]
聽聽end
聽聽return true
end

def match_first template
聽聽candidates(template).each do |hash|
聽聽聽聽return hash if match(template, hash)
聽聽end
聽聽return nil
end

def match_all template
聽聽return candidates(template).select do |hash|
聽聽聽聽match(template, hash)
聽聽end
end

[
聽聽{ :foo => 'python', :bar => 'php', :x => 69 },
聽聽{ :foo => 'scheme', :mama => 'ruby', :papa => 'C' },
聽聽{ :x => 386, :papa => 'clause' },
聽聽{ :mama => 'ruby', :papa => 'simula', :lil_bro => 'fortran' },
聽聽{ :mama => 1946, :boo => 'far', :x => 'x'}
].each do |hash|
聽聽add hash
end

puts match_first({ :mama => 'ruby', :papa => 'C' }).inspect
puts match_all({:foo => String}).inspect

thanks for reading this far!
_c

路路路

--
Posted via http://www.ruby-forum.com/.

ok here's the scoop. i have a collection of hashes all with an arbitrary
number of symbolic keys pointing to arbitrary objects. i have a template
hash also containing symbol keys pointing to 'matching objects', i.e.
objects to be matched with the objects in the aforementioned collection,
and i want to pull out the first match. an example:

# please don't psychoanalyze the below :slight_smile:

You are crazy. Please contact a doctor ASAP. Oh, sorry. :slight_smile:

collection = [
  { :foo => 'python', :bar => 'php', :x => 69 },
  { :foo => 'scheme', :mama => 'ruby', :papa => 'C' },
  { :x => 386 },
  { :mama => 'ruby', :papa => 'simula', :lil_bro => 'fortran' }
]

template1 = { :mama => 'ruby', :papa => 'C' }
template2 = { :x => Numeric }

match_first(template1, collection)
# should return: { :foo => 'scheme', :mama => 'ruby', :papa => 'C' }
match_all(template2, collection)
# should return: [{:x => 386 }, {:foo => 'python', :bar => 'php', :x =>
69}]

so obviously === is being used internally as is evidenced by Numeric in
template2, but what kind of datastructures/algorithms should i be
looking at for the most efficient (time-wise) implementation, keeping in
mind, there may be many hashes in the collection and those hashes are
generally not very large, i.e. contain few key/value pairs. i'm not too
savy on the taxonomy of algorithms so any pointers to what i should be
looking at to read up on would be appreciated.

my potentially naive solution is something like the following (in full
color http://pastie.caboo.se/130033\):

<snip/>

thanks for reading this far!

I think you are on a pretty good way. Here's what I would do differently:

1. Do no recreate all those candidate collections for every
match_first call. Instead, stuff everything into a class and provide a
method that will create an /index/ of the data (see 2).

2. I would use Hashes to store your indexes, i.e. instead of creating
candidates all over again, create them once (see 1) and keep them
there.

Kind regards

robert

路路路

2007/12/18, Christophe Mckeon <chromatophore@gmail.com>:

--
use.inject do |as, often| as.you_can - without end

Does it please you to believe that I please don't psychoanalyze the below?

路路路

On Tue, 18 Dec 2007 10:54:54 -0500, Christophe Mckeon wrote:

# please don't psychoanalyze the below :slight_smile:

--
Jay Levitt |
Boston, MA | My character doesn't like it when they
Faster: jay at jay dot fm | cry or shout or hit.
http://www.jay.fm | - Kristoffer

You are crazy. Please contact a doctor ASAP. Oh, sorry. :slight_smile:

:slight_smile:

I think you are on a pretty good way. Here's what I would do
differently:

...

thanks robert. yeah i'd like to do some caching, but as hashes are
constantly
added and removed from the collection, i'll have to weigh up the
overhead of managing a cache, maybe even implement it and do some
profiling w/ and w/o. guess i'll have to read up on cache algs.

regards,
_c

路路路

--
Posted via http://www.ruby-forum.com/\.

Interesting problem. Aside from seeing a doctor, my suggestions are in
the comments in the revised variant below:

路路路

---

require 'set'

# encapsulate
class DB

  # use hash's default syntax
  def initialize
    @c = Hash.new{ | hash, key | hash[key] = Set.new }
  end

  # was: add, use << to follow ruby idiom for array, set
  # store val with hash for quicker comparison
  def <<(hash)
    hash.each do |key,val|
      @c[key] << [ val, hash ]
    end
  end

  # need a delete operation ...
  def delete(hash)
    @c.delete(hash)
  end

  # was: match_all, use select to follow ruby enumerable idiom
  # determine candidates on the fly, avoid intersection code using Set
  def select(template)
    good = Set.new; bad = Set.new
    template.each do |key,val1|
      (@c[key]).each do |val2,hash|
        ( val1 === val2 ? good : bad ) << hash unless bad.include?
hash
      end
    end
    return good.to_a
  end

  # was: match_first, use select to follow ruby enumerable idiom
  # determine candidates on the fly, avoid intersection code using Set
  def find(template)
    good = Set.new; bad = Set.new
    template.each do |key,val1|
      (@c[key]).each do |val2,hash|
        ( val1 === val2 ? good : bad ) << hash unless bad.include?
hash
      end
      return good.to_a.first unless good.empty?
    end
    return nil
  end
end

---

HTH!

Dan Yoder
http://dev.zeraweb.com/

On Dec 18, 9:45 am, Christophe Mckeon <chromatoph...@gmail.com> wrote:

> You are crazy. Please contact a doctor ASAP. Oh, sorry. :slight_smile:

:slight_smile:

> I think you are on a pretty good way. Here's what I would do
> differently:

> ...

thanks robert. yeah i'd like to do some caching, but as hashes are
constantly
added and removed from the collection, i'll have to weigh up the
overhead of managing a cache, maybe even implement it and do some
profiling w/ and w/o. guess i'll have to read up on cache algs.

regards,
_c

--
Posted viahttp://www.ruby-forum.com/.

cruiserdan wrote:

class DB
...
end

thanks for that, i benchmarked it and it is many times faster. now i
have to see how i'm going to do caching etc..

_c

路路路

--
Posted via http://www.ruby-forum.com/\.

Christophe Mckeon wrote:

cruiserdan wrote:

class DB
...
end

thanks for that, i benchmarked it and it is many times faster. now i
have to see how i'm going to do caching etc..

_c

oops, i read my benchmarks wrong, it is actually not *that* much faster

yours: 9.020000 0.940000 9.960000 ( 11.283543)
mine: 9.220000 1.380000 10.600000 ( 11.127394)

but is an improvement, so thanks. i'll do some more comprehensive
benchmarks
and post them here in a bit.

路路路

--
Posted via http://www.ruby-forum.com/\.

there's a bug in yours. not sure where yet. try

d = DB.new
d << {:a => 2, :b => 1, :c => 'b'}
puts d.find({ :a => Numeric, :b => 'a' }).inspect

returns {:a => 2, :b => 1, :c => 'b'}

路路路

--
Posted via http://www.ruby-forum.com/.

Okay, here's a fixed version.

There were 2 problems. The first was that I forgot to remove the bad
(unmatched) hashes before returning a result. The second was that you
have to go through every key in the template to find a match (I was
returning after the first match). Thus find is no more efficient than
select due to the way the data is stored.

I also took out the 'unless bad.include? hash' clause from the compare
because I think it actually is slower that way, depending on what you
are comparing. I thought it would be faster than the original version
posted because I don't have to generate a set of candidates first, so
the initial benchmarks were disappointing. But maybe this version will
do even better. I look forward to seeing your results.

路路路

---

  # new version of select
  def select(template)
    good = Set.new; bad = Set.new
    template.each do |key,val1|
      (@c[key]).each do |val2,hash|
        ( val1 === val2 ? good : bad ) << hash # unless bad.include?
hash
      end
    end
    return (good - bad).to_a
  end

  # new version of find
  def find(template)
    r = select template
    r.first unless r.empty?
  end

---

Regards,
Dan

On Dec 19, 2:23 pm, Christophe Mckeon <chromatoph...@gmail.com> wrote:

there's a bug in yours. not sure where yet. try

d = DB.new
d << {:a => 2, :b => 1, :c => 'b'}
puts d.find({ :a => Numeric, :b => 'a' }).inspect

returns {:a => 2, :b => 1, :c => 'b'}

--
Posted viahttp://www.ruby-forum.com/.