the keys were around 32 bytes, the values around 256. the format was
custom: the key was null terminated and the list was null terminated and
comma separated (commas were not allowed in list elements). the updates
were very infrequent, and i ended up writing a ruby script that read a
text file representing the data, turned it into c programs which declared
the data structure in sorted order, and compiled this into a shared
library that included a custom search function. it was kind of an object
oriented approach in that the data and functions lived together, and
programs took advantage of the fact that shared libs are mmap'd into
memory to read no more than they need to find any element. i actually
found my weekly report from that week and i included it below...
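first though, roughly what the generated c looked like -- a from-memory
sketch with made-up keys and names, not the original code:

  /* generated-code sketch: the ruby script emitted a table like this,
     already sorted by key, and it was compiled into a shared library.
     keys and values here are made up for illustration. */
  #include <stdlib.h>
  #include <string.h>

  struct entry { const char *key; const char *list; };

  static const struct entry table[] = {
    { "alpha",   "one,two,three" },
    { "bravo",   "four" },
    { "charlie", "five,six" },
  };

  static int cmp(const void *k, const void *e)
  {
    return strcmp((const char *) k, ((const struct entry *) e)->key);
  }

  /* exported search function: the lib is mmap'd in, so a lookup only
     touches the handful of pages bsearch actually visits */
  const char *lookup(const char *key)
  {
    const struct entry *e = bsearch(key, table,
                                    sizeof table / sizeof table[0],
                                    sizeof table[0], cmp);
    return e ? e->list : NULL;
  }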
cheers.
On Wed, 7 Sep 2005, Pit Capitain wrote:
Ara.T.Howard wrote:
on a related note, i did some research into lookups in c using hashes
versus sorted arrays/bsearch; much to my surprise (as a computer
scientist) i found that, with the exception of HUGE tables (millions of
entries), lookup by bsearch was an order of magnitude faster than any
hashing mechanism i could find. my test looked at cdb, hsearch, glib
hashing functions, gperf and, for bsearch, the c library bsearch.
profiling the different programs showed that the reason the bsearch was
faster was the lack of function calls, and this is easy to imagine -
flipping a pointer around memory with only one stack frame is about as
lightweight as one can get... anyhow, my tests were very specific to my
application but interesting nonetheless - thought you'd be interested.
Interesting. What types of keys and values did you use?
--
email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================
29.07.2002
______________________________________________________________________________
<snip>
Task 567 35 hrs - Metadata
* spent a few days running an **exhaustive** analysis of several search
methodologies for libmm
this all started because i was amazed at the speed at which my search ran
brief synopsis of tested methods: all methods did 65536 lookups of every
icao name, averaging the results -- storage formats varied with method,
but all mapped an icao name to its position in a table (int value)
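the harness for each method looked roughly like this (my reconstruction,
not the original test code; the stub lookup and the keys are placeholders
for whichever method was being timed):

  /* timing-harness sketch: run 65536 lookups, cycling through the key
     set, and average over several runs.  lookup_idx() is a stub here --
     in the real tests it was the method under measurement. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/time.h>

  #define NLOOKUPS 65536
  #define NRUNS    8

  static const char *keys[] = { "KDEN", "KLAX", "KJFK" };  /* stand-ins */
  #define NKEYS (sizeof keys / sizeof keys[0])

  static int lookup_idx(const char *name)      /* trivial stub */
  {
    for (unsigned i = 0; i < NKEYS; i++)
      if (strcmp(keys[i], name) == 0) return (int) i;
    return -1;
  }

  int main(void)
  {
    double total = 0.0;
    for (int r = 0; r < NRUNS; r++) {
      struct timeval t0, t1;
      gettimeofday(&t0, NULL);
      for (int i = 0; i < NLOOKUPS; i++)
        lookup_idx(keys[i % NKEYS]);
      gettimeofday(&t1, NULL);
      total += (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }
    printf("AVG TIME FOR %d LOOKUPS\n%.9fs\n", NLOOKUPS, total / NRUNS);
    return 0;
  }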
#####################################################################
# METHOD : sqlite
#####################################################################
-------------------------------------------------------------------
AVG TIME FOR 65536 LOOKUPS
32.623693s
-------------------------------------------------------------------
#####################################################################
# METHOD : berkeley db (latest version), hash db type
#####################################################################
-------------------------------------------------------------------
AVG TIME FOR 65536 LOOKUPS
1.260673268s
-------------------------------------------------------------------
#####################################################################
# METHOD : gperf function encoded into shareable library
#####################################################################
-------------------------------------------------------------------
AVG TIME FOR 65536 LOOKUPS
0.1376355546s
-------------------------------------------------------------------
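for reference, the gperf method starts from a keyword file along these
lines -- the keys here are made-up examples, not the real icao set --
and running gperf -t keys.gperf > lookup.c emits a perfect hash() plus
an in_word_set() lookup function that gets compiled into the shared lib:

  %{
  /* sketch of a gperf input file -- keys/values are made up.
     gperf generates hash() and in_word_set() from this; the lookup
     is then in_word_set(key, strlen(key)). */
  #include <string.h>
  %}
  %struct-type
  struct icao_entry { const char *name; int idx; };
  %%
  KDEN, 0
  KLAX, 1
  KJFK, 2

the three function calls mentioned later in this report are presumably
in_word_set, hash, and the final strcmp.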
#####################################################################
# METHOD : libmm roll-my-own binary search
#####################################################################
-------------------------------------------------------------------
AVG TIME FOR 65536 LOOKUPS
0.01562877542s
-------------------------------------------------------------------
i also tested a hashtab c library which i plan on using to accomplish
lookups when the db (shared lib) is not available and the cfg file itself
needs to be parsed -- a sketch of that fallback follows the tables below.
note that these last tests do not include the time to populate the db;
all above cases are working with pre-populated dbs...
#####################################################################
# METHOD : lib hashtab
#####################################################################
-------------------------------------------------------------------
AVG TIME FOR 65536 LOOKUPS
0.03334630323s
-------------------------------------------------------------------
#####################################################################
# METHOD : a similar test using stl std::map<std::string, int>
#####################################################################
-------------------------------------------------------------------
AVG TIME FOR 65536 LOOKUPS
0.242236081s
-------------------------------------------------------------------
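the fallback path looks roughly like this -- i'm showing posix hsearch
as a stand-in for the hashtab library, and assuming a simple "name index"
line format for the cfg file; both are illustrative, not the real code:

  /* fallback sketch: when the shared lib isn't available, parse the
     cfg at runtime into a hash table.  posix hsearch stands in for
     the hashtab library, and the "name index" line format for the
     cfg is an assumption -- both are illustrative only. */
  #include <search.h>
  #include <stdio.h>
  #include <string.h>

  static int load_cfg(const char *path, size_t nel)
  {
    char line[512];
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!hcreate(nel)) { fclose(f); return -1; }
    while (fgets(line, sizeof line, f)) {
      char name[64];
      int idx;
      if (sscanf(line, "%63s %d", name, &idx) == 2) {
        ENTRY e = { strdup(name), (void *) (long) idx };
        hsearch(e, ENTER);
      }
    }
    fclose(f);
    return 0;
  }

  static int lookup_idx(const char *name)
  {
    ENTRY probe = { (char *) name, NULL }, *e = hsearch(probe, FIND);
    return e ? (int) (long) e->data : -1;
  }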
there's a lot i'm not mentioning here, like ease of use, maintenance
time, how each storage format could be updated and synchronized with
remote nodes, scalability, available language apis, ease of
understanding, reliance on outside packages, etc -- all of which i've
also been factoring into the design choices.
however, the one real conclusion i came to was that not only does my
search facility provide searches for lt, gt, le, ge, eq, floor, and
ceiling, it does so at very good performance (best log(n), worst
log(n)) -- i was genuinely surprised that the gperf-shared-lib method was
not faster, that the speed differential (an order of magnitude) remained
constant up to searches over large tables (rows numbering 65536 -- i did
no testing above this number), and that it was even a little worse for
small (like grib) tables! it must be the case that the time for hash
based lookups (best 1, worst n) under these test conditions does not
outperform my simple binary search over a sorted array of index structs
stored in a shared library:
struct index { char *key; int idx; };
or simply that the function call overhead (three calls for gperf, one for
mm_search) is responsible. in any case, i can either use yet another
third party piece of software to generate a search function in the
modules, or create a simple table of two element structs (done by cfg2c
at compile time -- not runtime) and an associated search function
(mm_search). both methods result in a virtual-memory-resident searchable
embedded database (no page pool, the os is responsible for paging --
e.g. the db should fit into real memory for good performance) shared by
all processes on the same system; gperf slows down the search by about a
factor of ten and slightly simplifies code generation -- mm_search is
super fast, is only slightly more complicated to code (but only because
gperf generates the hashing function automatically!!), and adds no
additional requirements on the system.
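the floor/ceiling style searches fall out of the same binary search by
remembering the last boundary crossed -- a sketch of the idea, not the
actual mm_search code:

  /* sketch of a floor-style binary search over the sorted index
     table -- the idea behind mm_search, not the code itself.
     an exact hit doubles as eq; the mirror-image loop gives
     ceiling, and +/-1 on the result gives gt/lt. */
  #include <string.h>

  struct index { char *key; int idx; };

  /* largest slot whose key is <= the probe, or -1 if none */
  int mm_floor(const struct index *tbl, int n, const char *key)
  {
    int lo = 0, hi = n - 1, best = -1;
    while (lo <= hi) {
      int mid = lo + (hi - lo) / 2;
      if (strcmp(tbl[mid].key, key) <= 0) {
        best = mid;          /* candidate; look for a larger one */
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return best;             /* one stack frame, no callbacks */
  }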
it seems like my first hack at this problem (searching libmm tables) is
the fastest and easiest overall... this was *really* a surprise -- like
having something compile the first time... something must be wrong
</snip>