Binary Tree vs. Hash

“Dave Thomas” dave@pragprog.com wrote in message
news:3ED24D67.9060300@pragprog.com

MikkelFJ wrote:

Perhaps this is an excercise for a new kata?

Or for the original one… :slight_smile:

http://pragprog.com/pragdave/Practices/CodeKata

This is why you should hang out in comp.lang.ruby, never mind the language,
but I’ve yet to find a place where references to interesting technology
spreads faster.

I’d have to get back on the bit counting thing, guess I must read the
article.
Are you aware that there are efficient boolean operations for counting bits.
They are anything but intuitive, but someone figured out the instruction
sequences and listed them on the internet somewhere that has slipped me
momentarily. (Something like cryptic like (0xaeaeaeae and myword) or
0x121212121 …).

Mikkel

Dave Thomas wrote:

A Bloom filter reaches optimal performance when there an equal
number of odd and even bits set (each probe into the bitmap
therefore yields maximum selectivity)
Actually, it’s worse that that.

You’re right, I forgot to divide by two (for half the bits set).
The discovery that half the bits should be set seems obvious now,
but it was a major “aha!” moment for me in 1985…

Interestingly, the same tables suggest that for a 200k hash, m/n will be
roughly 18, so k optimizes out at 12 or 13 hashes

I only had a 45,000 word list though, so k was higher.
The 0.6185 heuristic assumes a perfectly random hash.
I tested many different hash algs to find the cheapest
that was close to random.

Clifford.

That’s not very good though; I mean, just encoding each letter using 5 bits,
I reckon a 90,000 word dictionary might take under 600K [*]. And then you
have none of that “dangerous messing around with improbabilities” (as I
think Douglas Adams wrote). Admittedly it wouldn’t be quick to search though.

Regards,

Brian.

[*]
$ wc /usr/share/dict/words
235881 235881 2493066 /usr/share/dict/words

So I’m guessing a 90,000 word dictionary would have
90000/235881 * 2493066 bytes
Encoding each character and newline as a 5-bit pattern gives 5/8ths of that,
or 594516 bytes, or about 580K.

···

On Tue, May 27, 2003 at 12:16:04PM +0900, Dave Thomas wrote:

Interestingly, the same tables suggest that for a 200k hash, m/n will be
roughly 18, so k optimizes out at 12 or 13 hashes, for an error rate of
0.000176, or 1/5681, which probably still isn’t good enough for
real-world use. It looks like about 300k is probably a good starting
point.

it does not - but it certainly can be. i’ve often wondered why Array does not
carry an internal @sorted flag - which could be used as a hint for certain
operations - like Array#include?

-a

···

On Mon, 26 May 2003, Brian Candler wrote:

It doesn’t, because Array doesn’t have any requirement to be sorted.

Ara Howard
NOAA Forecast Systems Laboratory
Information and Technology Services
Data Systems Group
R/FST 325 Broadway
Boulder, CO 80305-3328
Email: ara.t.howard@fsl.noaa.gov
Phone: 303-497-7238
Fax: 303-497-7259
~ > ruby -e ‘p % ^) .intern’
====================================

Mon, 26 May 2003 18:54:24 +0900, Michael Neumann wrote:
[snip Array lookup problem]

ruby -r complex -e ‘c,m,w,h=Complex(-0.75,0.136),50,150,100;puts “P6\n#{w}
#{h}\n255”;(0…h).each{|j|(0…w).each{|i|
n,z=0,Complex(.9i/w,.9j/h);while n<=m&&(z-c).abs<=2;z=zz+c;n+=1 end;print
[10+n
15,0,rand99].pack("C")}}’|display

Now that is a nice hack. And you’ve just reminded me about a bunch of
unwritten fractal-related small apps I planned to recode in Ruby. Thank you
:->

···


A moose once bit my sister

Hi,

You could try ‘GC.start’ to do garbage collection, but I don’t know if Ruby
then un-mallocs the storage and returns it to the O/S.

I did tried GC.start, but it had ABSOLUTELY no effect. if I put the
program in background for a while (at least 30 min, for example), it
then shrinked memory usage automatically. I am using windows 2000, and I
have a fairly big memory (320M). I wonder if I have more memory, the
Ruby GC get lazier? :slight_smile:

Thanks.
Shannon

···

On Mon, 26 May 2003 22:32:20 +0900 Mauricio Fernández batsman.geo@yahoo.com wrote:

“ahoward” ahoward@fsl.noaa.gov schrieb im Newsbeitrag
news:Pine.LNX.4.53.0305271414370.28367@eli.fsl.noaa.gov

It doesn’t, because Array doesn’t have any requirement to be sorted.

it does not - but it certainly can be. i’ve often wondered why Array
does not
carry an internal @sorted flag - which could be used as a hint for
certain
operations - like Array#include?

IMHO that’s overly complicating Array. An Array is simply a sequence of
items. Period.

To have a sorted array I’d have either

  • a sub class of Array (like SortedArray) that overrides modifying
    operations to ensure this invariant (or do it dependend on the flag you
    mentioned) or

  • a SortedArray delegating all non modifying calls to an array and
    overriding modifying methods.

I think this is the more component oriented approach.

Regards

robert
···

On Mon, 26 May 2003, Brian Candler wrote:

“MikkelFJ” mikkelfj-anti-spam@bigfoot.com wrote in message
news:3ed255d7$0$97199$edfadb0f@dread12.news.tele.dk…

Perhaps this is an excercise for a new kata?

Or for the original one… :slight_smile:

http://pragprog.com/pragdave/Practices/CodeKata

Are you aware that there are efficient boolean operations for counting
bits.
They are anything but intuitive, but someone figured out the instruction
sequences and listed them on the internet somewhere that has slipped me
momentarily. (Something like cryptic like (0xaeaeaeae and myword) or
0x121212121 …).

Here’s some source for counting bits. I picked it from an ideal hash table.

// from http://www.jjj.de/bitwizardry/files/bitcount.h

static inline ulong bit_count(ulong x)
{
#if BITS_PER_LONG == 32
x -= (x>>1) & 0x55555555;
x = ((x>>2) & 0x33333333) + (x & 0x33333333);
x = ((x>>4) + x) & 0x0f0f0f0f;
x *= 0x01010101;
return x>>24;
#endif

#if BITS_PER_LONG == 64
x = ((x>>1) & 0x5555555555555555) + (x & 0x5555555555555555); // 0-2 in
2 bits
x = ((x>>2) & 0x3333333333333333) + (x & 0x3333333333333333); // 0-4 in
4 bits
x = ((x>>4) + x) & 0x0f0f0f0f0f0f0f0f; // 0-8 in
4 bits
x += x>> 8; // 0-16
in 8 bits
x += x>>16; // 0-32
in 8 bits
x += x>>32; // 0-64
in 8 bits
return x & 0xff;
#endif
}

Mikkel

You can find the expanded, more verbose source code here:

http://www.ntecs.de/blog/Tech/ComputerScience/JuliaSets.html

Regards,

Michael

···

On Mon, Jun 02, 2003 at 09:22:07AM +0900, Wojciech Kaczmarek wrote:

Mon, 26 May 2003 18:54:24 +0900, Michael Neumann wrote:
[snip Array lookup problem]

ruby -r complex -e ‘c,m,w,h=Complex(-0.75,0.136),50,150,100;puts “P6\n#{w}
#{h}\n255”;(0…h).each{|j|(0…w).each{|i|
n,z=0,Complex(.9i/w,.9j/h);while n<=m&&(z-c).abs<=2;z=zz+c;n+=1 end;print
[10+n
15,0,rand99].pack("C")}}’|display

Now that is a nice hack. And you’ve just reminded me about a bunch of
unwritten fractal-related small apps I planned to recode in Ruby. Thank you
:->

IMHO that’s overly complicating Array. An Array is simply a sequence of
items. Period.

on paper - sure. but in reality we have:

new & * + – << <=> == === = | assoc at clear
collect collect! compact compact! concat delete delete_at
delete_if each each_index empty? eql? fill first flatten
flatten! include? index indexes indices join last length map!
nitems pack pop push rassoc reject! replace reverse reverse!
reverse_each rindex shift size slice slice! sort sort! to_a
to_ary to_s uniq uniq! unshift

mixins
Enumerable:
collect, detect, each_with_index, entries,
find, find_all, grep, include?, map, max,
member?, min, reject, select, sort, to_a

much more than a sequence of items.

what would be the issue if Enumerable was optimized for containers, like
Array, which could support random access and binary searches? the same
methods would exist - they would simply be faster. admittedly this is less
‘general’ but optimization for such a heavily used container as the Array
makes sense IHMO, espcially when it could be done in such a way to be
completely backward compatible while giving a speed increase to practially
every ruby program ever written…

-a

To have a sorted array I’d have either

  • a sub class of Array (like SortedArray) that overrides modifying
    operations to ensure this invariant (or do it dependend on the flag you
    mentioned) or

  • a SortedArray delegating all non modifying calls to an array and
    overriding modifying methods.

I think this is the more component oriented approach.

Regards

robert

-a

···

On Tue, 27 May 2003, Robert Klemme wrote:

====================================

Ara Howard
NOAA Forecast Systems Laboratory
Information and Technology Services
Data Systems Group
R/FST 325 Broadway
Boulder, CO 80305-3328
Email: ara.t.howard@fsl.noaa.gov
Phone: 303-497-7238
Fax: 303-497-7259
~ > ruby -e ‘p % ^) .intern’
====================================

Are you aware that there are efficient boolean operations for counting

Um, yes - I reference and use them in the article…

Dave

“ahoward” ahoward@fsl.noaa.gov schrieb im Newsbeitrag
news:Pine.LNX.4.53.0305271517520.28367@eli.fsl.noaa.gov

IMHO that’s overly complicating Array. An Array is simply a sequence
of
items. Period.

on paper - sure. but in reality we have:

new & * + – << <=> == === = | assoc at clear
collect collect! compact compact! concat delete delete_at
delete_if each each_index empty? eql? fill first flatten
flatten! include? index indexes indices join last length map!
nitems pack pop push rassoc reject! replace reverse reverse!
reverse_each rindex shift size slice slice! sort sort! to_a
to_ary to_s uniq uniq! unshift

mixins
Enumerable:
collect, detect, each_with_index, entries,
find, find_all, grep, include?, map, max,
member?, min, reject, select, sort, to_a

much more than a sequence of items.

Well, yes. But all these actions don’t introduce a new class invariant.
Note, that methods sort and sort! must be invoked explicitely. Under the
hood it remains a “sequence of things” and it is not a “sorted sequence of
things”. This is a major difference although it might not look that way
on first glance. If you consider, how all modifying methods must change
you’ll might easier see the impact of this difference. Having all methods
of this kind check a flag is commonly considered bad practice. I’d favour
an extra type, i.e. sub class or proxy that delegates.

Apart from that, since instances of a class are not comparable by default
(i.e. implement <=>) the SortedArray either would have to ensure that only
comparable elements go into the container or make up an artifical
comparison. This checks would have to be employed, if the sorted flag
changes from false to true. Then what do you do? Kick those out that
don’t match? If you make the flag read only, i.e. only set during
construction, you are better off having a different type for this.

IMHO a flag like you suggested is not sufficient: maybe you want to apply
different sorting criteria for the same type. Java solved this elegantly
by defining an interface Comparator that performs the comparison
operation. You can then create a sorted collection either with default
sorting (uses Comparable, same in ruby) or you implement a comparator on
your own. This is the most modular, flexible and clean solution.

what would be the issue if Enumerable was optimized for containers, like
Array, which could support random access and binary searches? the same
methods would exist - they would simply be faster. admittedly this is
less
‘general’ but optimization for such a heavily used container as the
Array
makes sense IHMO, espcially when it could be done in such a way to be
completely backward compatible while giving a speed increase to
practially
every ruby program ever written…

I have several objections:

  • The speed increase would not be there automatically, but one would have
    to explicitely set the sorting flag; so there’s no automatic performance
    increase.

  • You can’t use sorting by default because a lot of applications rely on
    the data not being sorted behind the scenes and

  • there’s a performance penalty to pay for sorting, which you clearly
    don’t want if you don’t need sorted data.

  • You suggested optimization is the other way round: Enumerable is the
    general implementation just requiring an implementation of each. All
    containers that can do include? and others better than the default
    implementation override those methods by their own implementation.

Sorry, if I’m beeing a bit massive (lots of text), but I think there are
serious reasons to not change Array to support automated internal sorting.
However, introducing such a type is surely not a bad idea. Maybe we
simply adopt the pattern from Java. I think, they did a good design job
on this.

Regards

robert
···

On Tue, 27 May 2003, Robert Klemme wrote:

“Dave Thomas” dave@pragprog.com wrote in message
news:3ED3DFE2.2090205@pragprog.com

Are you aware that there are efficient boolean operations for counting

Um, yes - I reference and use them in the article…

Damn … must read more carefully … I must have skipped way down or
something…??
Thats what you get when you have an attention span of a kitten catching
flies.

BTW: I’ve found the article on finite state machines as tries - low memory
overhead, in particular, the fsm can be updated incrementally and it is fast
to do so:

Incremental Construction of Minimal Acyclic Finite-State Automata

How to squeeze a lexicon
http://sun.iinf.polsl.gliwice.pl/~mciura/lexicon.pdf

http://www.google.com/search?sourceid=navclient&hl=da&ie=UTF-8&oe=UTF-8&q=Da
ciuk+minimal+tries

(Robert Feldt posted a link here at c.l.r. a year back)

[snip discussion of “sorted” flag for arrays]

I can see both sides of this. I think the
original idea has some merit.

As I see it, the @sorted flag would be set
only when a sort was done, and it would be
unset when any operation was done NOT
guaranteeing preservation of sorting.

This would make me want an “insert” operation
that would add an element and preserve the
sorting. (slice already preserves the state,
since it operates only on contiguous items.)

One problem I see is the potential ambiguity
of “sorted” – we can apply any sorting
technique we like to an array, right? And not
all of these are conducive to a speedup unless
we do an internal Schwartzian transform.

An overall better approach might be a SortedArray
inheriting from Array – it could even have its own
definitions of << and so on that would preserve
sorting. And the method of sorting would (could/
should/might?) be fixed on an instance basis.

Just my $0.02,
Hal

···

----- Original Message -----
From: “Robert Klemme” bob.news@gmx.net
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Tuesday, May 27, 2003 11:49 AM
Subject: Re: Binary Tree vs. Hash

And presumably if you insert into a SortedArray then it needs to put the
item in the correct place. So syntax such as

a[3] = "hello"

really doesn’t make any sense. However, a << “hello” would put the entry in
the correct place.

(That’s unless you maintain this flag saying “the array is currently sorted”
which is a bit of a nightmare IMO, for lots of reasons)

So a SortedArray would have considerably different semantics to a normal
Array anyway. Its implementation would probably be different too - perhaps a
heap or a b-tree.

Regards,

Brian.

···

On Wed, May 28, 2003 at 01:49:53AM +0900, Robert Klemme wrote:

Well, yes. But all these actions don’t introduce a new class invariant.
Note, that methods sort and sort! must be invoked explicitely. Under the
hood it remains a “sequence of things” and it is not a “sorted sequence of
things”. This is a major difference although it might not look that way
on first glance. If you consider, how all modifying methods must change
you’ll might easier see the impact of this difference. Having all methods
of this kind check a flag is commonly considered bad practice. I’d favour
an extra type, i.e. sub class or proxy that delegates.

We cannot optimize w/ the @sorted flag because Array is used for two
different purposes:

  • poor man’s set (sorting would help here for include? and the like)
  • plain old array (you don’t want sorting as you expect to find one
    element at the index you inserted it)

Unless we create 2 classes specifically for these 2 different semantics,
nothing can be done.

So if anything, the first step would be creating a Set class. Set
objects could be hinted on creation whether to use internally a
sorted array, a hash ( object => true ), etc, depending on the desired
complexity and space/time trade-off.

···

On Wed, May 28, 2003 at 01:49:53AM +0900, Robert Klemme wrote:

“ahoward” ahoward@fsl.noaa.gov schrieb im Newsbeitrag
news:Pine.LNX.4.53.0305271517520.28367@eli.fsl.noaa.gov

On Tue, 27 May 2003, Robert Klemme wrote:

IMHO that’s overly complicating Array. An Array is simply a sequence
of
items. Period.

on paper - sure. but in reality we have:


_ _

__ __ | | ___ _ __ ___ __ _ _ __
'_ \ / | __/ __| '_ _ \ / ` | ’ \
) | (| | |
__ \ | | | | | (| | | | |
.__/ _,
|_|/| || ||_,|| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

  • JHM wonders what Joey did to earn “I’d just like to say, for the record,
    that Joey rules.”
    – Seen on #Debian

I’m not sure that a SortableArray is substitutable for a normal array.
When I use an array, I expect the following …

a[5] = 10
assert_equal 10, a[5]

A sorted array might not even support the = operation, or if it does,
it would have to move the inserted to its correct position (and violate
the expectation above).

If you are considering sorted containers, I suggest a good look at
either the Smalltalk library, or the Eiffel library to see how they
approach the problem. Both are well designed.

···

On Tue, 2003-05-27 at 17:35, Hal E. Fulton wrote:

An overall better approach might be a SortedArray
inheriting from Array – it could even have its own
definitions of << and so on that would preserve
sorting. And the method of sorting would (could/
should/might?) be fixed on an instance basis.


– Jim Weirich jweirich@one.net http://jimweirich.umlcoop.net

“Beware of bugs in the above code; I have only proved it correct,
not tried it.” – Donald Knuth (in a memo to Peter van Emde Boas)

“Hal E. Fulton” hal9000@hypermetrics.com schrieb im Newsbeitrag
news:064901c32486$95db4a20$0300a8c0@austin.rr.com

From: “Robert Klemme” bob.news@gmx.net
Newsgroups: comp.lang.ruby
To: “ruby-talk ML” ruby-talk@ruby-lang.org
Sent: Tuesday, May 27, 2003 11:49 AM
Subject: Re: Binary Tree vs. Hash

[snip discussion of “sorted” flag for arrays]

I can see both sides of this. I think the
original idea has some merit.

As I see it, the @sorted flag would be set
only when a sort was done, and it would be
unset when any operation was done NOT
guaranteeing preservation of sorting.

I had a different understanding, but maybe I was wrong. Must reread the
thread…

One problem I see is the potential ambiguity
of “sorted” – we can apply any sorting
technique we like to an array, right? And not
all of these are conducive to a speedup unless
we do an internal Schwartzian transform.

Yeah, that’s a major point, as I tried to point out. Imagine

foo=
foo << 5 << 2 << -1 << 7
p foo

foo.sort!{|a,b| a<=>b}
p foo
p foo.sorted?

foo.sort!{|a,b| b<=>a}
p foo
p foo.sorted?

printing
[5, 2, -1, 7]
[-1, 2, 5, 7]
true
[7, 5, 2, -1]
true

Anybody reading the sorted flag must know at the same time, which ordering
was applied. So you have to record some of the information outside this
array which IMHO is not a good idea.

An overall better approach might be a SortedArray
inheriting from Array – it could even have its own
definitions of << and so on that would preserve
sorting. And the method of sorting would (could/
should/might?) be fixed on an instance basis.

Exactly, apart the problems with = remain, as others have pointed out.

robert
···

----- Original Message -----

“Mauricio Fernández” batsman.geo@yahoo.com schrieb im Newsbeitrag
news:20030528053926.GA20904@student.ei.uni-stuttgart.de

We cannot optimize w/ the @sorted flag because Array is used for two
different purposes:

  • poor man’s set (sorting would help here for include? and the like)
  • plain old array (you don’t want sorting as you expect to find one
    element at the index you inserted it)

Unless we create 2 classes specifically for these 2 different semantics,
nothing can be done.

Thanks for pointing this out with so few words!

So if anything, the first step would be creating a Set class. Set
objects could be hinted on creation whether to use internally a
sorted array, a hash ( object => true ), etc, depending on the desired
complexity and space/time trade-off.

Yes. Apart from that I’d prefer different classes over a flag at
construction time. It’s more OO and you need several classes for the
implementation anyway.

Cheers

robert

Ruby 1.8p2 has a Set class. There is a ‘sort’ instance method. Some
preliminary documentation can be found at:

There is also a new ‘to_set’ method that set.rb adds to Enumerable.

I am very happy that this has been added and hope to see relations
added as well.

Regards,

Mark

···

On Wednesday, May 28, 2003, at 01:39 AM, Mauricio Fernández wrote:

[snip]

So if anything, the first step would be creating a Set class. Set
objects could be hinted on creation whether to use internally a
sorted array, a hash ( object => true ), etc, depending on the desired
complexity and space/time trade-off.

[snip]