Memory-efficient set of Fixnums

Hi,

I have to:
1) generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy
2) iterate through them
3) quickly search for the presence of a given Fixnum in the database

The Set class fulfills the speed conditions and conveniently handles
redundancy itself, but its uses up too much memory. It looks like each
entry uses up around a 100 bytes, even if I only put 4 bytes in there.
Array#include? is too slow without solving the memory problem.
Representing each Fixnum as 4 bytes in a huge String doesn't use up much
memory at all and String#include? is fast enough, but I can't tell it to
only search by 4 bytes increments.

Could someone help me with a solution for this problem? Thank you in
advance.

···

--
Posted via http://www.ruby-forum.com/.

Did you try a Hash?

Thanks, but Hash has an even bigger memory imprint, so that won't do
either.

···

--
Posted via http://www.ruby-forum.com/.

You could use a BigNum as bitset. But that might be slow. I once
created a BitSet class which uses a String internally for storage.
IIRC that was quite efficient and fast. I put up an illustrating
example here

Kind regards

robert

···

On Wed, May 23, 2012 at 12:22 PM, George Dupre <lists@ruby-forum.com> wrote:

I have to:
1) generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy
2) iterate through them
3) quickly search for the presence of a given Fixnum in the database

The Set class fulfills the speed conditions and conveniently handles
redundancy itself, but its uses up too much memory. It looks like each
entry uses up around a 100 bytes, even if I only put 4 bytes in there.
Array#include? is too slow without solving the memory problem.
Representing each Fixnum as 4 bytes in a huge String doesn't use up much
memory at all and String#include? is fast enough, but I can't tell it to
only search by 4 bytes increments.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

George Dupre wrote in post #1061810:

1) generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy
2) iterate through them
3) quickly search for the presence of a given Fixnum in the database

Hi,

For memory efficiency, if you are on Windows, you can use RTensor with
type of "unsigned". If not, you can use other memory-efficient
structure for 4-byte data (such as NArray, although its "int" is 4-byte
signed integer).

For quick search, as has been pointed to, you have to store the data in
sorted order, so that you can run binary search.

If your data are static, it is fine. If your data are dynamic (often
insert and remove), then we still need to find another data structure
(such as C++ STL Set).

Regards,

Bill

···

--
Posted via http://www.ruby-forum.com/\.

Can you afford a fixed 512MB of RAM? Then you can just have a 2^32 entry
bitmap (2^29 bytes)

Iteration might be fast enough, you just need to search for bytes which
are not \x00

···

--
Posted via http://www.ruby-forum.com/.

Thank you for your help,

Sadly, it's dynamic data. Basically, I start with half a million
integers, then I generate from this a bigger set of integers (using
millions of calls to the include? method in the process), that I must
add to the initial set while avoiding redundancy. Then rinse and repeat
until around 20 million.

I tried using a long string encoded in UTF-32BE, but the include? method
is much too slow. These numbers are out of range for NArray, and this
has to run on Linux so I can't use RTensor either. A 512 MB Bitmap could
work. I will start by trying Robert Klemme's BitSet. Thank you very
much.

Regards,

George

···

--
Posted via http://www.ruby-forum.com/.

Alright, using Robert Klemme's BitSet class, Brian Candler's suggestion
for the full database mixed with Sets for intermediate steps did the
trick. It may be a bit too slow, but nothing unreasonable, and some time
can realistically be saved here and there.

Thank you very much for your valuable help.

Best regards,

George

···

--
Posted via http://www.ruby-forum.com/.

I have to:
1) generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy

google_hash gem may help.

This seems like one of those %w[small fast good].pick(2) problems. Also... smells like homework.

But if not, it reminds me of this article:

and their implementation:

it's java, so you could use it directly via jruby... or you can use unix IO via the included shell scripts

···

On May 23, 2012, at 03:22 , George Dupre wrote:

I have to:
1) generate a database of a couple dozens of millions of Fixnums
(ranging from 0 to 2^32 - 1), while avoiding redundancy
2) iterate through them
3) quickly search for the presence of a given Fixnum in the database

The Set class fulfills the speed conditions and conveniently handles
redundancy itself, but its uses up too much memory. It looks like each
entry uses up around a 100 bytes, even if I only put 4 bytes in there.
Array#include? is too slow without solving the memory problem.
Representing each Fixnum as 4 bytes in a huge String doesn't use up much
memory at all and String#include? is fast enough, but I can't tell it to
only search by 4 bytes increments.

Could someone help me with a solution for this problem? Thank you in
advance.

Hi,

Sorry, I should have given more details. While I am doing this is an
academic setting, it isn't homework in the traditional sense. This is
part of my "personal initiative project", reaching outside of my field
of studies (math and physics) but with outside help allowed. It revolves
around a Korean 4x4 board game, which I am currently trying to solve
through brute force, hence the ridiculous requirements. Each possible
position of the board is matched to an integer. An error of zero is
imperative, but using 512 MB of RAM isn't an issue at all: I only need
to run that calculation once, in a reasonable time. I will also look
into this google_hash gem, it should also be useful.

Thank you once again for your help.

Best regards,

George

···

--
Posted via http://www.ruby-forum.com/.

I advanced the implementation a bit

Thank you very much, this is incredibly useful!

I think in Ruby speed is not the primary objective. So if you need the

code to run faster (by about the same order of magnitude), probably you
can transform the critical part into C.
I don't know much C at all, but I did use RubyInline for a few critical,
but very simple methods. With that and Robert Klemme's BitSet, the
program is efficient enough, now I just have to let it run for a while.

Thank you once again for your help!

···

--
Posted via http://www.ruby-forum.com/\.

Well, how about using binary search on a sorted array? That's the simplest way.

···

2012/5/23, George Dupre <lists@ruby-forum.com>:

Thanks, but Hash has an even bigger memory imprint, so that won't do
either.

--
Posted via http://www.ruby-forum.com/\.

--
Wysłane z mojego urządzenia przenośnego

-- Matma Rex

Set uses a Hash internally. It's basically just a different interface to a Hash.

-Justin

···

On 05/23/2012 06:17 AM, Chris Hulan wrote:

Did you try a Hash?

George Dupre wrote in post #1061849:

Sadly, it's dynamic data. Basically, I start with half a million
integers, then I generate from this a bigger set of integers (using
millions of calls to the include? method in the process), that I must
add to the initial set while avoiding redundancy. Then rinse and repeat
until around 20 million.

Hi,

Based on your parameter description, I think the bitmap suggested by
Brian Candler is the best choice.

Regards,

Bill

···

--
Posted via http://www.ruby-forum.com/\.

Ryan Davis wrote in post #1061877:

Also... smells like homework.

But if not, it reminds me of this article:

Hi,

Whether it is a homework or not, it is a very realistic problem that we
may encounter in our daily programming.

The article is about using probabilistic algorithms with some level of
error. I think we all assumed that the original poster wants an error
of zero. (If the problem could not be solved using several gigabytes of
RAM, or if the computation time took too long, then probabilistic
algorithms ought to be considered.)

Regards,

Bill

···

--
Posted via http://www.ruby-forum.com/\.

George Dupre wrote in post #1062009:

I will also look into this google_hash gem, it should also be useful.

Hi,

I think in Ruby speed is not the primary objective. So if you need the
code to run faster (by about the same order of magnitude), probably you
can transform the critical part into C.

Regards,

Bill

···

--
Posted via http://www.ruby-forum.com/\.

Hi George,

Sorry, I should have given more details. While I am doing this is an
academic setting, it isn't homework in the traditional sense. This is
part of my "personal initiative project", reaching outside of my field
of studies (math and physics) but with outside help allowed. It revolves
around a Korean 4x4 board game, which I am currently trying to solve
through brute force, hence the ridiculous requirements. Each possible
position of the board is matched to an integer. An error of zero is
imperative, but using 512 MB of RAM isn't an issue at all: I only need
to run that calculation once, in a reasonable time. I will also look
into this google_hash gem, it should also be useful.

Thank you once again for your help.

I advanced the implementation a bit
- now it doesn't choke as easily because of the single large String
- there are #clear and #empty? methods
- resetting is more efficient
- there is Enumerable functionality

Have fun!

Kind regards

robert

···

On Thu, May 24, 2012 at 7:34 PM, George Dupre <lists@ruby-forum.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

How about an array, where the the generated numbers are the indexes?
Can store a flag value, and any not set value defaults to nil
May be less memory than Hash?

Just for the record: that is the same approach as I suggested earlier.

Cheers

robert

···

On Wed, May 23, 2012 at 6:44 PM, Admin Tensor <lists@ruby-forum.com> wrote:

Based on your parameter description, I think the bitmap suggested by
Brian Candler is the best choice.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/