Pre-allocate a large amount of memory?

I've created a small daemon that serves certain data very quickly to the
rest of our web application. It does so by pre-loading a lot of data from
our database into a special structure in memory. Boiled down, the
pre-loading procedure looks like this:

@data_hash = {}
DBI.connect(dsn, user, pass) do |dbh|
  sth = dbh.execute("some-sql-returning >300.000 rows")
  while row = sth.fetch_hash
    hash_key = <calculated-from-row-data>

    @data_hash[hash_key] ||= []
    @data_hash[hash_key] << row
  end
  sth.finish
end

It gives a @data_hash with a structure like this:

@data_hash = {
  'key1' => [row1, row2, row3, ...],
  'key2' => [row4, row5, ...],
  ...
}

I have a problem with the speed of the pre-loading, though. I am loading
~360.000 rows into memory (about 500MB). The first 50% of the rows are
read and placed in the hash pretty quickly, but after that it slows down.

I suspect that Ruby's dynamic memory allocation for the Hash is to
blame. Can I somehow pre-allocate 500MB of memory for the hash? Or tweak
the way Ruby allocates memory?
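To make the question concrete, here is a self-contained toy version of the load (synthetic keys and rows stand in for the DBI fetch and the key calculation), together with one tweak I have been considering - suspending GC during the bulk load and collecting once afterwards. Whether that actually helps here is exactly what I am unsure about:

```ruby
# Toy stand-in for the DBI loop: "key#{i % 1000}" replaces
# <calculated-from-row-data>, and the small hash replaces a fetched row.
data_hash = Hash.new { |h, k| h[k] = [] } # default proc replaces the ||= [] dance

GC.disable                 # suspend GC for the duration of the bulk load...
100_000.times do |i|
  key = "key#{i % 1000}"
  data_hash[key] << { "id" => i }
end
GC.enable
GC.start                   # ...then collect once at the end

puts data_hash.size           # 1000 distinct keys
puts data_hash["key0"].size   # 100 rows per key
```

This is only a sketch for experimenting outside the database; measure before adopting anything.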

Thanks in advance!

- Carsten

···

--
Posted via http://www.ruby-forum.com/.

You could probably just use memcached. It's optimized for efficient
memory allocation of hash-based data stores.

···

On Mon, Sep 14, 2009 at 5:15 AM, Carsten Gehling <carsten@sarum.dk> wrote:


Maybe it's not such a good idea to read everything into memory. After
all, retrieving large volumes of data from disk is where RDBMSs
shine. What about querying individual records and storing them in an
LRU cache? You'll find such a beast, which can be used like a Hash,
here:

muppet-laboratories/lib at 9eb71ab7bf7d5db7396905b500bd3ac7cc153e0a · rklemme/muppet-laboratories · GitHub

Then you don't burn large amounts of memory but retain the caching effect.
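The idea is simple enough to sketch inline. This is NOT the linked implementation, just a toy illustration; it assumes Ruby 1.9+ (hashes preserve insertion order), and the block stands in for whatever fetches a single record from the DB:

```ruby
# Toy least-recently-used cache, usable like a Hash. Relies on Ruby
# hashes keeping insertion order: the first key is the least recent.
class LruCache
  def initialize(max_size, &fetch)
    @max_size = max_size
    @fetch    = fetch   # called on a miss, e.g. a single-row DB query
    @store    = {}
  end

  def [](key)
    if @store.key?(key)
      @store[key] = @store.delete(key)               # re-insert: now most recent
    else
      @store.delete(@store.keys.first) if @store.size >= @max_size
      @store[key] = @fetch.call(key)                 # fetch and remember
    end
  end
end

cache = LruCache.new(2) { |k| "row-for-#{k}" }  # stand-in for the DB fetch
cache[:a]; cache[:b]; cache[:a]                 # :b is now least recent
cache[:c]                                       # evicts :b, keeps :a
puts cache.instance_variable_get(:@store).keys.inspect  # [:a, :c]
```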

Cheers

robert

···

2009/9/14 Carsten Gehling <carsten@sarum.dk>:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Jason Stewart wrote:

You could probably just use memcached. It's optimized for efficient
memory allocation of hash based data stores.

I thought of that too. But doesn't memcached expire its data after a
while? Or is it possible to disable that?

- Carsten

···


The super easy way to deal with this is to do what James suggested -- use redis.

It's absurdly easy to use, and quite fast. You ARE limited by the
amount of data you can put into RAM, but you are doing that anyway, so
I assume that's a reasonable limitation.

Using some sort of an LRU cache may work, too. It depends on whether
the query pattern is random access or whether it is clustered
around certain subsets of the data, though.

Kirk Haines

···

On Mon, Sep 14, 2009 at 10:05 AM, Robert Klemme <shortcutter@googlemail.com> wrote:


Robert Klemme wrote:

Maybe it's not such a good idea to read everything in memory. After
all, retrieving large volumes of data from disk is where RDBMS's
shine.

My reason for doing this is simply to avoid multiple SQL queries in a
recursive, self-referencing data structure. Usually this can be avoided by
querying the entire subset of data needed and manipulating it in memory.
Unfortunately the data in question does not allow me to just query the
subset.

I will definitely have a look at Redis. Seems promising.

Thanks.

- Carsten

···


You might consider using Redis.

It definitely doesn't have to expire keys. That's an optional feature.

I'll be posting a series of articles to my blog this week about using it from Ruby, one each morning.

James Edward Gray II

···

On Sep 14, 2009, at 8:32 AM, Carsten Gehling wrote:

Jason Stewart wrote:

You could probably just use memcached. It's optimized for efficient
memory allocation of hash based data stores.

I thought of that too. But doesn't memcached expire its data after a
while? Or is it possible to disable that?

I had a situation like that, where I ended up using one of the nested set
implementations (I think it was Awesomer Nested Set) instead of a tree
hierarchy. I ended up having to write my own (very ugly) iterating
interpreter, which analyzed each node to figure out its recursive
context. It took something like 3 or 4 blocks for the different places where you
could inject code based on a recursive interpretation, but it did get me
down to a single database call, and it did only require a single pass
through the data. At one point (debugging, ahem) I was very strongly
considering going with the approach you are taking. Good luck with it, and
let me know how it turns out; I might opt for that approach instead, next
time.

···

On Mon, Sep 14, 2009 at 12:37 PM, Carsten Gehling <carsten@sarum.dk> wrote:


Is your structure strictly hierarchical (i.e. a tree), or do you need to query part of a graph with cycles? If it is strictly hierarchical, there is a solution that works for all RDBMSs; in the latter case there are solutions for some RDBMSs.

Kind regards

  robert

···

On 14.09.2009 19:37, Carsten Gehling wrote:



Robert Klemme wrote:

Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Unfortunately it is not strictly hierarchical. It consists of relations
between companies and companies, and between companies and persons,
representing shareholders, parent/subsidiary companies, etc.

My job is, given a certain company X, to extract all its owners
and, recursively, their owners, etc. until I reach the "top". Likewise
the other way: to extract all companies owned by company X and,
recursively, all companies owned by them, etc.

Conceptually, the table (actually a view) I am querying holds the data:

CompanyA, Direction, CompanyB, Share
"FooCorp", "owns", "BarCorp", "10%"
"BarCorp", "ownedby", "FooCorp", "10%"
"BarCorp", "owns", "BazCorp", "45%"
"BazCorp", "ownedby", "BarCorp", "45%"
"QweCorp", "owns", "RteCorp", "20%"
"RteCorp", "ownedby", "QweCorp", "20%"
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to query only the necessary rows in the table, without
resorting to recursive calls.

In the example above: if given "FooCorp", all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

I haven't found a solution yet. This is why I've gone and made a
service that holds all these data in memory (in a hash), to speed
things up.
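For the curious, the in-memory extraction then boils down to something like this (heavily simplified, with made-up company names; the `seen` hash plays the role of our rules against infinite recursion):

```ruby
# owned_by maps a company to the companies that own it (the real
# service keeps one hash per direction). `seen` guards against
# ownership cycles in the graph.
def owners_of(owned_by, company, seen = {})
  return [] if seen[company]
  seen[company] = true
  (owned_by[company] || []).flat_map do |owner|
    [owner] + owners_of(owned_by, owner, seen)
  end
end

owned_by = {
  "BarCorp" => ["FooCorp"],   # FooCorp owns BarCorp
  "BazCorp" => ["BarCorp"],   # BarCorp owns BazCorp
  "RteCorp" => ["QweCorp"],   # unrelated branch
}

puts owners_of(owned_by, "BazCorp").inspect  # ["BarCorp", "FooCorp"]
```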

BTW: Thanks for all your great suggestions so far. :-)

- Carsten

···


As far as retrieving them all in a single query, nested sets would do this.
There are other hits that you take from using them, though. Changing the
hierarchy, for example, is very expensive (i.e. company A sells company B or
purchases company C), and treating it as a recursive structure after you
have retrieved it is quite painful. If you do opt for this solution, I can
send you my code that provides recursive functionality through iteration. I
did change some lines in the plugin itself to get more flexible use, but I
think I marked them all.

···

On Mon, Sep 14, 2009 at 11:29 PM, Carsten Gehling <carsten@sarum.dk> wrote:


Robert Klemme wrote:

Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Unfortunately it is not strictly hierarchial. It is relations between
companies-companies and companies-persons, that represents shareholders,
parent-/subsidiary companies, etc.

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

My job is to - given a certain company X - to extract all its owners
and, recursively, their owners, etc. until I reach the "top". Likewise
the other way to extract all companies owned by company X and,
recursively, all companies owned by them, etc.

Conceptually, the table (actually a view) I am querying holds the data:

CompanyA, Direction, CompanyB, Share
"FooCorp", "owns", "BarCorp", "10%"
"BarCorp", "ownedby", "FooCorp", "10%"
"BarCorp", "owns", "BazCorp", "45%"
"BazCorp", "ownedby", "BarCorp", "45%"
"QweCorp", "owns", "RteCorp", "20%"
"RteCorp", "ownedby", "QweCorp", "20%"
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to only query the nessecery rows in the table, without
sorting to recursive calls.

If I understand that table design properly, it is awful, because the
semantics of columns one, three and four change based on the content of
column two. The usual way would be to model this with fixed
semantics, i.e. only have one direction of the ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL, or use a WITH clause
in the query, to ensure the query can be built in a reasonable way.
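To make that concrete, a rough sketch with invented table and column names (`ownership(company, direction, other_company, share)`) - this is untested, must be adapted to the real view, and needs a cycle guard (e.g. a path column or MAXRECURSION on SQL Server), since your data can contain ownership loops:

```sql
-- Sketch only: normalize both directions into one, then walk recursively.
WITH owns (owner, owned) AS (
    SELECT company, other_company FROM ownership WHERE direction = 'owns'
    UNION ALL
    SELECT other_company, company FROM ownership WHERE direction = 'ownedby'
),
downstream (owner, owned) AS (
    -- anchor: the direct holdings of FooCorp
    SELECT owner, owned FROM owns WHERE owner = 'FooCorp'
    UNION ALL
    -- recursive member: holdings of everything found so far
    SELECT o.owner, o.owned
    FROM owns o JOIN downstream d ON o.owner = d.owned
)
SELECT DISTINCT owner, owned FROM downstream;
```

The same shape, anchored on `owned = 'FooCorp'` and joined the other way, would walk upward to the owners.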

In the example above: If given "FooCorp", all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

There are features in modern RDBMSs which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use a WITH expression
(a recursive common table expression).

In Oracle there is CONNECT BY
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_10002.htm#i2066102

The downside is that recursive queries tend to take a performance hit,
as the DB engine cannot fetch everything via a single index and needs
to look at the results so far to know which other records it has to
retrieve.

The nested set model (Josh mentioned it as well) might help although I
haven't thought through all implications in your case:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
http://www.codeproject.com/KB/database/nestedsets.aspx

I haven't found a solution yet. This is why I've gone and made a
service, that holds all these data in memory (in a hash), to speed up on
things.

You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes. Since
a complete reload of the cache is expensive (as you have experienced),
this will soon get complicated, because in order to find the data that
must be refreshed you need queries similar to those that answer the
questions you posed above.

BTW: Thanks for all your great suggestions so far. :-)

You're welcome!

Kind regards

robert

···

2009/9/15 Carsten Gehling <carsten@sarum.dk>:


Robert Klemme wrote:

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

Yes, that is allowed (we define some rules in the extraction code to
avoid infinite recursion).

If I understand that table design properly it is awful because
semantics of columns one, three and four change based on content of
column two. The usual way would be to model this with fixed
semantics, i.e. only have one direction of ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL or use a WITH clause
in the query to ensure the query can be built in a reasonable way.

Actually, the above structure is a view, which adds this direction. The
real data table only holds one record per relation, that is:

"BazCorp", "ownedby", "BarCorp", "45%"
"RteCorp", "ownedby", "QweCorp", "20%"

There are features in modern RDBMS which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use WITH expression:
PostgreSQL: Documentation: 8.4: WITH Queries (Common Table Expressions)
WITH common_table_expression (Transact-SQL) | Microsoft Learn

I am using Microsoft SQL Server, but I didn't know about recursive
querying.

The downside is that recursive queries tend to have a performance hit

Perhaps the best way will be going with my current setup (i.e. loading
the entire data). But use recursive querying to reload part of the data
when relations are changed.

The nested set model (Josh mentioned it as well) might help although I
haven't thought through all implications in your case:

I would rather not begin to alter my data structure.

You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes.

I think that I may have found a way to calculate exactly which
records I need to reload when the relation between two companies is
added/changed/deleted. And SQL Server's WITH option might help me there.

I will have a look at that now. Thanks a bunch for your input - all of
you.

I will post my results, when I get there.

- Carsten

···


Robert Klemme wrote:

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

Yes that is allowed (we define some rules in the extraction code to
avoid infinite recursion).

Good.

If I understand that table design properly it is awful because
semantics of columns one, three and four change based on content of
column two. The usual way would be to model this with fixed
semantics, i.e. only have one direction of ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL or use a WITH clause
in the query to ensure the query can be built in a reasonable way.

Actually the above structure is a view, which adds this direction. The
real data-table only holds one record each relation, that is:

"BazCorp", "ownedby", "BarCorp", "45%"
"RteCorp", "ownedby", "QweCorp", "20%"

Oh.

There are features in modern RDBMS which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use WITH expression:
PostgreSQL: Documentation: 8.4: WITH Queries (Common Table Expressions)
WITH common_table_expression (Transact-SQL) | Microsoft Learn

I am using Microsoft SQL Server, but I didn't know about recursive
querying.

The downside is that recursive queries tend to have a performance hit

Perhaps the best way will be going with my current setup (i.e. loading
the entire data). But use recursive querying to reload part of the data
when relations are changed.

Before you do that: I would first try out the query and see how well
it performs with your data. If it does not perform well, additional
indexing may help. Only if that also fails would I switch to a caching
approach.

The reason: I would try to keep the architecture simple and avoid
redundant data storage, as this tends to cause issues. If you can get
the data you need just in time via a reasonably efficient query, that
is much preferable to your big caching-process approach.

The nested set model (Josh mentioned it as well) might help although I
haven't thought through all implications in your case:

I would rather not begin to alter my data structure.

Why?

You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes.

I think that I may have found a way to calculate, exactly how many
records I need to reload, if the relation between two companies are
added/changed/deleted. And SQL Server's WITH option might help me there.

I will have a look at that now. Thanks a bunch for your input - all of
you.

I will post my results, when I get there.

We'll be waiting eagerly. :-)

Cheers

robert

···

2009/9/15 Carsten Gehling <carsten@sarum.dk>:


My 2 cents:

I have used this kind of caching method in the past, and it works really
well for me. I was able to speed up a database-with-very-heavy-nested-data-to-CSV
script by more than 30x, and this despite the fact that I had analyzed
all queries and had all the necessary indexes in place.
So I guess it works at least when the amount of data is not very big. I
guess (and I can be severely wrong) that the memory limit can also be
overcome by using something like the localmemcache gem, which can dump
this data in an arbitrary memory region not handled by the Ruby process.

Piyush

···

On Tue, Sep 15, 2009 at 6:14 PM, Robert Klemme <shortcutter@googlemail.com> wrote:
