Querying persistent ruby objects in memory

Alexy_Khrabrov · 26 May 2007 21:05

I have a data-mining task which loads data as a big XML tree (10+ MB)
and then reorganizes it. Even loading it with Hpricot takes 10-20
seconds. I don't want to do it for every manipulation I want to try,
especially for sequences of transformations.

Thus I wonder what's a good way to keep the huge object in memory
between the runs of querying scripts. Can Rails be used for that?
I'd rather avoid writing a client-server platform, or using it per se,
unless there's already an existing one. A vague intuition is, it
should be something like threads -- one thread parses XML and keeps it
in memory, another starts up later, somehow joins the memory space of
the first one, queries/transforms it, and ends. Then other queries/
transformations can all be run. Is there anything like it?

Cheers,
Alexy

Robert_K1 · 26 May 2007 21:25

I'd consider using Marshal.

Kind regards

robert
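
A minimal sketch of the Marshal approach, assuming the parsed data can be reduced to plain Ruby objects (hashes, arrays, strings). The `build_dataset` helper and the cache file name are made up for illustration; in the real script the helper would be the slow Hpricot parse. Marshal only handles object graphs it knows how to dump, so a parser's own document object may need converting first.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the slow XML parse; returns plain Ruby objects.
def build_dataset
  { "items" => [{ "id" => 1, "name" => "foo" }, { "id" => 2, "name" => "bar" }] }
end

# Load the dataset from the cache file if it exists;
# otherwise build it once and write the cache for later runs.
def load_dataset(cache_path)
  if File.exist?(cache_path)
    File.open(cache_path, "rb") { |f| Marshal.load(f) }
  else
    data = build_dataset
    File.open(cache_path, "wb") { |f| Marshal.dump(data, f) }
    data
  end
end

# Assumed cache location, chosen only for this example.
cache_path = File.join(Dir.tmpdir, "dataset-#{Process.pid}.marshal")

first  = load_dataset(cache_path)   # builds the data and writes the cache
second = load_dataset(cache_path)   # reads the cache back instead of rebuilding
puts second["items"].map { |item| item["name"] }.inspect
```

Whether this beats re-parsing depends on the data, so it is worth timing both on the real file.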


Alexy_Khrabrov · 26 May 2007 22:35

That's just plain serialization, isn't it? I've seen that and
Madeleine; but my wish is to keep the objects in memory without the
need to dump/reload it, however fast. (That would be a last resort.)

The question is, can we keep an object in memory in one thread, and
explore/change it from another? In the worst case, we can probably
quickly dump an object into a memory region and reload it back via
Marshal -- I guess a crude solution is forming here, using shared
memory or RAM disk -- have to see what's there for macs... But still
I wonder what folks think in terms of all kinds of RAM persistence in
ruby solutions.

Cheers,
Alexy


Francis_Cianfrocca · 26 May 2007 23:13

Aren't you overengineering a little? You want to amortize a ten-second
startup cost over a (presumably) large number of operations against some
dataset. But you keep talking about threads. That tells me that your process
will run for a long time and will know all the operations it has to execute
upfront. In that case, forget about threads and just serialize your
operations. Your life will be much simpler.
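
For that single-process case, "serialize your operations" can be as simple as loading once and running a list of steps in order. A sketch only: `load_tree` and the three steps are invented placeholders, not anything from the original script.

```ruby
# Hypothetical stand-in for the expensive load (the 10-20 second parse).
def load_tree
  [
    { "name" => "b", "score" => 2 },
    { "name" => "a", "score" => 5 },
    { "name" => "c", "score" => 0 }
  ]
end

# Each step takes the current data and returns the transformed data.
STEPS = [
  ->(rows) { rows.reject { |r| r["score"].zero? } },  # drop records with a zero score
  ->(rows) { rows.sort_by { |r| r["name"] } },        # order by name
  ->(rows) { rows.map { |r| r["name"] } }             # keep only the names
]

tree = load_tree                                      # pay the startup cost once
result = STEPS.inject(tree) { |data, step| step.call(data) }
puts result.inspect
```

The same idea works interactively: load the data once in an irb session and try transformations against it there.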

But on the other hand, you talk about shared memory and about not wanting to
write a client/server application. That suggests that you're thinking of
keeping this dataset around and having other PROCESSES send requests to it
at arbitrary times. In that case, don't use threads either, or shared-memory
for that matter. Life is too short to debug all that stuff. Write yourself a
little client-server application and be done with it. If you don't want to
deal with the network programming, use EventMachine.


Robert_K1 · 27 May 2007 08:40

> > I'd consider using Marshal.
>
> That's just plain serialization, isn't it? I've seen that and
> Madeleine; but my wish is to keep the objects in memory without the
> need to dump/reload it, however fast. (That would be a last resort.)

I find that odd. Keeping something in memory is usually a *solution* for some kind of *business requirement* (e.g. to make things fast). Why would you want to keep something in mem if it can be persisted on disk really fast? I don't know the volume of what you need to handle but did you actually try out how fast it is?

> The question is, can we keep an object in memory in one thread, and
> explore/change it from another?

Yes, of course. Easily sharing memory is one (if not *the*) major aspect of multithreaded applications. But reading your other posting I am not sure whether you have the proper idea of MT programming. If you only want to do one set of manipulations at a time you do not need multiple threads because there is no concurrency involved.
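
To illustrate the point about threads: all threads in one process see the same objects, so there is nothing to "join". A minimal sketch with a made-up shared hash, where a Mutex guards the concurrent writes:

```ruby
shared = { "count" => 0 }   # one object in the process; every thread sees it
lock = Mutex.new

workers = 4.times.map do
  Thread.new do
    1000.times do
      # Only one thread at a time may update the shared object.
      lock.synchronize { shared["count"] += 1 }
    end
  end
end
workers.each(&:join)        # wait for all threads to finish

puts shared["count"]
```

This only helps inside one long-running process. A script started later is a separate process with its own memory, so it cannot reach `shared` this way.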

> In the worst case, we can probably
> quickly dump an object into a memory region and reload it back via
> Marshal -- I guess a crude solution is forming here, using shared
> memory or RAM disk -- have to see what's there for macs... But still
> I wonder what folks think in terms of all kinds of RAM persistence in
> ruby solutions.

As James suggested, using DRb is one option. Then you can decide whether to manipulate the object graph in the server process or send it off to the client (and probably send it back after making your changes). It's probably the best solution in your case because you can start arbitrary client processes and manipulate state in the server. But you should make sure that access is properly synchronized to cope with multiple clients connecting concurrently.
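
A rough sketch of that idea, assuming a hypothetical `DataServer` front object that keeps the data in the server and wraps every access in a Mutex. The class, its methods, and the sample rows are invented for illustration. Server and client run in the same process here only to keep the example self-contained; normally the client part would be a separate short-lived script.

```ruby
require 'drb/drb'

# Hypothetical front object: holds the data and serializes access with a Mutex.
class DataServer
  def initialize(data)
    @data = data
    @lock = Mutex.new
  end

  # The query runs where the data lives; only the small result is returned.
  def count_where(key, value)
    @lock.synchronize { @data.count { |row| row[key] == value } }
  end

  # Changes the state held by the server.
  def add(row)
    @lock.synchronize { @data << row }
    nil
  end
end

rows = [{ "lang" => "ruby" }, { "lang" => "perl" }]

# Port 0 asks the OS for any free port; a real setup would use a fixed one.
DRb.start_service("druby://localhost:0", DataServer.new(rows))
uri = DRb.uri

# Client side: obtain a proxy for the front object and call it.
remote = DRbObject.new_with_uri(uri)
remote.add({ "lang" => "ruby" })
hits = remote.count_where("lang", "ruby")
puts hits

DRb.stop_service
```

Keeping the queries as methods on the server object means the large structure never has to be marshalled across to the client.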

Kind regards

robert


James_Tucker · 27 May 2007 06:12

Someone else was talking about this kind of problem the other day in #ruby-lang.

Another person posted an elegant solution to the problem (which, incidentally, was turned down because it involved another process). Here it is anyway:

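#!ruby
# fork is not built in on Windows; this script relies on win32/process there.
# require raises LoadError when the library is missing, so rescue it.
if RUBY_PLATFORM.include?('mswin32')
  begin
    require 'win32/process'
  rescue LoadError
    raise 'You need to install win32/process'
  end
end

# parent forks off and dies, leaving child as daemon
exit 0 if fork

# daemon code starts here
require 'drb/drb'
require 'thread'
require 'server' # your own server.rb (on the load path), defining the Server class

# $SAFE = 1 makes eval() and friends reject tainted strings;
# it is an ordinary global with no effect as of Ruby 3.0
$SAFE = 1

DRb.start_service("druby://:2020", Server.new)
puts DRb.uri
DRb.thread.join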

