Why don't Ruby libraries share memory?

This paragraph is motivation. While my question is not Rails-specific, I am asking it because of Rails. I've been investigating the memory footprint of my Mongrels. It is nice that they share the .so libraries from ImageMagick as well as other C libraries. However, each one still has about 20MB in [heap]. My theory is that a lot of this is coming from ActiveRecord and friends getting loaded again and again for each Mongrel, which seems to me entirely unnecessary. My "marginal cost of Apache" is 1376kB. My "marginal cost of Mongrel" is 27528kB, with the code I wrote. It seems that the latter could be reduced a lot by sharing some Ruby libraries.

The question is as follows: if I require 'library' in one instance of Ruby and then require 'library' again in another instance of Ruby, then do I get duplicate copies of library's code in two chunks of my RAM? (I'm thinking I do.) Why?

For further details and perhaps clarification, consider the following script:

require 'smaps_parser'

smaps = SmapsParser.new(Process.pid)
puts smaps.sums.inspect

%w{rubygems active_record action_controller action_view RMagick}.each do |l|
  puts "\nRequiring #{l}."
  require l
  smaps.refresh
  puts smaps.sums.inspect
end

Though my Mongrel processes have already (each?) loaded copies of each l, and though there is nothing "private" about the code in each l, I get the following output, in which one should pay particular attention to the increase of [:private_dirty]:

{:rss=>1520, :shared_clean=>964, :shared_dirty=>0, :private_clean=>12, :size=>2968, :private_dirty=>544}

Requiring rubygems.
{:rss=>5032, :shared_clean=>1676, :shared_dirty=>0, :private_clean=>224, :size=>7476, :private_dirty=>3132}

Requiring active_record.
{:rss=>12920, :shared_clean=>1816, :shared_dirty=>0, :private_clean=>224, :size=>15452, :private_dirty=>10880}

Requiring action_controller.
{:rss=>18680, :shared_clean=>1828, :shared_dirty=>0, :private_clean=>228, :size=>21152, :private_dirty=>16624}

Requiring action_view.
{:rss=>21088, :shared_clean=>1828, :shared_dirty=>0, :private_clean=>228, :size=>23524, :private_dirty=>19032}

Requiring RMagick.
{:rss=>22512, :shared_clean=>2660, :shared_dirty=>0, :private_clean=>228, :size=>29792, :private_dirty=>19624}
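
(For the curious: smaps_parser just sums the kB fields of /proc/<pid>/smaps, as the :sums keys above suggest. A minimal sketch of such a parser, not my exact code, might look like this.)

class SmapsParser
  attr_reader :sums

  def initialize(pid)
    @pid = pid
    refresh
  end

  # Re-read /proc/<pid>/smaps and sum each kB field across all mappings,
  # so e.g. every Private_Dirty line adds into sums[:private_dirty].
  def refresh
    @sums = Hash.new(0)
    File.foreach("/proc/#{@pid}/smaps") do |line|
      @sums[$1.downcase.to_sym] += $2.to_i if line =~ /^(\w+):\s+(\d+) kB/
    end
    @sums
  end
end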

Matt,

Every time you start a Mongrel instance, it loads a completely new Ruby runtime environment, which can then be modified in any way.

If you think about it, if each Mongrel instance did share the libraries, then when one running application on the system modified some of the library code at runtime, it would affect all other running instances. This could be... bad, at best.

So to answer what I think is your question: yes, you get duplicate copies in two "chunks" of your RAM.

Hope this helps,

   ~Wayne

Wayne E. Seguin
Sr. Systems Architect & Systems Administrator

···

On Aug 13, 2007, at 14:25, Matt Harvey wrote:

<snip>

The question is as follows: if I require 'library' in one instance of Ruby and then require 'library' again in another instance of Ruby, then do I get duplicate copies of library's code in two chunks of my RAM? (I'm thinking I do.) Why?
<snip>

I suppose the main problem is that Rails (or ActiveRecord, I don't
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several Mongrels
instead of one multi-threaded Mongrel.

I don't know where exactly the problem is with Rails/AR, though, nor
whether it is at least theoretically solvable.

···

On 8/13/07, Matt Harvey <matt@teamdawg.org> wrote:

<snip>

The question is as follows: if I require 'library' in one instance of Ruby
and then require 'library' again in another instance of Ruby, then do I get
duplicate copies of library's code in two chunks of my RAM? (I'm thinking I
do.) Why?

You'll get closer to the behavior you expect if you use Kernel#fork to spawn new instances rather than starting up from the shell.

This is how Apache costs only 1376kB.
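
A rough sketch of the idea, with run_worker standing in for whatever the worker actually does (hypothetical; this is not what Mongrel itself does):

require 'rubygems'
require 'active_record'   # pay the loading cost once, in the parent

pids = (1..3).map do
  fork do
    # Each child starts with the parent's pages mapped copy-on-write,
    # so the loaded library code stays shared until something writes
    # to those pages.
    run_worker              # hypothetical worker loop
  end
end
pids.each { |pid| Process.wait(pid) }

Apache's prefork model works the same way, which is why its per-child marginal cost is so small.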

···

On Aug 13, 2007, at 11:25, Matt Harvey wrote:

<snip>

The question is as follows: if I require 'library' in one instance of Ruby and then require 'library' again in another instance of Ruby, then do I get duplicate copies of library's code in two chunks of my RAM? (I'm thinking I do.) Why?

--
Poor workers blame their tools. Good workers build better tools. The
best workers get their tools to do the work for them. -- Syndicate Wars

And one more note: you can save a bit of memory if you put the
thread-safe code into one DRb server, although most probably it's not
worth the effort.
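
Something along these lines, with a made-up ImageScaler as the shared service:

# server.rb -- runs once and holds the shared, thread-safe code
require 'drb'

class ImageScaler                      # made-up example service
  def scale(path, factor)
    "pretend we scaled #{path} by #{factor}"
  end
end

DRb.start_service('druby://localhost:9999', ImageScaler.new)
DRb.thread.join

# client.rb -- each mongrel calls into the single server
require 'drb'

DRb.start_service
scaler = DRbObject.new_with_uri('druby://localhost:9999')
puts scaler.scale('logo.png', 0.5)

Whether the marshalling overhead beats loading the library once per process is exactly the "probably not worth the effort" part.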

···

On 8/13/07, Jano Svitok <jan.svitok@gmail.com> wrote:

On 8/13/07, Matt Harvey <matt@teamdawg.org> wrote:
> <snip>
>
> The question is as follows: if I require 'library' in one instance of Ruby
> and then require 'library' again in another instance of Ruby, then do I get
> duplicate copies of library's code in two chunks of my RAM? (I'm thinking I
> do.) Why?

I suppose the main problem is that Rails (or ActiveRecord, I don't
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several Mongrels
instead of one multi-threaded Mongrel.

I don't know where exactly the problem is with Rails/AR, though, nor
whether it is at least theoretically solvable.


It's actually what Wayne mentioned. Since all Ruby classes can be
modified at runtime, it would be very scary to share them across
separate process instances unless you explicitly wanted that behavior.

As a naive example, consider this:

require "set"

class Set
  def icanhasset
    puts "Oh hai, I is an instance method"
  end
end

Set.new.icanhasset

Oh hai, I is an instance method

Imagine this shared across separate processes running different types
of code. Any modification would be visible everywhere, which means you
couldn't meaningfully modify any classes without expecting problems or
weird bugs. That takes away half of the fun (and utility) of Ruby
right there. :)

-greg

···

On 8/13/07, Jano Svitok <jan.svitok@gmail.com> wrote:

On 8/13/07, Matt Harvey <matt@teamdawg.org> wrote:

Thanks for your reply; but I am still wondering about a more general question. It is my understanding that require 'file.rb' will execute the code in file.rb, so if I require the same file in two different Ruby processes, then I have duplicate copies of its classes in memory. If I do require 'c_library.so', then c_library.so will be loaded as a shared library. I understand that one process might want to override some methods in file.rb, while the other one might depend on the original versions being intact. This is a good reason to load the library twice.

In some instances, though, you might know that there are not going to be overrides, perhaps child classes at most. In this case the classes (not the objects) could be shared. Is there a way to do "shared Ruby libraries" and have them act like shared C libraries? (This question reveals my ignorance of how shared C libraries and OS kernels interact, but I suspect that C (or any compiled?) libraries are special.) DRb is not really what I am asking about; I mean to share only classes, not objects.

I have plenty of RAM to run my three Mongrel processes, which are already overkill for serving a whopping 30 visits and 900 hits per day. (Shameless plug of http://www.teamdawg.org if you want to help me out with some more load.) Therefore, I am not really trying to do anything, just theorizing.

I have seen widespread criticism of Rails as a poorly-scalable memory hog, to which the replies range from "Optimize your code" (often due to ActiveRecord::Base.find generating lots of SELECT * queries), to "Buy more RAM and servers until you bring down your database" (which will happen pretty quickly with egregious SELECT *), to "Check your logs; your database is already the problem." I think Rails is great and Ruby is even greater; in fact I want to see them take over the world. It could happen a lot faster if we could address criticisms like the above, and when a library is as large as ActiveRecord, loading it even one time too many is already cause for criticism.

Sorry, I started talking about Rails again. The question is not about Rails. The questions are: Is there any way we can have shared Ruby libraries without turning the relevant code into a C extension? Is it necessary that code be compiled to be put into shared memory by the OS? (Feel free to tell me I'm being really stupid.) For instance, for all the GTK applications you run, your system needs to load GTK only once. It would be really nice if this could be true of Ruby libraries. I have a feeling that this may just be a limitation of interpreted languages. Please explain.

"Jano Svitok" <jan.svitok@gmail.com> wrote in message news:8d9b3d920708131157o49bcaba5ibd19f4d1ae2a52ba@mail.gmail.com...

···

I suppose the main problem is that Rails (or ActiveRecord, I don't
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several Mongrels
instead of one multi-threaded Mongrel.

I don't know where exactly the problem is with Rails/AR, though, nor
whether it is at least theoretically solvable.

This is not really the issue when talking about throughput with Ruby. Only in uncommon cases will a multithreaded Ruby program actually deliver greater processing throughput than a single-threaded Ruby program. Basically, you have to have external latencies that can be captured _without_ your code waiting on those latencies inside of an extension.

If you run Rails (or some other web framework app) in a multithreaded mode, then yes, more than one request can be inside the code, being handled at the same time. The handling of each of those requests will be substantially slower, though, than if a process handled a single request at a time. It may be a win with regard to app behavior, if the same app mixes fast actions with very slow ones and one only wants to run a single, or a very small number of, processing nodes (mongrels), because it lets the fast actions run to completion without requiring them to wait on the slow actions. But from the POV of overall throughput, it is not a win.
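
As a toy illustration, with a sleep standing in for an external latency such as a database call (timings approximate):

require 'benchmark'

io_bound  = lambda { sleep 0.5 }                     # an external wait
cpu_bound = lambda { 500_000.times { |i| i * i } }   # pure Ruby work

# Four overlapping waits finish in about 0.5s total...
puts Benchmark.measure { (1..4).map { Thread.new(&io_bound) }.each { |t| t.join } }

# ...but four CPU-bound threads take about as long as doing the work
# back to back, because only one thread interprets Ruby at a time.
puts Benchmark.measure { (1..4).map { Thread.new(&cpu_bound) }.each { |t| t.join } }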

Kirk Haines

···

On Tue, 14 Aug 2007, Jano Svitok wrote:

I suppose the main problem is that Rails (or ActiveRecord, I don't
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several Mongrels
instead of one multi-threaded Mongrel.

* Matt Harvey <matt@teamdawg.org> (22:19) wrote:

Sorry, I started talking about Rails again. The question is not about
Rails. The questions are: Is there any way we can have shared Ruby
libraries without turning the relevant code into a C extension? Is it
necessary that code be compiled to be put into shared memory by the
OS?

C libraries have two nice features: they are read-only, and they sit
ready to use in a disk file. So a modern OS can just map that file
into memory and use the same physical memory for every process
accessing the file.

Ruby classes aren't read-only. But you could just write the changes
into memory and keep the constant part in a file. So the real problem
is that the sources aren't ready to use for a modern interpreter. The
data structures the interpreter works on are nowhere on disk; they
are created dynamically when the source files are evaluated.

The source files themselves don't hog memory; they can be freed
after being parsed, or just memory-mapped. It's the parser's result,
the structures the interpreter works on, that needs the memory.

Ruby would have to use "precompiled" source files to be able to use
memory mapping. It could use copy-on-write to change the code
dynamically. There is still much work going on in Ruby, so there may
be a chance of that.

mfg, simon .... l

Matt Harvey wrote:

Sorry, I started talking about Rails again. The question is not about Rails. The questions are: Is there any way we can have shared Ruby libraries without turning the relevant code into a C extension? Is it necessary that code be compiled to be put into shared memory by the OS?

The problem goes further than that. Even if you were to load your libs in one process and then fork off worker processes (using copy-on-write to share the loaded code), the garbage collector writes to *every* page in memory when doing a garbage-collecting run, thus negating the benefits of COW. It's fixed in 1.9, thankfully, but 1.8 is going to be a memory hog no matter which way you look at it.
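
You can watch this happen with a small experiment along these lines (assumes Linux and its /proc/<pid>/smaps interface):

# Hold a few MB of live objects in the parent, then fork.
big = Array.new(100_000) { "x" * 50 }

def private_dirty_kb
  total = 0
  File.foreach("/proc/#{Process.pid}/smaps") do |line|
    total += $1.to_i if line =~ /^Private_Dirty:\s+(\d+) kB/
  end
  total
end

fork do
  before = private_dirty_kb
  GC.start   # 1.8's mark phase flags every live object, dirtying its page
  puts "child Private_Dirty: #{before} kB -> #{private_dirty_kb} kB"
end
Process.wait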

Daniel

Daniel DeLorme wrote:

The problem goes further than that. Even if you were to load your libs in one process and then fork off worker processes (using copy-on-write to share the loaded code), the garbage collector writes to *every* page in memory when doing a garbage-collecting run, thus negating the benefits of COW. It's fixed in 1.9, thankfully, but 1.8 is going to be a memory hog no matter which way you look at it.

What does 1.9 do differently?

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

For .so files, no, for .rb files, yes.

···

On Aug 13, 2007, at 20:14, Daniel DeLorme wrote:

Matt Harvey wrote:

Sorry, I started talking about Rails again. The question is not about Rails. The questions are: Is there any way we can have shared Ruby libraries without turning the relevant code into a C extension? Is it necessary that code be compiled to be put into shared memory by the OS?

The problem goes further than that. Even if you were to load your libs in one process and then fork off worker processes (using copy-on-write to share the loaded code), the garbage collector writes to *every* page in memory when doing a garbage-collecting run, thus negating the benefits of COW. It's fixed in 1.9, thankfully, but 1.8 is going to be a memory hog no matter which way you look at it.

--
Poor workers blame their tools. Good workers build better tools. The
best workers get their tools to do the work for them. -- Syndicate Wars