Perf optimization using profile results

I need some help with optimizing a set of libraries that I use. They are ffi-rzmq, zmqmachine and rzmq_brokers (all up on github).

I ran a 'latency' test using the highest-level library (rzmq_brokers) and noticed that, as I add each network client to the test, each iteration takes another 300 microseconds to complete. From my testing, JRuby is far and away the fastest runtime executing my code so I have used it to gather some profile statistics. Right now I have some local changes in my zmqmachine and rzmq_brokers repositories that haven't been pushed to github, so another party can't currently reproduce what I am seeing. I am working right now to clean up those commits so that I can push them out.

I did two runs with JRuby using the --server option. The first run was with --profile to produce flat profile results. The second run was with --profile.graph to produce graph profile results.

I would be grateful if another set (or several sets) of eyes could take a look at the profile results and suggest places I should focus on for reducing that 300 usec latency per client. I have already reduced it from 700 usec to 300 usec by replacing a regular Array with a SortedArray (where data is stored sorted so that lookups are fast). I'm hoping to get another 50% reduction.

Thanks!

cr

In case someone wants to try and duplicate what I am doing, all of the code has been pushed up to github. I'm really hoping someone can look at the profile results and suggest 2-3 places that are ripe for optimization, but in the event you are feeling frisky, you can now reproduce the whole shebang on your own machine.

The aforementioned code repositories are:

git://github.com/chuckremes/ffi-rzmq.git

git://github.com/chuckremes/zmqmachine.git

git://github.com/chuckremes/rzmq_brokers.git

In rzmq_brokers/test, run:

  ruby test_consensus.rb <any open port number, e.g. 5556>

The test adds a new network client for each 10_000 iterations and prints the average round-trip latency of those 10k messages. As each client is added, latency goes up about 300 usec (on my machine, YMMV). I would like to reduce that number.

cr

···

On Dec 13, 2011, at 9:57 AM, Chuck Remes wrote:

I need some help with optimizing a set of libraries that I use. They are ffi-rzmq, zmqmachine and rzmq_brokers (all up on github).

I ran a 'latency' test using the highest-level library (rzmq_brokers) and noticed that, as I add each network client to the test, each iteration takes another 300 microseconds to complete. From my testing, JRuby is far and away the fastest runtime executing my code so I have used it to gather some profile statistics. Right now I have some local changes in my zmqmachine and rzmq_brokers repositories that haven't been pushed to github, so another party can't currently reproduce what I am seeing. I am working right now to clean up those commits so that I can push them out.

I did two runs with JRuby using the --server option. The first run was with --profile to produce flat profile results. The second run was with --profile.graph to produce graph profile results.

profile.txt · GitHub

I would be grateful if another set (or several sets) of eyes could take a look at the profile results and suggest places I should focus on for reducing that 300 usec latency per client. I have already reduced it from 700 usec to 300 usec by replacing a regular Array with a SortedArray (where data is stored sorted so that lookups are fast). I'm hoping to get another 50% reduction.

Thanks!

A couple quick observations.

* I assume it's expected that the majority of time is spent in
zmq_poll, which is an FFI Call to zmq. Excluding that from
consideration...

* This is bad:

16.57 0.76 15.81 3079083 Class#new

Three million new classes created. I'm sure it's singleton somewhere,
but that's often a big perf sink...not necessarily because it impacts
straight-line code a lot, but because a class object is a nontrivial
thing to be creating three million of.

* Going down the list, I don't see anything obvious to optimize in
most of the hot methods. Usually when you have normalish-looking code
that's really hot, the next thing to optimize is reducing object
allocation as much as possible. In fact, that more than anything often
improves perf on JRuby...but I'm not sure the effect on other impls.

- Charlie

···

On Tue, Dec 13, 2011 at 10:32 AM, Chuck Remes <cremes.devlist@mac.com> wrote:

On Dec 13, 2011, at 9:57 AM, Chuck Remes wrote:

I need some help with optimizing a set of libraries that I use. They are ffi-rzmq, zmqmachine and rzmq_brokers (all up on github).

I ran a 'latency' test using the highest-level library (rzmq_brokers) and noticed that, as I add each network client to the test, each iteration takes another 300 microseconds to complete. From my testing, JRuby is far and away the fastest runtime executing my code so I have used it to gather some profile statistics. Right now I have some local changes in my zmqmachine and rzmq_brokers repositories that haven't been pushed to github, so another party can't currently reproduce what I am seeing. I am working right now to clean up those commits so that I can push them out.

I did two runs with JRuby using the --server option. The first run was with --profile to produce flat profile results. The second run was with --profile.graph to produce graph profile results.

profile.txt · GitHub

I would be grateful if another set (or several sets) of eyes could take a look at the profile results and suggest places I should focus on for reducing that 300 usec latency per client. I have already reduced it from 700 usec to 300 usec by replacing a regular Array with a SortedArray (where data is stored sorted so that lookups are fast). I'm hoping to get another 50% reduction.

Thanks!

In case someone wants to try and duplicate what I am doing, all of the code has been pushed up to github. I'm really hoping someone can look at the profile results and suggest 2-3 places that are ripe for optimization, but in the event you are feeling frisky, you can now reproduce the whole shebang on your own machine.

The aforementioned code repositories are:

git://github.com/chuckremes/ffi-rzmq.git

git://github.com/chuckremes/zmqmachine.git

git://github.com/chuckremes/rzmq_brokers.git

In rzmq_brokers/test, run:

ruby test_consensus.rb <any open port number, e.g. 5556>

The test adds a new network client for each 10_000 iterations and prints the average round-trip latency of those 10k messages. As each client is added, latency goes up about 300 usec (on my machine, YMMV). I would like to reduce that number.

cr

A couple quick observations.

* I assume it's expected that the majority of time is spent in
zmq_poll, which is an FFI Call to zmq. Excluding that from
consideration...

Yes. The call to zmq_poll is essentially time spent blocked on I/O.

* This is bad:

16.57 0.76 15.81 3079083 Class#new

Three million new classes created. I'm sure it's singleton somewhere,
but that's often a big perf sink...not necessarily because it impacts
straight-line code a lot, but because a class object is a nontrivial
thing to be creating three million of.

Is there a way for JRuby to output more detail on those classes? For example, can it tell me how many of each class type have been allocated? If so, I could use that information to figure out the critical path and try to reduce the amount of class creation there.

Also, I'm not certain what you mean by "I'm sure it's singleton somewhere." I know what a singleton is, but I don't understand your use of it in this context. Elaborate, please.

* Going down the list, I don't see anything obvious to optimize in
most of the hot methods. Usually when you have normalish-looking code
that's really hot, the next thing to optimize is reducing object
allocation as much as possible. In fact, that more than anything often
improves perf on JRuby...but I'm not sure the effect on other impls.

I'll take a shot at reducing my object allocation. More information on how many of each type of object is being allocated would be a wonderful statistic to have. Also, if there were a way to trace that allocation back to a particular file and line number (for those cases where the same object is allocated in multiple places), that would be a killer profiling feature. <hint, hint>

Thanks for the pointers. I appreciate the help.

cr

···

On Dec 13, 2011, at 7:03 PM, Charles Oliver Nutter wrote:

Three million new classes created. I'm sure it's singleton somewhere,
but that's often a big perf sink...not necessarily because it impacts
straight-line code a lot, but because a class object is a nontrivial
thing to be creating three million of.

Is there a way for JRuby to output more detail on those classes? For example, can it tell me how many of each class type have been allocated? If so, I could use that information to figure out the critical path and try to reduce the amount of class creation there.

...

Also, I'm not certain what you mean by "I'm sure it's singleton somewhere." I know what a singleton is, but I don't understand your use of it in this context. Elaborate, please.

Actually, it may not be a singleton object/class, now that I think
about it. Someone, somewhere is doing a Class.new { } (or similar) in
critical path code.

I'll take a shot at reducing my object allocation. More information on how many of each type of object is being allocated would be a wonderful statistic to have. Also, if there were a way to trace that allocation back to a particular file and line number (for those cases where the same object is allocated in multiple places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.

It will be very slow, but Sun/Oracle/OpenJDK has an object allocation
profiler built in. Examples:

jruby -J-Xrunhprof ... # run with object allocation profiling,
tracking allocation call stacks 5 deep
jruby -J-Xrunhprof:depth=10 ... # ... call stacks 10 deep

Running java -Xrunhprof:help will provide full help. As I say, it will
be *slow* but you don't need to run a long benchmark; you're looking
for aggregate numbers of objects for some number of iterations, which
won't change drastically over a long run. (There's also a full
instrumented performance profiler in "runhprof", but it's *dead* slow,
because it instruments every method in the system, including every JDK
class.)

The object allocation profiles will be dumped to a gigantic log file
called java.hprof.txt. The primary interesting bit of information is
at the end...a list of object types allocated plus numeric references
to the stack trace where they're allocated. Another example:

Command:

$ jruby -J-Xrunhprof:depth=20 -e "1_000.times { 'foo' + 'bar' }"

A few elements of interest:

   12 0.39% 14.11% 32000 1000 32000 1000 336225 org.jruby.RubyString
   13 0.39% 14.51% 32000 1000 32000 1000 336226 org.jruby.RubyString
   14 0.39% 14.90% 31968 999 31968 999 336239 org.jruby.RubyString

(Most of the top 20 allocated elements are byte[], which doesn't tell you much)

Note that almost 3000 RubyString objects have been allocated. 1000 of
those are 'foo', 1000 of them are 'bar', and 999 of them are the
combined string. It's not exactly accurate because the script
terminates before the final allocation gets recorded.

The last number before the class name is the allocation trace. Looking
at 336225 (search in the same file):

TRACE 336225:
        org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:219)
        org.jruby.RubyObject.<init>(RubyObject.java:94)
        org.jruby.RubyString.<init>(RubyString.java:396)
        org.jruby.RubyString.<init>(RubyString.java:424)
        org.jruby.RubyString.newStringShared(RubyString.java:522)
        org.jruby.ast.executable.RuntimeCache.getString(RuntimeCache.java:105)
        org.jruby.ast.executable.AbstractScript.getString0(AbstractScript.java:165)
        ruby.__dash_e__.block_0$RUBY$__file__(-e:1)
        ruby$__dash_e__$block_0$RUBY$__file__.call(ruby$__dash_e__$block_0$RUBY$__file__:65535)
        org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:112)
        org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:95)
        org.jruby.runtime.Block.yield(Block.java:130)
        org.jruby.RubyFixnum.times(RubyFixnum.java:261)
        org.jruby.RubyFixnum$INVOKER$i$0$0$times.call(RubyFixnum$INVOKER$i$0$0$times.gen:65535)
        org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:302)
        org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:144)
        org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:153)
        ruby.__dash_e__.__file__(-e:1)
        ruby.__dash_e__.load(-e:Unknown line)
        org.jruby.Ruby.runScript(Ruby.java:731)

This takes us all the way back to runScript in org.jruby.Ruby. The
-e:1 lines are our inline script. RubyFixnum.times is the "times"
call. And the top seven lines are the parts of JRuby called to
allocate a new literal String. Looking at the 336239 trace...

TRACE 336239:
        org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:219)
        org.jruby.RubyObject.<init>(RubyObject.java:94)
        org.jruby.RubyString.<init>(RubyString.java:396)
        org.jruby.RubyString.newString(RubyString.java:487)
        org.jruby.RubyString.op_plus(RubyString.java:1039)
        org.jruby.RubyString$INVOKER$i$1$0$op_plus.call(RubyString$INVOKER$i$1$0$op_plus.gen:65535)
        org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:167)
        ruby.__dash_e__.block_0$RUBY$__file__(-e:1)
        ruby$__dash_e__$block_0$RUBY$__file__.call(ruby$__dash_e__$block_0$RUBY$__file__:65535)
        org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:112)
        org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:95)
        org.jruby.runtime.Block.yield(Block.java:130)
        org.jruby.RubyFixnum.times(RubyFixnum.java:261)
        org.jruby.RubyFixnum$INVOKER$i$0$0$times.call(RubyFixnum$INVOKER$i$0$0$times.gen:65535)
        org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:302)
        org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:144)
        org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:153)
        ruby.__dash_e__.__file__(-e:1)
        ruby.__dash_e__.load(-e:Unknown line)
        org.jruby.Ruby.runScript(Ruby.java:731)

This is the String#plus call, org.jruby.RubyString.op_plus.

Incidentally, this is also where "class reification" comes in handy:

$ jruby -J-Xrunhprof:depth=20 -Xreify.classes=true -e "class Foo; end; 1_000.times { Foo.new }"

...

   17 0.29% 15.50% 23616 984 23616 984 337103 rubyobj.Foo

TRACE 337103:
        org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:219)
        org.jruby.RubyObject.<init>(RubyObject.java:94)
        rubyobj.Foo.<init>(<Unknown Source>:Unknown line)
        sun.reflect.GeneratedConstructorAccessor4.newInstance(<Unknown Source>:Unknown line)

        sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        java.lang.Class.newInstance0(Class.java:355)
        java.lang.Class.newInstance(Class.java:308)
        org.jruby.RubyClass$3.allocate(RubyClass.java:142)
        org.jruby.RubyClass.allocate(RubyClass.java:223)
        org.jruby.RubyClass.newInstance(RubyClass.java:809)
        org.jruby.RubyClass$INVOKER$i$newInstance.call(RubyClass$INVOKER$i$newInstance.gen:65535)
        org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroOrNBlock.call(JavaMethod.java:256)
        org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:133)
        ruby.__dash_e__.block_0$RUBY$__file__(-e:1)
        ruby$__dash_e__$block_0$RUBY$__file__.call(ruby$__dash_e__$block_0$RUBY$__file__:65535)
        org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:112)
        org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:95)
        org.jruby.runtime.Block.yield(Block.java:130)
        org.jruby.RubyFixnum.times(RubyFixnum.java:261)

runhprof is a very simple way to get allocation info, and I've used it
many times to trim allocation overhead.

If you find this information useful, perhaps you can reformat it for a
JRuby wiki page :)

- Charlie

···

On Tue, Dec 13, 2011 at 8:03 PM, Chuck Remes <cremes.devlist@mac.com> wrote:

For those of you following along, I was able to shave off another 50 microseconds per client. I learned a few things along the way.

1. Most Ruby runtimes cannot optimize a call to #super with implicit arguments very well.

e.g.
  def foo(a, b, c)
    super # <- slow because it has to figure out that it should pass (a, b, c)
    a + b + c
  end

It is more efficient to pass the specific arguments that the super-method should receive. If it should not receive any, use #super(). This lets the runtime avoid the work of interrogating the current method for its arguments and remarshalling them for the super-method.
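A minimal sketch of both forms (Base/Fast and the method names are hypothetical):

```ruby
# Hypothetical classes contrasting explicit super arguments with super().
class Base
  def run(a, b, c)
    a + b + c
  end

  def reset
    :base_reset
  end
end

class Fast < Base
  def run(a, b, c)
    super(a, b, c) # explicit args: no implicit-argument marshalling
  end

  def reset(*ignored)
    super()        # empty parens pass *no* arguments up the chain
  end
end
```

Fast#run(1, 2, 3) returns 6 via Base#run; Fast#reset accepts extra arguments but forwards none of them.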

2. When using FFI and FFI::Struct, it is *not* efficient to call MyStruct.size on a regular basis. The FFI library has to recalculate the size of the structure each time and return that value. Stash the size in an ivar somewhere and use that instead of calling #size.
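The caching idea can be sketched without a real FFI dependency; FakeStruct and FrameReader below are hypothetical stand-ins, with FakeStruct's class-level size playing the role of the recomputed FFI::Struct size:

```ruby
# Stand-in for an FFI::Struct whose .size re-derives the layout each call.
class FakeStruct
  @size_calls = 0
  class << self
    attr_reader :size_calls

    def size
      @size_calls += 1 # count the "expensive" computations
      64
    end
  end
end

class FrameReader
  def initialize
    @struct_size = FakeStruct.size # computed once, stashed in an ivar
  end

  # Hot path: reads the cached ivar instead of calling .size per message.
  def bytes_for(frame_count)
    frame_count * @struct_size
  end
end
```

However many frames a reader processes, the struct size is computed exactly once, at construction time.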

3. When storing a few values, it's sometimes cheap & easy to just use a Hash. But in hot code, the work to look the element up is sometimes not worth it. If you are only storing a few values, use an ivar.
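A sketch of the trade-off (HashBacked and IvarBacked are hypothetical names):

```ruby
# The same two settings stored in a Hash versus in plain ivars.
class HashBacked
  def initialize
    @opts = { :timeout => 5, :retries => 3 }
  end

  def timeout
    @opts[:timeout] # every call pays a hash lookup
  end
end

class IvarBacked
  def initialize
    @timeout = 5
    @retries = 3
  end

  def timeout
    @timeout # plain ivar read, no hashing
  end
end
```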

4. Constants in Ruby are (unfortunately) not really constant. You can change the value of a constant, though the runtime will kick out a warning. Since constants aren't constant, each access can incur a guarded lookup; the guard checks whether the value referred to by the constant has changed since the last access. In hot code this extra check can add up, particularly if you are using lots of constants. Again, just stash the value in an ivar where lookup is fast.
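For instance, a hot loop can hoist the constant into an ivar once, outside the loop (Checksummer and SEED are made-up names):

```ruby
class Checksummer
  SEED = 0xDEAD # each SEED access in a loop would incur a guarded lookup

  def initialize
    @seed = SEED # pay the guarded constant lookup once, here
  end

  def sum(values)
    total = @seed # plain ivar read in the hot path
    values.each { |v| total = (total + v) & 0xFFFF }
    total
  end
end
```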

5. As always, there is no faster code than "no code." I found a few spots where I was recalculating the same value 3 times in a hot path. The calculation was fairly cheap, but when that hot path is called millions of times even the cheap calcs can be expensive.

@value ||= cheap_when_called_once_but_expensive_when_called_millions_of_times

That does the trick. Wow, got to love those ivars.

6. In hot code when looking up a value in an array, it's a tad faster to use #at instead of #[]. The reason is that #[] can take different arguments other than a simple index (like a Range). Those extra argument checks can add up.
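To illustrate why #[] carries extra argument checks (the array contents here are arbitrary): #at accepts only an integer index, while #[] must also recognize the other forms.

```ruby
prices = [10, 20, 30, 40]

prices.at(2)   # => 30, integer index only: the fast path
prices[2]      # => 30, same result through the general entry point
prices[1, 2]   # => [20, 30], start + length form; #at has no equivalent
prices[0..1]   # => [10, 20], Range form, likewise #[]-only
```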

These are all micro-optimizations and may not even be valid on your runtime of choice. However, if you are running time-critical code and your lingua franca is Ruby, then some of these tricks can eke out a few extra microseconds for you.

cr

That's a full profiler. You can significantly reduce the performance
hit with the cpu=samples option. The inaccuracy introduced by this is
probably negligible since hotspots will still be at the top. Results
can be analyzed with HPjmeter, a free tool from HP.
http://www.javaperformancetuning.com/tools/hpjmeter/index.shtml

There is a more lightweight alternative for just looking at instance
counts and sizes per class:

$ jmap -histo <pid>

Process usually can be found via jps, ps or other means.

Kind regards

robert

···

On Wed, Dec 14, 2011 at 9:04 AM, Charles Oliver Nutter <headius@headius.com> wrote:

On Tue, Dec 13, 2011 at 8:03 PM, Chuck Remes <cremes.devlist@mac.com> wrote:

Also, I'm not certain what you mean by "I'm sure it's singleton somewhere." I know what a singleton is, but I don't understand your use of it in this context. Elaborate, please.

Actually, it may not be a singleton object/class, now that I think
about it. Someone, somewhere is doing a Class.new { } (or similar) in
critical path code.

I'll take a shot at reducing my object allocation. More information on how many of each type of object is being allocated would be a wonderful statistic to have. Also, if there were a way to trace that allocation back to a particular file and line number (for those cases where the same object is allocated in multiple places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.

It will be very slow, but Sun/Oracle/OpenJDK has an object allocation
profiler built in. Examples:

jruby -J-Xrunhprof ... # run with object allocation profiling,
tracking allocation call stacks 5 deep
jruby -J-Xrunhprof:depth=10 ... # ... call stacks 10 deep

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Three million new classes created. I'm sure it's singleton somewhere,
but that's often a big perf sink...not necessarily because it impacts
straight-line code a lot, but because a class object is a nontrivial
thing to be creating three million of.

Is there a way for JRuby to output more detail on those classes? For example, can it tell me how many of each class type have been allocated? If so, I could use that information to figure out the critical path and try to reduce the amount of class creation there.

...

Also, I'm not certain what you mean by "I'm sure it's singleton somewhere." I know what a singleton is, but I don't understand your use of it in this context. Elaborate, please.

Actually, it may not be a singleton object/class, now that I think
about it. Someone, somewhere is doing a Class.new { } (or similar) in
critical path code.

Okay, so I should look for lines where I execute "Class.new {}" in my code. Off the top of my head, it is *rare* that I call Class.new directly; it's almost always via an actual class constant name.

I'll take a shot at reducing my object allocation. More information on how many of each type of object is being allocated would be a wonderful statistic to have. Also, if there were a way to trace that allocation back to a particular file and line number (for those cases where the same object is allocated in multiple places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.
[snip]
If you find this information useful, perhaps you can reformat it for a
JRuby wiki page :)

I find it *very* useful and I *will* add it to the wiki. Thanks for taking the time to post this!

cr

···

On Dec 14, 2011, at 2:04 AM, Charles Oliver Nutter wrote:

On Tue, Dec 13, 2011 at 8:03 PM, Chuck Remes <cremes.devlist@mac.com> wrote:

Exactly.

Array#at shouldn't be faster than Array#[] on MRI 1.9+ with a single arg
(ref: insns.def).

···

Chuck Remes <cremes.devlist@mac.com> wrote:

6. In hot code when looking up a value in an array, it's a tad faster
to use #at instead of #[]. The reason is that #[] can take different
arguments other than a simple index (like a Range). Those extra
argument checks can add up.

These are all micro-optimizations and may not even be valid on your
runtime of choice.

jruby -J-Xrunhprof ... # run with object allocation profiling,
tracking allocation call stacks 5 deep
jruby -J-Xrunhprof:depth=10 ... # ... call stacks 10 deep

That's a full profiler. You can significantly reduce the performance
hit with the cpu=samples option. The inaccuracy introduced by this is
probably negligible since hotspots will still be at the top. Results
can be analyzed with HPjmeter, a free tool from HP.
http://www.javaperformancetuning.com/tools/hpjmeter/index.shtml

For CPU, yes. This is not CPU profiling, however, and runhprof does
not have a sampling mode for allocations.

VisualVM, on the other hand, does have such a mode. It's usually my
second tool of choice for allocation analysis.

And of course you're right...there's many, many other tools for JVM
that can give some of the same and some different information. Most
work just dandy with JRuby.

There is a more lightweight alternative for just looking at instance
counts and sizes per class:

$ jmap -histo <pid>

Indeed. jmap is very useful for getting a "moment in time" snapshot of
in-memory objects.

I've covered several of these tools in my memory-related series of blog posts:

And similar (possibly duplicate) posts on EY blog:

http://www.engineyard.com/blog/2010/monitoring-memory-with-jruby-part-1-jhat-and-visualvm/

http://www.engineyard.com/blog/2010/montoring-memory-with-jruby-part-2-eclipse-memory-analyzer/

Tip of the iceberg.

- Charlie

···

On Wed, Dec 14, 2011 at 2:54 AM, Robert Klemme <shortcutter@googlemail.com> wrote:

Okay, so I should look for lines where I execute "Class.new {}" in my code. Off the top of my head, it is *rare* that I call Class.new directly; it's almost always via an actual class constant name.

They'll look something like this:

    4 1.57% 42.08% 159840 999 159840 999 325078 org.jruby.MetaClass
    5 1.57% 43.64% 159840 999 159840 999 325074 org.jruby.RubyClass

The MetaClass is the class's metaclass, and RubyClass is the class itself.

- Charlie

···

On Wed, Dec 14, 2011 at 7:47 AM, Chuck Remes <cremes.devlist@mac.com> wrote:

I'll take a shot at reducing my object allocation. More information on how many of each type of object is being allocated would be a wonderful statistic to have. Also, if there were a way to trace that allocation back to a particular file and line number (for those cases where the same object is allocated in multiple places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.
[snip]
If you find this information useful, perhaps you can reformat it for a
JRuby wiki page :)

I find it *very* useful and I *will* add it to the wiki. Thanks for taking the time to post this!

cr

Indeed. Array#[] is slower on JRuby because String#[] uses backref
($~), and since we can't tell at compile time whether we're calling
Array#[] or String#[], we always prepare space for backref. There's no
#at method that uses backref, so we can elide the allocation of space
for backref.

I plan to reduce the cost of backref allocation, but it's not there
yet. In any case, JRuby's quite fast either way.

- Charlie

···

On Thu, Dec 15, 2011 at 5:30 PM, Eric Wong <normalperson@yhbt.net> wrote:

Chuck Remes <cremes.devlist@mac.com> wrote:

6. In hot code when looking up a value in an array, it's a tad faster
to use #at instead of #[]. The reason is that #[] can take different
arguments other than a simple index (like a Range). Those extra
argument checks can add up.

These are all micro-optimizations and may not even be valid on your
runtime of choice.

Exactly.

Array#at shouldn't be faster than Array#[] on MRI 1.9+ with a single arg
(ref: insns.def).

It took me a while to get around to it, but I added it to the wiki. Check it out here:

cr

···

On Dec 14, 2011, at 7:47 AM, Chuck Remes wrote:

On Dec 14, 2011, at 2:04 AM, Charles Oliver Nutter wrote:

Okay, so I should look for lines where I execute "Class.new {}" in my code. Off the top of my head, it is *rare* that I call Class.new directly; it's almost always via an actual class constant name.

I'll take a shot at reducing my object allocation. More information on how many of each type of object is being allocated would be a wonderful statistic to have. Also, if there were a way to trace that allocation back to a particular file and line number (for those cases where the same object is allocated in multiple places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.
[snip]
If you find this information useful, perhaps you can reformat it for a
JRuby wiki page :)

I find it *very* useful and I *will* add it to the wiki. Thanks for taking the time to post this!