[QUIZ #157] 32bit vs 64bit vs UML performance

I've run the QUIZ benchmark on several systems to check their relative performance and had quite a surprise.
I'll take Frank's algorithm as the reference for simplicity, as it spends roughly the same time on each point distribution in 157_benchmark_2.rb, but the trend holds for all of the algorithms.

This is on Gentoo Linux; all systems are compiled with gcc 4.1.2:
on 32bit: -O3 -march=i686 -fomit-frame-pointer -pipe
on 64bit: -O2 -pipe

Each benchmark has been run at least twice: at least once while monitoring swap/CPU usage with vmstat and top, and once without monitoring to reduce interference. The results were nearly the same.

First, 64bit MRI is slower than 32bit on the same hardware:

* Athlon 64 X2 3800+ (2GHz) 64bit: ~25% faster in a 32bit chroot on the same system (~24s instead of ~30s).
* Core2Duo 64bit: ~25% slower than Core2Duo 32bit too (~18s on an E6750 (2.66GHz) instead of ~20s on an E6300 (1.83GHz); scaling for clock speed with perf = k*GHz, the E6750 should take ~14s, which is where the "-25%" comes from; see the sketch below).
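
The arithmetic behind that estimate, as a small Ruby sketch (perf = k*GHz is the stated assumption, and the times are the approximate ones quoted above):

  e6300_time = 20.0                        # ~20s measured on the 32bit E6300 @ 1.83GHz
  expected   = e6300_time * (1.83 / 2.66)  # => ~13.8s expected at the E6750's 2.66GHz
  observed   = 18.0                        # ~18s actually measured on the 64bit E6750
  slowdown   = (observed - expected) / observed
  puts slowdown                            # => ~0.24, i.e. roughly 25% slower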

That's not unexpected, but I didn't think it would be so large a performance hit.

Now for the true surprise:

The 32bit Core2Duo E6300 is a system hosting several user-mode-linux based systems which are idling around (low-traffic mail, DNS server and test systems). It sits at a constant 0 load and zero swap activity.
The benchmark is 2x faster *in* the virtual machine (with the very same compilation options; the virtual machine is mostly a clone of the host): 10s instead of 20s! That's not a system timer problem: I actually watched it spend the 10s and 20s.
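
One way to double-check this kind of result from Ruby itself is to compare wall-clock time against the CPU time reported through the times(2) syscall (run_benchmark below is a hypothetical stand-in for the quiz code):

  t0, u0 = Time.now, Process.times
  run_benchmark                         # placeholder for the actual quiz run
  t1, u1 = Time.now, Process.times
  puts "wall: %.2fs" % (t1 - t0)
  puts "cpu:  %.2fs" % ((u1.utime - u0.utime) + (u1.stime - u0.stime))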

I don't know what's going on... Maybe Ruby is making a very specific system call that happens to be faster under user-mode-linux, even though the benchmark consists almost entirely of floating point operations.

The UML kernel is a 2.6.18 with the UML patches; the host is a 2.6.23.9 with the skas3 patch (designed to help UML performance). The UML sees 512MB, the host has 1GB. As I said, there is nearly no concurrent system activity and zero swapping before, during and after the benchmark, on both the host and the virtual machine.

Anyone seen something remotely like this?

Lionel

Err wait, the way you phrased your benchmarks is a bit confusing; post
the total time for each, by bitness (32 or 64) and by CPU.

As for the Intel x86-64 issue, it's become fairly well known that the
32 bit execution time is faster than the 64. With the AMD 64 bit
boxes it greatly depends on the application being run; sometimes 32 is
faster (but not as drastically as in the Intel world), but more often
64 is faster.

If you're really interested in squeezing every last living ounce out
of your box, you may want to fine-tune those x86-64 compilation
settings, and you'll probably have to benchmark & test for oft used
libraries and binaries, but use common sense (recompiling ls & tar
won't get you much, but zlib & gzip may help a bit).

One final thing: when benchmarking, it may suck to wait for many runs
to go by, but you really need to do more than two runs :-)
You probably already know all this, but still, for passers-by: drop to
single user mode if possible, make sure no extra processes are
running, and run the test as many times as you can. Maybe that's
four, but hopefully it's twenty, or a hundred.
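
For the record, a Ruby sketch of that advice (run_benchmark is a placeholder for whatever you're measuring):

  require 'benchmark'

  times = Array.new(20) { Benchmark.realtime { run_benchmark } }
  mean  = times.inject(0.0) { |s, t| s + t } / times.size
  printf "min %.2fs  mean %.2fs  max %.2fs\n", times.min, mean, times.max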

--Kyle

Lionel Bouton wrote:

I've run the QUIZ benchmark on several systems to check their relative performance and had quite a surprise.
I'll take Frank's algorithm as the reference for simplicity, as it spends roughly the same time on each point distribution in 157_benchmark_2.rb, but the trend holds for all of the algorithms.

This is on Gentoo Linux; all systems are compiled with gcc 4.1.2:
on 32bit: -O3 -march=i686 -fomit-frame-pointer -pipe
on 64bit: -O2 -pipe

Hold on!! Try recompiling the 32-bit version with "-march=pentium4" and the 64-bit version with "-O3 -march=athlon64" and *then* compare the timings! You've saddled the 64-bit version with some default architecture and one less optimization level than the 32-bit version.

[snip]

The 32bit Core2Duo E6300 is a system hosting several user-mode-linux based systems which are idling around (low-traffic mail, DNS server and test systems). It sits at a constant 0 load and zero swap activity.
The benchmark is 2x faster *in* the virtual machine (with the very same compilation options; the virtual machine is mostly a clone of the host): 10s instead of 20s! That's not a system timer problem: I actually watched it spend the 10s and 20s.

I don't know what's going on... Maybe Ruby is making a very specific system call that happens to be faster under user-mode-linux, even though the benchmark consists almost entirely of floating point operations.

On the Athlon, you can figure out what's going on with CodeAnalyst. I would guess it's something to do with cache thrashing or lack thereof. On the Intel, you might be able to get some results from CodeAnalyst -- it's basically a wrapper around "oprofile". But you might end up needing Intel's VTune. If you do this for a living, it's worth spending the money. :-)

The UML kernel is a 2.6.18 with the UML patches; the host is a 2.6.23.9 with the skas3 patch (designed to help UML performance). The UML sees 512MB, the host has 1GB. As I said, there is nearly no concurrent system activity and zero swapping before, during and after the benchmark, on both the host and the virtual machine.

Anyone seen something remotely like this?

Not me ... but I'm an Intel-free zone. ;-)

Kyle Schmitt wrote:

Err wait, the way you phrased your benchmarks is a bit confusing; post
the total time for each, by bitness (32 or 64) and by CPU.

It's more complicated than that: I don't have both 32 and 64 bit installs on each of my systems, so I had to compare slightly different machines. Anyway, the 25% slowdown of 64bit in this benchmark isn't such a big deal in my opinion; I'm more interested in why a virtual machine based on user-mode-linux can be 2x faster than the real hardware it runs on... So I wondered if someone had at least seen such behavior before I dig more (running other benchmarks...).

[...]
One final thing: when benchmarking, it may suck to wait for many runs
to go by, but you really need to do more than two runs :-)
You probably already know all this, but still, for passers-by: drop to
single user mode if possible, make sure no extra processes are
running, and run the test as many times as you can. Maybe that's
four, but hopefully it's twenty, or a hundred.

The Quiz #157 benchmark (the 157_benchmark_2.rb file in Bill's /~billk/ruby/quiz/157-smallest-circle/benchmark directory) runs each algorithm 10 times on 4 different data sets. I used the algorithm which takes a nearly constant time (FRANK) for all sets and does the most computations. So this was in fact at least 4 * 2 = 8 runs of 10 iterations of the same code.
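
The harness is roughly structured like this (a sketch, not the actual file; DATASETS and ALGORITHMS are illustrative names):

  require 'benchmark'

  DATASETS.each do |points|            # 4 point distributions
    ALGORITHMS.each do |name, algo|    # FRANK among them
      t = Benchmark.realtime { 10.times { algo.call(points) } }
      puts "%s: %.2fs" % [name, t]
    end
  end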

Lionel

M. Edward (Ed) Borasky wrote:

Lionel Bouton wrote:

I've run the QUIZ benchmark on several systems to check their relative performance and had quite a surprise.
I'll take Frank's algorithm as the reference for simplicity, as it spends roughly the same time on each point distribution in 157_benchmark_2.rb, but the trend holds for all of the algorithms.

This is on Gentoo Linux; all systems are compiled with gcc 4.1.2:
on 32bit: -O3 -march=i686 -fomit-frame-pointer -pipe
on 64bit: -O2 -pipe

Hold on!! Try recompiling the 32-bit version with "-march=pentium4" and the 64-bit version with "-O3 -march=athlon64" and *then* compare the timings! You've saddled the 64-bit version with some default architecture and one less optimization level than the 32-bit version.

Given that there's only one architecture to choose from, I don't think there would be any benefit in telling gcc to use it... I could have specified -mcpu but I don't want to rely on the exact CPU used (putting disks in another system can be handy).
-O2 is recommended for 64bit, as -O3 is often slower and doesn't give much benefit when there is one.

Anyway, I can live with the fact that Ruby is 25% slower on data crunching when PostgreSQL can fly. In the future I'll simply use 32bit systems for web frontends if my benchmarks confirm this trend.

On the Athlon, you can figure out what's going on with CodeAnalyst. I would guess it's something to do with cache thrashing or lack thereof. On the Intel, you might be able to get some results from CodeAnalyst -- it's basically a wrapper around "oprofile". But you might end up needing Intel's VTune. If you do this for a living, it's worth spending the money. :-)

These are CPU-level profiling tools. Given that:
- the CPU is the same in and out of UML,
- I certainly don't have access to the performance counters from inside user-mode-linux,
they won't be of much use to me (profiling the behavior of user-mode-linux itself is not what I'm after).

Eventually, when I have time to narrow down the problem myself, I'll run strace on the benchmark, study the differences and submit the list of system calls to the UML coders, asking why some can be faster on UML than on the host kernel.

Lionel

Lionel Bouton wrote:

These are CPU-level profiling tools. Given that:
- the CPU is the same in and out of UML,
- I certainly don't have access to the performance counters from inside user-mode-linux,
they won't be of much use to me (profiling the behavior of user-mode-linux itself is not what I'm after).

I disagree here.

1. User-mode Linux is a guest in some host. (Gentoo, wasn't it?) In a sense, UML is the application, even though it's executing code on behalf of the benchmark, which is the application you care about. So profiling the host with *oprofile* will tell you what the whole host is doing, including the UML guest and the benchmark within it.

2. The actual physical processor(s) will be the same in and out of UML, yes. However, what the "OS" does with those processors, especially with respect to caches, will be different. There are other things that could affect this, like branch prediction, or the system call possibility you noted before. oprofile and the CodeAnalyst wrapper will tell you how efficiently the processor is being used in both cases.

Eventually, when I have time to narrow down the problem myself, I'll run strace on the benchmark, study the differences and submit the list of system calls to the UML coders, asking why some can be faster on UML than on the host kernel.

Lionel

As long as you're experimenting, you might want to try this with a Xen dom0 host and domU guest. Or a VMware Server. In either case, oprofile on the host should give you some interesting information.

Lionel Bouton wrote:

Eventually, when I have time to narrow down the problem myself, I'll run strace on the benchmark, study the differences and submit the list of system calls to the UML coders, asking why some can be faster on UML than on the host kernel.

OK, there's indeed something odd, but checking the trace shows that it's the results of the "times" system call that are off. The actual time spent in the script is roughly the same when measured with a stopwatch (which shows how much I can trust my time awareness...), and the "real" time is OK, but the user and total times are half what I'd expect.
As they are nearly half of what they are on the host, I suspect a simple bug, like a computation based on the total CPU time of the system instead of the single core the virtual machine is using.
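
Ruby's benchmark library gets its user/system figures from that same syscall (via Process.times), so a misreported utime skews every Ruby-level timing while the wall clock stays correct. A quick way to see the four figures side by side:

  require 'benchmark'

  bm = Benchmark.measure { 5_000_000.times { Math.sqrt(2.0) } }
  # user/total come from times(2), real from the wall clock
  puts bm.format("user %u  system %y  total %t  real %r")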

So sorry no performance gains in UML :-/

Lionel

Anyway, I can live with the fact that Ruby is 25% slower on data crunching when PostgreSQL can fly. In the future I'll simply use 32bit systems for web frontends if my benchmarks confirm this trend.

As an aside, have you measured a 32-bit ruby binary running on the
64-bit system?

I recall you mentioned trying a 32-bit ruby in a chroot ... but what was
puzzling me was why the chroot was needed? I'm relatively new to
64-bit linux, but I have a couple servers now, running 64-bit debian
systems, like:

Linux fragbait 2.6.19.2-amd64.080215-grsec #1 SMP Sun Nov 18 06:48:46 PST 2007 x86_64 GNU/Linux

And they have a /lib32 as well as a /lib64, and provided I install (with apt-get) the 32-bit versions of whatever libraries are needed,
they seem perfectly happy to run 32-bit binaries right alongside
64-bit binaries, no chroot needed...

So it seems you ought to be able to run 32-bit ruby and 64-bit
PostgreSQL side by side with no problem?

Regards,

Bill


From: "Lionel Bouton" <lionel-subscription@bouton.name>

M. Edward (Ed) Borasky wrote:

[snip]

Speaking of profiles, I just ran a "gprof" version of MRI on the two benchmarks. Gory details:

Machine is an Athlon64 X2 running Gentoo Linux in 64-bit mode
GCC 4.2.3
gcc -g -pg compilation flags
ruby-1.8.6-p111 source

The profiles are in http://cougar.rubyforge.org/svn/trunk/ProfilingAndTuningRuby/Quiz_157_Benchmark/b1.gprof

and

http://cougar.rubyforge.org/svn/trunk/ProfilingAndTuningRuby/Quiz_157_Benchmark/b2.gprof

They're pretty similar, and "typical" MRI profiles, much like what I saw on more comprehensive tests. Here's the first 20 lines of the flat profile for benchmark 1:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call   s/call  name
20.39      8.51      8.51  22293906    0.00     0.00  rb_eval
11.41     13.27      4.76  81870099    0.00     0.00  rb_call0
 7.84     16.54      3.27  96913308    0.00     0.00  gc_mark
 6.33     19.18      2.64  81870099    0.00     0.00  rb_call
 5.66     21.54      2.36  31279299    0.00     0.00  gc_mark_children
 5.34     23.77      2.23  46655114    0.00     0.00  st_lookup
 3.74     25.33      1.56  13302985    0.00     0.00  rb_yield_0
 3.43     26.76      1.43  69195311    0.00     0.00  call_cfunc
 2.85     27.95      1.19       294    0.00     0.00  gc_sweep
 2.71     29.08      1.13   3018851    0.00     0.00  st_free_table
 2.60     30.17      1.09  83875387    0.00     0.00  rb_class_of
 1.85     30.94      0.77  69209554    0.00     0.00  rb_newobj
 1.53     31.58      0.64   3231034    0.00     0.00  st_foreach
 1.29     32.12      0.54  31984484    0.00     0.00  obj_free
 1.20     32.62      0.50   7085414    0.00     0.00  st_insert
 1.15     33.10      0.48  99605804    0.00     0.00  rb_special_const_p
 0.72     33.40      0.30  17409278    0.00     0.00  rb_dvar_ref
 0.69     33.69      0.29   7048896    0.00     0.00  rb_ivar_set
 0.67     33.97      0.28  24223139    0.00     0.00  rb_type
 0.65     34.24      0.27  24469412    0.00     0.00  new_dvar

That kind of implies that UML is winning because the garbage collector (gc_mark, gc_mark_children) is more efficient in UML than in the "host". If someone wants to tweak the code so it does less garbage collection, they might see an overall performance improvement and less of a delta between UML and the host.
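
A blunt way to test that hypothesis without touching the algorithms would be to time a run with the collector switched off (a sketch; run_benchmark is a placeholder, and this is only safe for short runs since memory grows unchecked):

  require 'benchmark'

  GC.disable                                # stop MRI's mark-and-sweep collector
  t = Benchmark.realtime { run_benchmark }  # placeholder for the quiz run
  GC.enable
  puts "without GC: %.2fs" % t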

Once I get my other weekend project done (some computer music stuff in mixed Ruby and R) I might push these benchmarks into the main line of the ProfilingAndTuningRuby project and re-run it.

Bill Kelly wrote:

From: "Lionel Bouton" <lionel-subscription@bouton.name>

Anyway, I can live with the fact that Ruby is 25% slower on data crunching when PostgreSQL can fly. In the future I'll simply use 32bit systems for web frontends if my benchmarks confirm this trend.

As an aside, have you measured a 32-bit ruby binary running on the
64-bit system?

Yes, the Athlon64 performance was compared by running a 32bit Ruby on the 64bit system in a 32bit chroot.

I recall you mentioned trying a 32-bit ruby in a chroot ... but what was
puzzling me was why the chroot was needed?

It isn't, but I have a chroot ready, so it's easier for me to simply chroot into it when I need to run 32bit code. Only a few packages are available as 32bit on 64bit Gentoo:
- binary-only packages,
- packages that may link to 32bit libraries, like mplayer or firefox,
- common libraries needed by the two previous cases.

So the purpose of the chroot is to have a full-blown 32bit system at my disposal where I can install whatever is already packaged by Gentoo without hunting for a binary or cross-compiling it myself. For example, if I want to compare PostgreSQL 32bit and 64bit performance, I only have to install PostgreSQL in the chroot with the package manager, make sure it isn't started in the main system and start it from the chroot.
Of course I still run a 64bit kernel, so when I detect a change of behavior that I must be sure of, I have to validate on a complete 32bit system. But usually the kernel doesn't have a big impact: running 32bit Linux code on a 64bit kernel instead of a 32bit one doesn't change the performance characteristics much.

I'm relatively new to
64-bit linux, but I have a couple servers now, running 64-bit debian
systems, like:

Linux fragbait 2.6.19.2-amd64.080215-grsec #1 SMP Sun Nov 18 06:48:46 PST 2007 x86_64 GNU/Linux

And they have a /lib32 as well as a /lib64, and provided I install (with apt-get) the 32-bit versions of whatever libraries are needed,
they seem perfectly happy to run 32-bit binaries right alongside
64-bit binaries, no chroot needed...

So it seems you ought to be able to run 32-bit ruby and 64-bit
PostgreSQL side by side with no problem?

I could, but I'd have to do without Portage to maintain everything that depends on Ruby, as I wouldn't be able to use it to install Ruby itself. The same is probably true on any distribution: if you want to test both 32bit and 64bit binaries of the same package, it's not convenient to install them on the same system.

Lionel

The key here is to run 32bit and 64bit on the same hardware, under
32bit and 64bit installs. It's a pain I know. Then again.... I _may_
just have some boxes in our development environment I can poke with a
stick....
How long does this suite take for one run?

--Kyle

Kyle Schmitt wrote:

The key here is to run 32bit and 64bit on the same hardware, under
32bit and 64bit installs. It's a pain I know. Then again.... I _may_
just have some boxes in our development environment I can poke with a
stick....
How long does this suite take for one run?
  
Around 3 minutes on my laptop, and I doubt you'll find a less powerful 64bit-capable system :-)

Kyle Schmitt wrote:

The key here is to run 32bit and 64bit on the same hardware, under
32bit and 64bit installs. It's a pain I know. Then again.... I _may_
just have some boxes in our development environment I can poke with a
stick....
How long does this suite take for one run?

--Kyle

A lot less time than it takes to do two Linux installs. ;-)

OK, I didn't have a 32bit install on a 64bit-capable box that was
unused, so I compiled a 32bit and a 64bit stable Ruby on one of the
larger boxes (2 dual-core 3GHz Opterons, 16GB of RAM; aside from
forcing 64 and 32 bits, they had the same compilation flags) and let
it run through those benchmarks overnight. Setting it up was a good way
to spend 10 minutes.

The interesting part is that the 32bit build ran noticeably faster than
the 64bit build, even though this is a 64bit Linux box.
The average time for a complete run of the benchmark differed by 30 seconds:
32 bit: 130.1904 s
64 bit: 160.5199 s
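
The relative slowdown, worked out from those averages:

  (160.5199 - 130.1904) / 130.1904   # => ~0.233, i.e. the 64bit build is ~23% slower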

I'll be the first to admit this wasn't the best test, considering the
box wasn't in single user mode, but each run was niced to -20, there
were several gigs of ram free even during the tests, and the system
load before the test started was under 0.3.

Now the question is, what does that actually _tell_ us?
Here are the options as far as I can see, but I'd be interested to
know what others there may be...
1) Loading 64bit Ruby simply takes longer (a quick test for this follows the list)
2) The core Ruby language needs some rewrites/tuning to perform well
on 64 bit.
3) or... just a few portions of the Ruby language need rewrites/tuning
to perform well on 64 bit.
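
To separate option 1 from the other two, one could time bare interpreter startup and subtract it from the full benchmark time (a sketch; point it at whichever of your two builds you're testing, and note it also counts fork/exec overhead):

  require 'benchmark'

  startup = Benchmark.realtime { system("ruby -e ''") }  # start and exit immediately
  puts "startup overhead: %.2fs" % startup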

--Kyle

Kyle Schmitt wrote:

OK, I didn't have a 32bit install on a 64bit-capable box that was
unused, so I compiled a 32bit and a 64bit stable Ruby on one of the
larger boxes (2 dual-core 3GHz Opterons, 16GB of RAM; aside from
forcing 64 and 32 bits, they had the same compilation flags) and let
it run through those benchmarks overnight. Setting it up was a good way
to spend 10 minutes.

The interesting part is that the 32bit build ran noticeably faster than
the 64bit build, even though this is a 64bit Linux box.
The average time for a complete run of the benchmark differed by 30 seconds:
32 bit: 130.1904 s
64 bit: 160.5199 s

I'll be the first to admit this wasn't the best test, considering the
box wasn't in single user mode, but each run was niced to -20, there
were several gigs of ram free even during the tests, and the system
load before the test started was under 0.3.

Now the question is, what does that actually _tell_ us?
Here are the options as far as I can see, but I'd be interested to
know what others there may be...
1) Loading 64bit Ruby simply takes longer (a quick test for this follows the list)
2) The core Ruby language needs some rewrites/tuning to perform well
on 64 bit.
3) or... just a few portions of the Ruby language need rewrites/tuning
to perform well on 64 bit.

--Kyle

If my profiles for the benchmark are to be believed (and I have no reason to doubt them), it tells us that method dispatch and garbage collection are slower on 64 bit Ruby than on 32 bit Ruby. One other piece of information that would be interesting is the difference in executable size. I would expect the 64-bit executable to be larger if entries are getting properly aligned on doubleword boundaries. And that in turn would mean that, on average, less of the executable would be in the instruction cache for the 64-bit one. Try a "size <ruby executable>" and see what the difference is.

CodeAnalyst will answer all these questions for you. :-)

$ size ruby-32
   text    data     bss     dec     hex filename
1226561    5196   70904 1302661  13e085 ruby-32
$ size ruby-64
   text    data     bss     dec     hex filename
1411988    8696  129488 1550172  17a75c ruby-64

Humm.....

--Kyle

Kyle Schmitt wrote:

$ size ruby-32
   text    data     bss     dec     hex filename
1226561    5196   70904 1302661  13e085 ruby-32
$ size ruby-64
   text    data     bss     dec     hex filename
1411988    8696  129488 1550172  17a75c ruby-64

Humm.....

--Kyle

Ayup ... 'tis as I suspected. :-) I sense a possible cache miss problem here.

I just downloaded the latest CodeAnalyst beta, but I haven't been able to get it to start up yet. Perhaps early March I'll play with it further.