For performance, write it in C - Part 3, Source code now available

The source code is available from http://peterhi.dyndns.org/write_it_in_c/index.html

Great! Thanks.

···

On Tue, Aug 01, 2006 at 05:48:41PM +0900, Peter Hickman wrote:

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
Brian K. Reid: "In computer science, we stand on each other's feet."

Peter Hickman wrote:

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html

There are some details missing from the webpages
1) which C implementation?
2) which Java implementation?
3) what hardware?

For example, using the code from the webpage

1) gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)

gcc -pipe -Wall -O3 -fomit-frame-pointer -funroll-loops -march=pentium4
latin.c -o latinc
time ./latinc > /dev/null 2>&1

user 0m0.820s
sys 0m0.000s

2) /sun-jdk-1.5.0.07/bin/javac Latin.java
time java Latin > /dev/null 2>&1

user 0m3.800s
sys 0m0.644s

3) 2GHz Intel P4

Peter Hickman wrote:

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html

Hmm, it would look much better for me if you had included Kristof
Bastiaensen's Curry version
into the party... Heck, if that code is not smart then nothing is, and
in a way, it's a much more
interesting question how a compiled functional language compares to
compiled imepative ones
than the thousand time discussed interpreted vs. compiled match-up.

Yes, Curry is anything but mainstream, but you can't say a decent
compiler is not at your disposal.
The Munster CC (which was used by Kristof, and AFAIK that's the most
commonly used implementation) does even have an IDE for OS X.

Regards,
Csaba

Here is another version in Ruby. It uses a more clever algorithm.
It makes use of the fact that permuting the rows or the columns of
a latin square still gives a latin square. It also passes a constraint
on the columns that the given number can be in, to reduce the backtracking
even more.

It runs in 15 seconds on my machine, which I find quite acceptable.

I am sure that this algorithm coded in a compiled functional or logic
language would be about as fast, or faster than your C code, but with the
advantages of a higher level language.

I still think you benchmark is flawed, because
  1) the C program works only for squares of size 5
  2) It shows totally no relevance to the kind of problems that you
     would use Ruby for. If you want raw speed for numerical applications,
     then I would suggest to use a functional language (i.e. ocaml).
  3) If your application runs slowly, it is probably better to think first
     about why it is slow, if you can improve the code with faster
     algorithms or removing bottlenecks.

Kristof Bastiaensen

---------------- latin2.rb --------------------
require 'permutation'

$size = (ARGV.shift || 5).to_i
MaxVal = $size-1

RowPermutations = Permutation.new($size).map{|p| p.value}
BaseStr = (2..$size).to_a.join
StrPermutations = Permutation.new($size-1).map{|p| p.project(BaseStr)}

StartColumns = (1..MaxVal).to_a
def init_columns(el)
   a = StartColumns.dup
   a.delete_at(el-1)
   return a
end

def insert(sqr, num, row, columns)
   insert(sqr, num, row+1, columns) if (row == num)
   columns.each do |col|
      next if sqr[row][col] != ?.
      sqr[row][col] = num + ?1
      if (row == MaxVal)
         insert(sqr, num+1, 1, init_columns(num+1))
      elsif (num == MaxVal && row == MaxVal - 1)
         print_solution(sqr)
      else
         insert(sqr, num, row+1, columns - [col])
      end
      sqr[row][col] = ?.
   end
end

def print_solution(sqr)
   RowPermutations.each do |rp|
      StrPermutations.each do |sp|
         0.upto(MaxVal) do |r|
            print sqr[rp[r]].tr(BaseStr, sp)
            print ":"
         end
         puts
      end
   end
end

$square = [("1" + BaseStr)] +
   Array.new($size-1) { |i| (i+2).to_s + "." * ($size - 1) }

insert($square, 0, 1, StartColumns)

···

On Tue, 01 Aug 2006 17:48:41 +0900, Peter Hickman wrote:

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html

Time for another update.

Isaac Gouy provided a Java implementation based on mine (ie still pre computes the tables in Perl) that brought the times down to sub 9 seconds.

real 0m8.966s
user 0m5.815s
sys 0m1.488s

But the big news is that William James' revision of his previous Ocaml version is now the fastest.

real 0m3.660s
user 0m1.958s
sys 0m1.421s

The source code for both are available on the web site for you to examine.

New update to the web site. I have added the language versions to the
page and another timing.

Isaac Gouy created a Java version that output the data in binary format
so as to circumvent any overhead in the conversions. It didn't seem to
improve much on his previous version but to be honest I no longer
believe that my system can reliably time such short runs.

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html

I started to get my AMD64 4400 box set up to the 6 x 6 grids. It is
probably going to take me a week or more running the tests to get the
timings for this. Is anyone actually that interested or shall we knock
this thread on the head?

I'll leave the web page up.

Peter Hickman wrote:

The source code is available from http://peterhi.dyndns.org/write_it_in_c/index.html

IMHO they all are fast.... or maybe your computer
has a *very* *very* special CPU.... :wink:

  "The machine in question is a macintosh G4 700Hz with 1Gb of ram."

Really cool, such fast results, with 700 Hz CPU clock cycle... :slight_smile:

Ciao,
    Oliver

Isaac Gouy wrote:

Peter Hickman wrote:
  

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
    
There are some details missing from the webpages
1) which C implementation?
  

[peterhickman]$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin8
Configured with: /private/var/tmp/gcc/gcc-5341.obj~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=powerpc-apple-darwin8 --host=powerpc-apple-darwin8 --target=powerpc-apple-darwin8
Thread model: posix
gcc version 4.0.1 (Apple Computer, Inc. build 5341)

2) which Java implementation?
  

[peterhickman]$ javac -version
javac 1.5.0_06

Additionally

[peterhickman]$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 6) configuration:

[peterhickman]$ ruby -v
ruby 1.8.4 (2005-12-24) [powerpc-darwin]

3) what hardware?

Macintosh G4 with 1Gb ram

csaba wrote:

Peter Hickman wrote:

The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html

Hmm, it would look much better for me if you had included Kristof
Bastiaensen's Curry version
into the party... Heck, if that code is not smart then nothing is, and
in a way, it's a much more
interesting question how a compiled functional language compares to
compiled imepative ones
than the thousand time discussed interpreted vs. compiled match-up.

Yes, Curry is anything but mainstream, but you can't say a decent
compiler is not at your disposal.
The Munster CC (which was used by Kristof, and AFAIK that's the most
commonly used implementation) does even have an IDE for OS X.

While I certainly appreciate the efforts that are going into this, I can't help feeling it's all completely irrelevant.

We can engage in cross-implementation pissing contests until the cows come home. None of them help make Ruby any faster.

My question to the community: is there a comprehensive benchmark suite *for Ruby alone* that we can use to tweak compilation settings, try out different core algorithms, and improve what is currently an improvable situation?

If not, would a port of pybench.py be a suitable start?

···

--
Alex

Here's how it goes on my box:

[Latin]$ time ruby latin2.rb > r5

real 0m16.283s
user 0m15.380s
sys 0m0.498s

Which means that your computer is about as crap as mine :slight_smile: I will add it to the pages and make it a download. I think I need a table summarising the timings.

Kristof Bastiaensen wrote:

> The source code is available from
> http://peterhi.dyndns.org/write_it_in_c/index.html

Here is another version in Ruby. It uses a more clever algorithm.
It makes use of the fact that permuting the rows or the columns of
a latin square still gives a latin square. It also passes a constraint
on the columns that the given number can be in, to reduce the backtracking
even more.

It runs in 15 seconds on my machine, which I find quite acceptable.

I am sure that this algorithm coded in a compiled functional or logic
language would be about as fast, or faster than your C code, but with the
advantages of a higher level language.

I still think you benchmark is flawed, because
  1) the C program works only for squares of size 5
  2) It shows totally no relevance to the kind of problems that you
     would use Ruby for. If you want raw speed for numerical applications,
     then I would suggest to use a functional language (i.e. ocaml).

Here's an OCaml version that runs in about 1.5 seconds when
output is redirected to a file on my faster computer (3.2 GHz).
It uses the same algorithm as the C program.
The C version takes 1.961 seconds when writing to /dev/null on
the o.p.'s computer. Since this is my first OCaml program, I'm
certain an expert could improve it.

(* compile with:
   ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)

(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
  > (hd :: tl) as list -> (elt :: list) ::
      (List.map (fun x -> hd :: x) (distribute elt tl))
  > [] -> [ [elt] ]
let rec permute = function
  > x :: rest -> List.flatten (List.map (distribute x) (permute rest))
  > [] -> [ [] ] ;;

let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list
let perms = permute list

let n = List.length perms
let incompat = Array.make_matrix n n true ;;

for x = 0 to n - 1 do
  for y = 0 to n - 1 do
    incompat.(x).(y) <-
      List.exists2 ( = ) (List.nth perms x) (List.nth perms y)
  done
done;;

let join list = String.concat "" (List.map string_of_int list)
let output_strings = List.map join perms ;;
let board = ( Array.make size 0 ) ;;

let rec add_a_row row =
  if row = size then
    print_endline
      ( String.concat ":"
        (List.map (fun i -> List.nth output_strings i)
          (Array.to_list board)) )
  else
    for latest = 0 to n - 1 do
      let prev_row = ref 0 in
      while (!prev_row < row) &&
            (not incompat.(latest).(board.(!prev_row))) do
        incr prev_row
      done;
      if !prev_row = row then
        ( board.(row) <- latest ; add_a_row (row + 1) )
    done
;;

add_a_row 0 ;;

···

On Tue, 01 Aug 2006 17:48:41 +0900, Peter Hickman wrote:

Peter Hickman wrote:

Time for another update.

Isaac Gouy provided a Java implementation based on mine (ie still pre
computes the tables in Perl) that brought the times down to sub 9 seconds.

real 0m8.966s
user 0m5.815s
sys 0m1.488s

But the big news is that William James' revision of his previous Ocaml
version is now the fastest.

real 0m3.660s
user 0m1.958s
sys 0m1.421s

The source code for both are available on the web site for you to examine.

I stole code and ideas from Jon Harrop, so this should be called
the James/Harrop version.

Peter Hickman wrote:

Time for another update.

Isaac Gouy provided a Java implementation based on mine (ie still pre
computes the tables in Perl) that brought the times down to sub 9 seconds.

real 0m8.966s
user 0m5.815s
sys 0m1.488s

sub 9 seconds?
7.3s
"real" is elapsed time, which includes all those other processes that
grabbed CPU after you gave the time command.

Peter Hickman wrote:

Time for another update.

...

The source code for both are available on the web site for you to examine.

That URL again? -Tim

···

This is just a note to clarify things in my mind and then ask a
few questions about the language.

Here's what I think I know:

-- The current version of Ruby is 1.8.5.
-- There is a 1.9 out there, but it's a work in progress and not released
    yet.
-- There is a 2.0 (YARV) out there, but it's also a work in progress.
-- 1.8.5, 1.9, and 2.0 all support slightly different languages.
-- 1.9 is the place where most language experimentation is going on
-- 2.0 will support the same language as 1.8, but with a different underlying
    implementation.
-- There is no formal language specification (and most of us are working off
    the pickaxe book)

Here's what I don't know:

-- Is any of the above wrong?
-- Are there estimated release dates for 1.9 and 2.0?
-- Are there places where changes to the language between 1.8.5, 1.9, and 2.0
    are listed (note: http://wiki.rubygarden.org/Ruby/page/show/Rite doesn't
    really count) ?

Best,

Bill

Ok, so there's a bunch of problems with the Java version.

- In addition to the addRow run and the Java startup time you're also
benchmarking over 5200 array modifications to set Compared values to true
- Your variable naming is entirely contrary to every Java coding convention
published (not a benchmark thing, but it sets off any Java devs warning
flags)
- Almost all of the time spent running is spent building and printing
strings

Benchmarking just the algorithm run itself with no gross string creation and
printing, I'm getting in the neighborhood of 370ms per invocation once
HotSpot has optimized the code. I'll have more detailed numbers shortly.

···

--
Contribute to RubySpec! @ www.headius.com/rubyspec
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

Peter Hickman wrote:

Isaac Gouy wrote:
> Peter Hickman wrote:
>
>> The source code is available from
>> http://peterhi.dyndns.org/write_it_in_c/index.html

I recall someone stating "benchmarking without analysis is bogus".

As a first step, comment out the print statements

time ./latinc

user 0m0.492s
sys 0m0.004

time java Latin

user 0m0.992s
sys 0m0.052s

With the print statements ~5.4x
without print statements ~2.1x

iirc the Java program is shuffling around double byte unicode chars and
the C program is handling single byte chars.

While I certainly appreciate the efforts that are going into this, I
can't help feeling it's all completely irrelevant.

My only purpose in battling these benchmarks is to help dispel the rumors
that "Java is slow," "VMs are slow," and so on. If Ruby does move to a real
optimizing VM, it will be a good thing...all those folks who continue to
think that VMs are inherently bad need to join the 21st century.

We can engage in cross-implementation pissing contests until the cows

come home. None of them help make Ruby any faster.

My question to the community: is there a comprehensive benchmark suite
*for Ruby alone* that we can use to tweak compilation settings, try out
different core algorithms, and improve what is currently an improvable
situation?

If not, would a port of pybench.py be a suitable start?

Oh man, what I wouldn't give for a really good community-approved benchmark
suite. We've been battling performance gremlins on JRuby since I started,
and we're just now starting to make some good progress. However we don't
have any really solid set of tests to use to benchmark things. I'd be very
pleased if something existed, and I'd be willing to devote some time to make
it happen otherwise.

···

On 8/1/06, Alex Young <alex@blackkettle.org> wrote:

--
Contribute to RubySpec! @ www.headius.com/rubyspec
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

William James wrote:

Kristof Bastiaensen wrote:
>
> > The source code is available from
> > http://peterhi.dyndns.org/write_it_in_c/index.html
>
> Here is another version in Ruby. It uses a more clever algorithm.
> It makes use of the fact that permuting the rows or the columns of
> a latin square still gives a latin square. It also passes a constraint
> on the columns that the given number can be in, to reduce the backtracking
> even more.
>
> It runs in 15 seconds on my machine, which I find quite acceptable.
>
> I am sure that this algorithm coded in a compiled functional or logic
> language would be about as fast, or faster than your C code, but with the
> advantages of a higher level language.
>
> I still think you benchmark is flawed, because
> 1) the C program works only for squares of size 5
> 2) It shows totally no relevance to the kind of problems that you
> would use Ruby for. If you want raw speed for numerical applications,
> then I would suggest to use a functional language (i.e. ocaml).

Here's an OCaml version that runs in about 1.5 seconds when
output is redirected to a file on my faster computer (3.2 GHz).
It uses the same algorithm as the C program.
The C version takes 1.961 seconds when writing to /dev/null on
the o.p.'s computer. Since this is my first OCaml program, I'm
certain an expert could improve it.

An expert did suggest replacing a couple of lists with arrays.
This should be a bit faster.

(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe *)

(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
  > (hd :: tl) as list -> (elt :: list) ::
      (List.map (fun x -> hd :: x) (distribute elt tl))
  > [] -> [ [elt] ]
let rec permute = function
  > x :: rest -> List.flatten (List.map (distribute x) (permute rest))
  > [] -> [ [] ] ;;

let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list

let perms = Array.of_list (permute list)
let n = Array.length perms
let incompat = Array.make_matrix n n true ;;

for x = 0 to n - 1 do
  for y = 0 to n - 1 do
    incompat.(x).(y) <-
      List.exists2 ( = ) perms.(x) perms.(y)
  done
done;;

let join list = String.concat "" (List.map string_of_int list)
let output_strings = Array.map join perms ;;
let board = ( Array.make size 0 ) ;;

(* recursive function *)
let rec add_a_row row =
  if row = size then
    print_endline
      ( String.concat ":"
        (List.map (fun i -> output_strings.(i) )
          (Array.to_list board)) )
  else
    for latest = 0 to n - 1 do
      let prev_row = ref 0 in
      while (!prev_row < row) &&
            (not incompat.(latest).(board.(!prev_row))) do
        incr prev_row
      done;
      if !prev_row = row then
        ( board.(row) <- latest ; add_a_row (row + 1) )
    done
;;

add_a_row 0 ;;

···

> On Tue, 01 Aug 2006 17:48:41 +0900, Peter Hickman wrote: