Possibly improvable code

Hi,

I made a comparison:

ram@lilith:~/src/ruby$time ruby -e 'num=STDIN.gets; a=Array.new; (0..9).each
{|i| a[i] = 0}; (0..num.length).each {|i| a[num[i..i].to_i] += 1 };
a.each_index {|i| print "#{i} #{a[i]}\n"} ' < pi.out
0 16096
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m1.023s
user 0m0.970s
sys 0m0.010s

and

ram@lilith:~/src/ruby$time gawk '{i=0;j=length($0);while( i <= j) {i++;
z=substr($0,i,1); {count=count+ 1; a[z]=a[z]+1; }}}END{ for ( i in a ) {print
i ":" a[i];}print "Digits: " count;}' < pi.out | sort
0:16095
1:16107
2:15965
3:15915
4:15967
5:16169
6:15934
7:16091
8:15999
9:15790
:1
Digits: 160033

real 0m0.305s
user 0m0.310s
sys 0m0.000s

I'm really new to ruby. So I would like to know, if there is any possible
improvement of the ruby code.
pi.out contains about 160000 digits of PI and is made by "ruby sample/pi.rb >
pi.out' inside the src-dir of the ruby-*.tgz

Any hints???

thanks
ralf

# get the input
num = STDIN.gets;
# initialize array entries with 0
a = Array.new(255,0);
# count the occurence of each byte in the input
num.each_byte {|i| a[i] += 1};
# only interested in the byte/characters for 0-9
(?0..?9).each {|i| print "#{[i].pack("c")} #{a[i]}\n"};

I suspect the num[i..i].to_i is pretty expensive... and it's not
really needed in this case.

Cheers,
Ruben

···

At Wed, 16 Jun 2004 23:13:26 +0900, Ralf Müller wrote:

ram@lilith:~/src/ruby$time ruby -e 'num=STDIN.gets; a=Array.new; (0..9).each
{|i| a[i] = 0}; (0..num.length).each {|i| a[num[i..i].to_i] += 1 };
a.each_index {|i| print "#{i} #{a[i]}\n"} ' < pi.out

tried with Hash instead of Array:
ram@lilith:~/src/ruby$time ruby -e 'str=STDIN.gets; h=Hash.new(); (0..9).each
{|i| h[i.to_s] = 0}; (0..str.length-1).each {|i| h[str[i..i]] += 1 }; h.each
{|k,v| print "#{k} #{v}\n"} ' < pi.out | sort
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.920s
user 0m0.910s
sys 0m0.010s

gains 10%.

···

Am Mittwoch, 16. Juni 2004 16:13 schrieb Ralf Müller:

Hi,

I made a comparison:

ram@lilith:~/src/ruby$time ruby -e 'num=STDIN.gets; a=Array.new;
(0..9).each {|i| a[i] = 0}; (0..num.length).each {|i| a[num[i..i].to_i] +=
1 }; a.each_index {|i| print "#{i} #{a[i]}\n"} ' < pi.out
0 16096
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m1.023s
user 0m0.970s
sys 0m0.010s

and

ram@lilith:~/src/ruby$time gawk '{i=0;j=length($0);while( i <= j) {i++;
z=substr($0,i,1); {count=count+ 1; a[z]=a[z]+1; }}}END{ for ( i in a )
{print i ":" a[i];}print "Digits: " count;}' < pi.out | sort
0:16095
1:16107
2:15965
3:15915
4:15967
5:16169
6:15934
7:16091
8:15999
9:15790

:1

Digits: 160033

real 0m0.305s
user 0m0.310s
sys 0m0.000s

I'm really new to ruby. So I would like to know, if there is any possible
improvement of the ruby code.
pi.out contains about 160000 digits of PI and is made by "ruby sample/pi.rb
> pi.out' inside the src-dir of the ruby-*.tgz

Any hints???

thanks
ralf

11111111111111111111111111111111111111111111111111111111111111
ram@lilith:~/src/ruby$cat rec_pi.rb
n = STDIN.gets
  tot = 0
  9.times do |i|
    c = n.count(i.to_s)
    tot += c
    puts "#{i} #{c}"
  end
  puts "9 #{n.size - tot}"

ram@lilith:~/src/ruby$time ruby rec_pi.rb < pi.out
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.045s
user 0m0.040s
sys 0m0.000s

22222222222222222222222222222222222222222222222222222222222222222
ram@lilith:~/src/ruby$cat rec_pi_a.rb
c={};
$_.each_byte{|b| c[b]||=0; c[b]+=1 };
c.sort.each {|k,v| puts "#{k - ?0} #{v}"}

ram@lilith:~/src/ruby$time ruby -n rec_pi_a.rb < pi.out
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.260s
user 0m0.240s
sys 0m0.010s

3333333333333333333333333333333333333333333333333333333333333333333
ram@lilith:~/src/ruby$cat rec_pi_b.rb
num = STDIN.gets;
a = Array.new(255,0);
num.each_byte {|i| a[i] += 1};
(?0..?9).each {|i| print "#{[i].pack("c")} #{a[i]}\n"};

ram@lilith:~/src/ruby$time ruby rec_pi_b.rb < pi.out
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.194s
user 0m0.160s
sys 0m0.010s

···

####################################################################

The main difference seems to be using "count" instead of "each_byte". But why
is it 5-6x faster to "count" 9.times than "each_byte" only once?
Or did i ignore anything?

ralf
(-,-)Zzz

Ralf Müller wrote:

tried with Hash instead of Array: ram@lilith:~/src/ruby$time ruby -e 'str=STDIN.gets; h=Hash.new(); (0..9).each {|i| h[i.to_s] = 0}; (0..str.length-1).each {|i| h[str[i..i]] += 1 }; h.each {|k,v| print "#{k} #{v}\n"} ' < pi.out | sort

why@stungun$ time ruby -n -e 'c={};$_.each_byte{|b|c[b]||=0; c[b]+=1}; c.sort.each {|k,v| puts "#{k - ?0} #{v}"}' < pi.out
0 3454
1 3514
2 3358
3 3452
4 3483
5 3510
6 3488
7 3431
8 3482
9 3464

real 0m0.052s
user 0m0.051s
sys 0m0.000s

same idea as yours. just runs 3x swift.

_why

The above code is pretty nice.. I didn't know about the
'count'-method. The reason this is faster, is because the 'big loop',
the scan over the string, is in this code done internally, in the C
implementation of the count method. The other code has the loop
explicitly in ruby-code, which is slower.

The difference however is that the above code does the scan over the
String several times (# of chars to check - 1) in C, while all the other
code scans the String only once in ruby.

Probably there is a trade-off... maybe if you would wanna count all
occurences of [a-zA-Z0-9], then you would have to do the
'count'-method 57 times (57 because the above code assumes that only
the characters you count appear in the String), effectively scanning
the String 57 times. This might be slower than scanning the String
only once in ruby-code. (didn't test this though...)

Ruben

···

At Thu, 17 Jun 2004 16:48:21 +0900, Ralf Müller wrote:

11111111111111111111111111111111111111111111111111111111111111
ram@lilith:~/src/ruby$cat rec_pi.rb
n = STDIN.gets
  tot = 0
  9.times do |i|
    c = n.count(i.to_s)
    tot += c
    puts "#{i} #{c}"
  end
  puts "9 #{n.size - tot}"

The main difference seems to be using "count" instead of "each_byte". But why
is it 5-6x faster to "count" 9.times than "each_byte" only once?
Or did i ignore anything?

I typo a lot, so no command line. Maybe 150-170x speed of OP.

  n = STDIN.gets
  tot = 0
  9.times do |i|
    c = n.count(i.to_s)
    tot += c
    puts "#{i} #{c}"
  end
  puts "9 #{n.size - tot}"

···

On Thu, 17 Jun 2004, why the lucky stiff wrote:

Ralf Müller wrote:

>tried with Hash instead of Array:
>ram@lilith:~/src/ruby$time ruby -e 'str=STDIN.gets; h=Hash.new(); (0..9).each
>{|i| h[i.to_s] = 0}; (0..str.length-1).each {|i| h[str[i..i]] += 1 }; h.each
>{|k,v| print "#{k} #{v}\n"} ' < pi.out | sort
>
>
why@stungun$ time ruby -n -e 'c={};$_.each_byte{|b|c[b]||=0; c[b]+=1};
c.sort.each {|k,v| puts "#{k - ?0} #{v}"}' < pi.out

same idea as yours. just runs 3x swift.

--
Relm

There is a trade-off, but you may need to throw in some kanji or unicode
to get there. With just [a-zA-Z0-9], a single scan in Ruby is still
slower. Here are some timings:

              user system total real
array: 0.880000 0.080000 0.960000 ( 0.958946)
hash: 1.210000 0.070000 1.280000 ( 1.330615)
count: 0.100000 0.000000 0.100000 ( 0.091805)
delete: 0.030000 0.020000 0.050000 ( 0.051760)

Both 'array' and 'hash' use #each_byte (M iterations in Ruby code),
'count' uses #count repeatedly (62*M iterations in C code), and 'delete'
uses #delete to partition the string into 5 substrings and #delete! to
count characters (5*M + 62*M/5 = about 17*M iterations in C code).

And the benchmark code:

#!/usr/bin/ruby -w

str = ['a'..'z','A'..'Z',0..9].map {|x| x.to_a}.flatten.join * 10000

require 'benchmark'

Benchmark.bm(8) do |bm|
  bm.report('array:') do
    a = Array.new(255,0);
    str.each_byte {|i| a[i] += 1};
    [?0..?9,?a..?z,?A..?Z].map {|x| x.to_a}.flatten.each {|i|
      "#{[i].pack("c")} #{a[i]}"
    };
  end
  bm.report('hash:') do
    c={}
    str.each_byte{|b|c[b]||=0; c[b]+=1}
    c.sort.each {|k,v| "#{k.chr} #{v}"}
  end
  bm.report('count:') do
    ['0'..'9','a'..'z','A'..'Z'].map {|x| x.to_a}.flatten.each do |i|
      "#{i} #{str.count(i)}"
    end
  end
  bm.report('delete:') do
    ['0'..'9','a'..'m','n'..'z','A'..'M','N'..'Z'].each do |range|
      s = str.delete("^#{range.first}-#{range.last}")
      range.each do |i|
        n = s.size
        s.delete!(i)
        "#{i} #{n - s.size}"
      end
    end
  end
end

__END__

···

On Thu, 17 Jun 2004, Ruben wrote:

The difference however is that the above code does the scan over the
String several times (# of chars to check - 1) in C, while all the other
code scans the String only once in ruby.

Probably there is a trade-off... maybe if you would wanna count all
occurences of [a-zA-Z0-9], then you would have to do the
'count'-method 57 times (57 because the above code assumes that only
the characters you count appear in the String), effectively scanning
the String 57 times. This might be slower than scanning the String
only once in ruby-code. (didn't test this though...)

--
Relm

I tried this

ram@lilith:~/src/ruby$cat trcount-simple.rb
#!/usr/local/bin/ruby
#num=STDIN.gets
num=STDIN.gets
# prepare an array
a = Array.new;
(0..9).each {|i| a[i] = 0}
# for num in an single line
#(0..num.length).each {|n| a[num[n..n].to_i] += 1 }
(0..num.length-1).each {|n| a[num[n..n].to_i] += 1 }
# print the result
a.each_index {|index| print "#{index} #{a[index]}\n" }
ram@lilith:~/src/ruby$time ruby trcount-simple.rb < pi.out
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.964s
user 0m0.950s
sys 0m0.010s
ram@lilith:~/src/ruby$time ruby -e 'num=STDIN.gets; a=Array.new; (0..9).each
{|i| a[i] = 0
}; (0..num.length).each {|i| a[num[i..i].to_i] += 1 }; a.each_index {|i| print
"#{i} #{a[i
]}\n"} ' < pi.out
0 16096
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.955s
user 0m0.940s
sys 0m0.000s

It does not seem to make any difference if the source i given at the command
line or in a separate file:
ram@lilith:~/src/ruby$cat rec_pi.rb
n = STDIN.gets
  tot = 0
  9.times do |i|
    c = n.count(i.to_s)
    tot += c
    puts "#{i} #{c}"
  end
  puts "9 #{n.size - tot}"
ram@lilith:~/src/ruby$time ruby rec_pi.rb < pi.out
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.048s
user 0m0.040s
sys 0m0.010s
ram@lilith:~/src/ruby$time ruby -e 'n = STDIN.gets; tot = 0; 9.times do |i| c
= n.count(i.to_s); tot += c; puts "#{i} #{c}"; end; puts "9 #{n.size - tot}"'
<pi.out
0 16095
1 16107
2 15965
3 15915
4 15967
5 16169
6 15934
7 16091
8 15999
9 15790

real 0m0.043s
user 0m0.010s
sys 0m0.030s

But maybe, you're right and these conditions do just not lead to that
difference. Under which circumstances does the difference occur?

regards
ralf

···

Am Donnerstag, 17. Juni 2004 00:00 schrieb Relm:

On Thu, 17 Jun 2004, why the lucky stiff wrote:
> Ralf Müller wrote:
> >tried with Hash instead of Array:
> >ram@lilith:~/src/ruby$time ruby -e 'str=STDIN.gets; h=Hash.new();
> > (0..9).each {|i| h[i.to_s] = 0}; (0..str.length-1).each {|i|
> > h[str[i..i]] += 1 }; h.each {|k,v| print "#{k} #{v}\n"} ' < pi.out |
> > sort
>
> why@stungun$ time ruby -n -e 'c={};$_.each_byte{|b|c[b]||=0; c[b]+=1};
> c.sort.each {|k,v| puts "#{k - ?0} #{v}"}' < pi.out
>
> same idea as yours. just runs 3x swift.

I typo a lot, so no command line. Maybe 150-170x speed of OP.

  n = STDIN.gets
  tot = 0
  9.times do |i|
    c = n.count(i.to_s)
    tot += c
    puts "#{i} #{c}"
  end
  puts "9 #{n.size - tot}"