Heavy loop functions slow

Alright, so I was playing with my large amounts of data and ran into yet
another problem: shoving it into a loop that requires a substantial
amount of memory.

dataArray = []
output = arrayOut.to_s.chop!.split(",")

output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

count = 0
    output1.each do |out|
      out = out.to_i
      push = hashRange[out]
      dataArray << push
      count+=1
      puts "#{push} - #{count}" #Testing purposes
    end

I broke 'output' up into several blocks for other purposes than just
this loop, but also to see what the effect would be. As you can see
we're talking about almost 1,500,000 array elements.
--> hashRange is a hash, obviously

Problem being: that test line I added, 'puts "#{push} - #{count}"',
confirms that it moves through 1 element every 5-6 sec...
After doing my math, that's about 86 days to finish 1,500,000 elements :frowning:

Any ideas that would speed this up are much appreciated!! Otherwise I'll
be back in 3 months IF I don't get an error :smiley:

Thanks,

- Mac

···


Try benchmarking further to isolate exactly which code is slow. Without knowing what hashRange and output contain, it is not obvious where the slowness comes from. For instance, if hashRange = {} and output = (0..1_000_000).to_a, this code takes relatively little time to execute.
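
For example, something along these lines (a sketch; it assumes hashRange and output1 are already defined) would show which piece dominates:

require 'benchmark'

# time the string-to-integer conversion and the hash lookup separately
Benchmark.bm(8) do |bm|
  bm.report("to_i")   { output1.each { |out| out.to_i } }
  bm.report("lookup") { output1.each { |out| hashRange[out.to_i] } }
end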

Alright, so I was playing with my large amounts of data and ran into yet
another problem: shoving it into a loop that requires a substantial
amount of memory.

dataArray = []
output = arrayOut.to_s.chop!.split(",")

Set arrayOut to nil if you don't need it any more.

output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

You don't need output here; set it to nil to allow for garbage collection.
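
E.g. (a sketch; GC.start is only a nudge - the collector would reclaim the unreferenced arrays on its own eventually):

arrayOut = nil # done with the source array
output = nil   # the four slices hold everything that's needed
GC.start       # optional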

count = 0
    output1.each do |out|
      out = out.to_i
      push = hashRange[out]
      dataArray << push
      count+=1
      puts "#{push} - #{count}" #Testing purposes
    end

1. You can convert the output to numbers in one pass, though use
Benchmark to see the actual gain:

output = arrayOut.to_s.chop!.split(",").map {|out| out.to_i }

2. if you are looking for numbers only, you can do something like

output = []
arrayOut.to_s.chop!.scan(/\d+/) {|out| output << out.to_i }
(you can count the items, and switch to output2 when output1 has
enough, thus 1. creating smaller arrays, 2. doing two things in one
step.)

3. Even in this case, you still have both the original arrayOut and
the long string (.to_s) in memory.
It might be faster if you could iterate through the source without
creating the intermediate string. The question is: 1. will it help? 2.
is it worth it?
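
For instance, a sketch that skips the big intermediate string entirely by scanning as the data is read (data.txt is a stand-in name for wherever the numbers live):

output = []
IO.foreach("data.txt") do |line|  # reads one line at a time
  line.scan(/\d+/) { |num| output << num.to_i }
end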

···

On Tue, Apr 8, 2008 at 7:17 AM, Michael Linfield <globyy3000@hotmail.com> wrote:

Obviously there is a lot of code missing from the piece above. Can
you explain what you are trying to achieve? What is your input file
format and what kind of transformation do you want to do on it? I
looked through your other postings but it did not become clear to me.

Cheers

robert

···

2008/4/8, Michael Linfield <globyy3000@hotmail.com>:


--
use.inject do |as, often| as.you_can - without end

Robert Klemme wrote:

output2 = output[356131..712260]
    end
Any ideas that would speed this up are much appreciated!! Otherwise I'll
be back in 3 months IF I dont get an error :smiley:

Obviously there is a lot of code missing from the piece above. Can
you explain what you are trying to achieve? What is your input file
format and what kind of transformation do you want to do on it? I
looked through your other postings but it did not become clear to me.

Cheers

robert

Alright, here's the breakdown of everything.

dataArray = []

# arrayOut consists of all the integer data stored in a text file.
# it's loaded via IO.foreach("data.txt"){|x| arrayOut << x}
# dataArray being just a predefined array, ie: dataArray = []

output = arrayOut.to_s.chop!.split(",")

#Each of these outputs breaks down this huge array into 4 smaller arrays
output1 = output[0..356130]
output2 = output[356131..712260]
output3 = output[712261..1068390]
output4 = output[1068391..1424521]

#hashRange[out] is basically calling a hash in the following context.
# hash = { 1 => { 20000..30000 => 12345 } }
#so 'out' is calling the range key that contains its defined value
#basically it's saying hashRange[25000] #=> 12345 as an example

#everything imported to dataArray is a string, so it must be converted
#to an integer to correctly match the range key

#after benchmarking some elements of the loop below, it turns out that
#the push = hashRange[out] line is what's slowing everything down.
#every time a nil 'out' is shoved into the query it takes about 8sec;
#when it's a correct number, about 5sec

#the hashRange file is about 78mb, which I had to load in as
#8 separate data files, then shove those into an eval to convert them
#to a hash

count = 0
    output1.each do |out|
      out = out.to_i
      push = hashRange[out]
      dataArray << push
      count+=1
      puts "#{push} - #{count}" #Testing purposes
    end

#I guess what I need now is a faster way to access this pre-defined
#hash. SQL is one possibility but that could be considered a whole
#other forum post :slight_smile:
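
(A guess at what that lookup amounts to: a plain Hash only matches keys by equality, so a point query against Range keys has to scan them, which would explain the cost. A rough sketch of that kind of lookup, using the shape from the comments above:)

hashRange = { 20000..30000 => 12345 }

def range_lookup(hash, value)
  # linear scan over the ranges until one contains the value
  hash.each { |range, label| return label if range.include?(value) }
  nil
end

range_lookup(hashRange, 25000) #=> 12345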

Any other questions feel free to ask,
You guys' insight is much appreciated.

Thanks again,

- Mac

···

2008/4/8, Michael Linfield <globyy3000@hotmail.com>:


Let's see whether I understood correctly: you have a file with
multiple integer numbers per line. You have defined a range mapping,
i.e. each interval an int can be in has a label. You want to read in
all ints and output their labels.

If this is correct, this is what I'd do:

$ ruby -e '20.times {|i| puts i}' >| x
14:54:37 /c/Temp
$ ./rl.rb x
low
low
medium
medium
medium
high
high
high
high
high
no label
no label
no label
no label
no label
no label
no label
no label
no label
no label
14:54:41 /c/Temp
$ cat rl.rb
#!/bin/env ruby

class RangeLabels
  def initialize(labels)
    @labels = labels.sort_by {|key,lab| key}
  end

  def lookup(val)
    # slow, this can be improved by binary search!
    @labels.each do |key, lab|
      return lab if val < key
    end
    "no label"
  end
end

rl = RangeLabels.new [
  [2, "low"],
  [5, "medium"],
  [10, "high"],
]

ARGF.each do |line|
  first = true
  line.scan /\d+/ do |val|
    if first
      first = false
    else
      print ", "
    end

    print rl.lookup(val.to_i)
  end

  print "\n"
end
14:54:52 /c/Temp
$
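
For a bigger label table, the "binary search" comment in lookup could be cashed out like this (a sketch that keeps the same @labels layout):

class RangeLabels
  # drop-in replacement: bisect the sorted thresholds instead of scanning
  def lookup(val)
    lo, hi = 0, @labels.size
    while lo < hi
      mid = (lo + hi) / 2
      if val < @labels[mid][0]
        hi = mid
      else
        lo = mid + 1
      end
    end
    lo < @labels.size ? @labels[lo][1] : "no label"
  end
end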

Kind regards

robert

···

2008/4/8, Michael Linfield <globyy3000@hotmail.com>:


That would work, but even with marshal dumping, the data set is just too
large for memory to handle quickly. I think I'm going to move the
project over to PostgreSQL and see if that doesn't speed things up a
considerable amount. Thanks, Robert.

- Mac

···


That would work, but even with marshal dumping, the data set is just too large for memory to handle quickly.

Which data set - the range definitions or the output? I thought this was a one-off process that transforms a large input file into a large output file.

I think I'm going to move the project over to PostgreSQL and see if that doesn't speed things up a considerable amount, Thanks Robert.

That's of course an option. But I still feel kind of at a loss about what exactly you are doing. Is this just a single processing step in a much larger application?

Cheers

  robert

···

On 09.04.2008 00:30, Michael Linfield wrote:


The dump would be of the pre-defined hash, so the information can be
retrieved faster.

To answer your 2nd question: yes, this is just a single step in a very
large 12-step application. I'm hoping to condense it down to about 8
steps when I finish. This step alone involves transforming a large
dataset into a smaller dataset.

I'm trying to extract all the numbers between ranges and push the keys
of the hash results into a file. This file will then be opened by
another part of the step process to be analyzed.

IE:
if the transformation involved the file of:
12345
67423
97567
45345
etc.
I would want to pull all of those numbers and get the keys for those
hash ranges
IE:
12000..15000 => 100
60000..70000 => 250
etc.

So 12345 would fall in the range of 12000..15000, so the output file
would get 100 added to it. Then the next step would be analyzing the
results (IE: 100).
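
In code, the whole step would look roughly like this (numbers.txt is just a stand-in name for the input file):

ranges = { 12000..15000 => 100, 60000..70000 => 250 }

File.open("mapped.txt", "w") do |out|
  IO.foreach("numbers.txt") do |line|
    n = line.to_i
    pair = ranges.find { |range, label| range.include?(n) }
    out.puts pair.last if pair  # pair is [range, label]
  end
end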
Hope this explains things a bit better.

Thanks,

- Mac

···

On 09.04.2008 00:30, Michael Linfield wrote:



The dump would be of the pre-defined hash, so the information can be retrieved faster.

I would not use the term "hash" here because this is an implementation detail. Basically what you want to store is a mapping from input numbers to output numbers via ranges, isn't it?

To answer your 2nd question: yes, this is just a single step in a very large 12-step application. I'm hoping to condense it down to about 8 steps when I finish. This step alone involves transforming a large dataset into a smaller dataset.

I'm trying to extract all the numbers between ranges and push the keys of the hash results into a file. This file will then be opened by another part of the step process to be analyzed.

IE:
if the transformation involved the file of:
12345
67423
97567
45345
etc.
I would want to pull all of those numbers and get the keys for those hash ranges
IE:
12000..15000 => 100
60000..70000 => 250
etc.

How many of those ranges do you have? Is there any mathematical relation between each input range and its output value?

So 12345 would fall in the range of 12000..15000, so the output file would get 100 added to it. Then the next step would be analyzing the results (IE: 100).

So let me rephrase to make sure I understood properly: you are reading a large amount of numbers and mapping each one to another number (via ranges). Mapped numbers are input to the next processing stage. It seems you would want to output each mapped value only once; this immediately suggests set semantics.

Hope this explains things a bit better.

Yes, we're getting there. :slight_smile: Actually I find this a nice exercise in requirements extrapolation. In this case I try to extract the requirements from you (aka the customer). :slight_smile:

Kind regards

  robert

How about

#!/bin/env ruby

require 'set'

class RangeLabels
   def initialize(labels, fallback = nil)
     @labels = labels.sort_by {|key,lab| key}
     @fallback = fallback
   end

   def lookup(val)
     # slow if there are many ranges
     # this can be improved by binary search!
     @labels.each do |key, lab|
       return lab if val < key
     end
     @fallback
   end

   alias [] lookup
end

rl = RangeLabels.new [
   [12000, 50],
   [15000, 100],
   [60000, nil],
   [70000, 250],
]

output = Set.new

ARGF.each do |line|
   line.scan /\d+/ do |val|
     x = rl[val.to_i] and output << x
   end
end

puts output.to_a
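
With Mac's four sample numbers (12345, 67423, 97567, 45345) in a file, a run would print something like:

$ ruby rl.rb numbers.txt
100
250

(97567 and 45345 fall outside the labelled ranges, map to the nil fallback, and never enter the set.)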

···

On 09.04.2008 19:53, Michael Linfield wrote:


cfp:~ > cat a.rb

···

#
# use narray for fast ruby numbers
#
   require 'rubygems'
   require 'narray'

#
# ton-o-data
#
   huge = NArray.int(2 ** 25).indgen * 100 # 0, 100, 200, 300, etc

#
# bin data
#
# 0...100 -> 0
# 100...200 -> 1
# 200...300 -> 2
# etc...
#

   a = Time.now.to_f

   p huge

   huge.div! 100 # 42 -> 0, 127 -> 1, 2227 -> 22

   b = Time.now.to_f

   elapsed = b - a

   p elapsed

   p huge

cfp:~ > ruby a.rb
NArray.int(33554432):
[ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, ... ]
0.202844142913818
NArray.int(33554432):
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ... ]

so that's doing about 33 million elements in around 2/10ths of a second....
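
The integer-division trick assumes uniform bin widths; turning bin indices into labels would then just be an array index, e.g. with a hypothetical labels table:

labels = ["low", "medium", "high"]
p [42, 127, 250].map { |v| labels[v / 100] }  # => ["low", "medium", "high"]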

a @ http://codeforpeople.com/
--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

Better for whom?

···

On 10/04/2008, at 7:27 AM, ara.t.howard wrote:

we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

Better for whom?

for my wife - obviously! :wink:

a @ http://codeforpeople.com/

···

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama