Robert Klemme wrote:
That would work, but even with Marshal dumping, the data set is just too large for memory to handle quickly.
Which data set - the range definitions or the output? I thought this was
a one-off process that transforms a large input file into a large output
file.
I think I'm going to move the project over to PostgreSQL and see if that doesn't speed things up a considerable amount. Thanks, Robert.
That's of course an option. But I still feel kind of at a loss about
what exactly you are doing. Is this just a single processing step in a
much larger application?
The dump would be of the pre-defined hash, so the information could be retrieved faster.
I would not use the term "hash" here because that is an implementation detail. Basically, what you want to store is the mapping from input numbers to output numbers via ranges, don't you?
To answer your second question: yes, this is just a single step in a very large 12-step application. I'm hoping to condense it down to about 8 steps when I finish. This step alone involves transforming a large dataset into a smaller one.
I'm trying to extract all the numbers that fall within the ranges and push the mapped values of the hash results into a file. This file will then be opened by another part of the process to be analyzed.
IE:
if the transformation involved the file of:
12345
67423
97567
45345
etc.
I would want to pull all of those numbers and get the mapped values for those hash ranges
IE:
12000..15000 => 100
60000..70000 => 250
etc.
How many of those ranges do you have? Is there any mathematical relation between each input range and its output value?
So 12345 would fall in the range 12000..15000, so the output file would get 100 added to it. The next step would then analyze the results (i.e., 100).
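The mapping described above can be sketched in Ruby using a Hash with Range keys. This is only an illustration of the requirement, not code from the thread; the constant and method names are made up here:

```ruby
# Hypothetical sketch: ranges taken from the example in the thread.
RANGES = {
  12_000..15_000 => 100,
  60_000..70_000 => 250,
}

# Return the mapped value for n, or nil if no range covers it.
def map_number(n, ranges = RANGES)
  ranges.each { |range, label| return label if range.cover?(n) }
  nil
end

map_number(12_345)  # => 100
map_number(99_999)  # => nil
```

With only a handful of ranges, a linear scan like this is fine; it is the large number of lookups, not the number of ranges, that dominates here.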
So let me rephrase it to make sure I understood properly: you are reading a large amount of numbers and mapping each number to another one (via ranges). The mapped numbers are input to the next processing stage. It seems you want to output each mapped value only once; this immediately suggests set semantics.
Hope this explains things a bit better.
Yes, we're getting there.
Actually, I find this a nice exercise in requirements extraction. In this case I try to extract the requirements from you (aka the customer).
Kind regards
robert
How about
#!/bin/env ruby

require 'set'

class RangeLabels
  # labels is a list of [upper_boundary, label] pairs; a value maps to
  # the label of the first boundary it is smaller than.
  def initialize(labels, fallback = nil)
    @labels = labels.sort_by { |key, _| key }
    @fallback = fallback
  end

  def lookup(val)
    # slow if there are many ranges
    # this can be improved by binary search!
    @labels.each do |key, lab|
      return lab if val < key
    end
    @fallback
  end

  alias [] lookup
end

# Boundaries derived from the example ranges: values from 12000 up to
# 15000 map to 100, values from 60000 up to 70000 map to 250, and
# everything else maps to nil (no label).
rl = RangeLabels.new [
  [12000, nil],
  [15000, 100],
  [60000, nil],
  [70000, 250],
]

output = Set.new

ARGF.each do |line|
  line.scan(/\d+/) do |val|
    x = rl[val.to_i] and output << x
  end
end

puts output.to_a
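The comment in lookup hints at a binary-search improvement. Since Ruby 2.0 the standard library offers Array#bsearch, which makes that an O(log n) one-liner; the class name below is an assumption for illustration, not from the thread:

```ruby
# Hypothetical variant of the lookup above using Array#bsearch (Ruby >= 2.0).
class FastRangeLabels
  def initialize(labels, fallback = nil)
    @labels = labels.sort_by { |key, _| key }
    @fallback = fallback
  end

  # Binary search for the first boundary strictly greater than val.
  # The predicate is false for small keys and true for large ones,
  # which is exactly the monotone shape bsearch's find-minimum mode needs.
  def lookup(val)
    entry = @labels.bsearch { |(key, _)| val < key }
    entry ? entry[1] : @fallback
  end

  alias [] lookup
end

frl = FastRangeLabels.new [
  [12_000, nil],
  [15_000, 100],
  [60_000, nil],
  [70_000, 250],
]
frl[12_345]  # => 100
frl[65_000]  # => 250
```

With only four boundaries the gain is negligible, but if the range table grows to thousands of entries this keeps each of the many lookups cheap.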
On 09.04.2008 19:53, Michael Linfield wrote:
On 09.04.2008 00:30, Michael Linfield wrote: