> > For a problem of this scale, it seems like it would make sense to use a
> > custom class that had some of the methods of String - enough for the
> > callees to treat it like a String - but not in fact String. Give it a
> > .to_s method to convert to a real String when it's really necessary.
> But you still have the GC overhead - regardless of whether you create
> Strings or Objects.
Not necessarily. If you set the contract, as I have, that the object
is valid only for the duration of execution of the callback to which
it's passed, and that validity expires once the callback returns, you
can allocate 50 (or whatever) of these read-only pseudo-String objects
at the start of reading the file, and re-use them for each row. The key
is that anything that the callback wants to save for use after this
particular call has exited will have to be copied.
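To make the contract concrete, here's a rough sketch of the shape of
such an object. The class name and methods are mine for illustration,
not the actual interface; in the C extension the backing storage would
be a slice of the read buffer rather than a Ruby String:

class RowField
  def initialize
    @buf = nil
  end

  # The reader points this at the next row's data, overwriting the
  # previous contents; anything saved from an earlier row is now stale.
  def reuse(buf)
    @buf = buf
    self
  end

  # Read-only, String-like methods: enough for callees to examine
  # the value without copying it.
  def length;     @buf.length;    end
  def ==(other);  @buf == other;  end
  def <=>(other); @buf <=> other; end

  # Copying is explicit: to_s yields a real String that remains
  # valid after the callback returns.
  def to_s
    @buf.dup
  end
end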
> I.e., that way you would create a single String per line while
> allowing the caller to create substring instances if needed. If the
> client code needs to do that anyway, you could just as well create
> those String instances yourself, because it does not make any
> difference. If not, you save a factor of 50 in creations.
When you're talking about processing twenty- or thirty-million-row
files, even with the factor-of-50 savings you've still got a problem.
(Though I should benchmark this to see how it compares. A quick
comparison on my machine:
a = "aaaa"
n = 10_000_000
n10 = 100_000_000
n.times { } # 2 seconds
n.times { rand(n10) } # 16 seconds
n.times { a + 'b' } # 24 seconds
n.times { rand(n10).to_s } # 27 seconds
[Oops, looks like "a + 'b'" isn't a no-copy operation after the first
one.])
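For anyone who wants to reproduce the numbers, the same loops wrap
straightforwardly in Ruby's standard Benchmark module; a minimal
version of the comparison above:

require 'benchmark'

a = "aaaa"
n = 10_000_000
n10 = 100_000_000

Benchmark.bm(12) do |bm|
  bm.report("empty block") { n.times { } }
  bm.report("rand")        { n.times { rand(n10) } }
  bm.report("a + 'b'")     { n.times { a + 'b' } }
  bm.report("rand.to_s")   { n.times { rand(n10).to_s } }
end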
I should perhaps explain that the common case is that the strings
are examined, but not modified. Particularly common is a relational
restriction: abort processing that row (for that particular chain of
Conditions and Results) if a data item doesn't match a particular value,
or fall in between some particular values, or whatever. Actually needing
to copy a value is relatively rare. More frequent would be incrementing
a counter if a value matches some condition, or adding the value to a
total.
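In code terms, a typical chain looks something like the following
sketch; the each_row iterator and the numeric field indexing are
hypothetical stand-ins for the real interface:

total = 0.0
count = 0
scanner.each_row do |row|
  # Relational restriction: abandon the row entirely unless the
  # status field matches. No String is copied for rows we discard.
  next unless row[3] == "ACTIVE"

  # Examine-and-accumulate: no lasting copy of the field is kept,
  # just a number derived from it. (This assumes the pseudo-String
  # provides to_f; otherwise to_s.to_f would copy first.)
  count += 1
  total += row[7].to_f
end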
I should also explain that having these pseudo-strings as actual Ruby
objects immediately after parsing is partly a convenience; in something
that needed to be really fast, the initial Conditions, at least, and
later ones and Results, if they match frequently enough, would be in C,
and I suppose I could delay creating the String objects until I had to
call a Ruby routine.
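In Ruby terms, the shape of that deferral would be roughly this
(LazyField is hypothetical; in the C extension the raw bytes would sit
in the read buffer, and rb_str_new would be called at most once, on
first access):

class LazyField
  def initialize(&builder)
    @builder = builder  # knows how to build the String from raw bytes
    @str = nil
  end

  # The String is materialized only the first time Ruby code asks.
  def to_s
    @str ||= @builder.call
  end
end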
But as it stands, even the callbacks to Ruby are surprisingly fast, if
they don't do a lot. On my machine, for an 800,000 line file (entirely
cached in memory), a scan running only C code takes 1.8 seconds of
CPU. Doing an rb_str_new on each line brings it up to 2.3 seconds;
doing instead a call into a Ruby function that does a simple string
comparison against an instance variable takes 3.4 seconds.
> But I'd say chances are that client code will do more complex
> manipulations....
As above; in many cases the initial comparisons to discard the line can
be done by simple C code.
> Here's another variant: if you mmap the file and it fits into mem...
Sometimes not my case, unfortunately; I have to be able to deal with
files several gigabytes long on a 32-bit machine.
Thanks for all of the useful advice.
cjs
On 2007-12-11 17:22 +0900 (Tue), Robert Klemme wrote:
2007/12/11, Tim Hunter <TimHunter@nc.rr.com>:
--
Curt Sampson <cjs@starling-software.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-software.com