> > For a problem of this scale, it seems like it would make sense to use a
> > custom class that had some of the methods of String - enough for the
> > callees to treat it like a String - but not in fact String. Give it a
> > .to_s method to convert to a real String when it's really necessary.
> But you still have the GC overhead - regardless of whether you create
> Strings or Objects.
Not necessarily. If you set the contract, as I have, that the object
is valid only for the duration of execution of the callback to which
it's passed, and that validity expires once the callback returns, you
can allocate 50 (or whatever) of these read-only pseudo-String objects
at the start of reading the file, and re-use them for each row. The key
is that anything that the callback wants to save for use after this
particular call has exited will have to be copied.
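To make the contract concrete, here's a rough sketch of the shape of
such an object. The class name and methods are mine for illustration,
not the actual interface; in the C extension the backing storage would
be a slice of the read buffer rather than a Ruby String:

class RowField
  def initialize
    @buf = nil
  end

  # The reader points this at the next row's data, overwriting the
  # previous contents; anything saved from an earlier row is now stale.
  def reuse(buf)
    @buf = buf
    self
  end

  # Read-only, String-like methods: enough for callees to examine
  # the value without copying it.
  def length;     @buf.length;    end
  def ==(other);  @buf == other;  end
  def <=>(other); @buf <=> other; end

  # Copying is explicit: to_s yields a real String that remains
  # valid after the callback returns.
  def to_s
    @buf.dup
  end
end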
> I.e., that way you would create a single String per line while
> allowing the caller to create substring instances if needed. If the
> client code needs to do that anyway, you could just as well create
> those String instances yourself, because it does not make any
> difference. If not, you save a factor of 50 in creations.
When you're talking about processing twenty- or thirty-million-row
files, even with the factor-of-50 savings you've still got a problem.
(Though I should benchmark this to see how it compares. A quick
comparison on my machine:
a = "aaaa"
n = 10_000_000
n10 = 100_000_000
n.times { } # 2 seconds
n.times { rand(n10) } # 16 seconds
n.times { a + 'b' } # 24 seconds
n.times { rand(n10).to_s } # 27 seconds
[Oops, looks like "a + 'b'" isn't a no-copy operation after the first
one.])
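For anyone who wants to reproduce the numbers, the same loops wrap
straightforwardly in Ruby's standard Benchmark module; a minimal
version of the comparison above:

require 'benchmark'

a = "aaaa"
n = 10_000_000
n10 = 100_000_000

Benchmark.bm(12) do |bm|
  bm.report("empty block") { n.times { } }
  bm.report("rand")        { n.times { rand(n10) } }
  bm.report("a + 'b'")     { n.times { a + 'b' } }
  bm.report("rand.to_s")   { n.times { rand(n10).to_s } }
end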
I should perhaps explain that the common case is that the strings
are examined, but not modified. Particularly common is a relational
restriction: abort processing that row (for that particular chain of
Conditions and Results) if a data item doesn't match a particular value,
or fall in between some particular values, or whatever. Actually needing
to copy a value is relatively rare. More frequent would be incrementing
a counter if a value matches some condition, or adding the value to a
total.
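In code terms, a typical chain looks something like the following
sketch; the each_row iterator and the numeric field indexing are
hypothetical stand-ins for the real interface:

total = 0.0
count = 0
scanner.each_row do |row|
  # Relational restriction: abandon the row entirely unless the
  # status field matches. No String is copied for rows we discard.
  next unless row[3] == "ACTIVE"

  # Examine-and-accumulate: no lasting copy of the field is kept,
  # just a number derived from it. (This assumes the pseudo-String
  # provides to_f; otherwise to_s.to_f would copy first.)
  count += 1
  total += row[7].to_f
end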
I should also explain that having these pseudo-strings as actual Ruby
objects immediately after parsing is partly a convenience; in something
that needed to be really fast, the initial Conditions, at least, and
later ones and Results, if they match frequently enough, would be in C,
and I suppose I could delay creating the String objects until I had to
call a Ruby routine.
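In Ruby terms, the shape of that deferral would be roughly this
(LazyField is hypothetical; in the C extension the raw bytes would sit
in the read buffer, and rb_str_new would be called at most once, on
first access):

class LazyField
  def initialize(&builder)
    @builder = builder  # knows how to build the String from raw bytes
    @str = nil
  end

  # The String is materialized only the first time Ruby code asks.
  def to_s
    @str ||= @builder.call
  end
end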
But as it stands, even the callbacks to Ruby are surprisingly fast, if
they don't do a lot. On my machine, for an 800,000 line file (entirely
cached in memory), a scan running only C code takes 1.8 seconds of
CPU. Doing an rb_str_new on each line brings it up to 2.3 seconds;
doing instead a call into a Ruby function that does a simple string
comparison against an instance variable takes 3.4 seconds.
> But I'd say chances are that client code will do more complex
> manipulations....
As above; in many cases the initial comparisons to discard the line can
be done by simple C code.
> Here's another variant: if you mmap the file and it fits into mem...
Sometimes not my case, unfortunately; I have to be able to deal with
files several gigabytes long on a 32-bit machine.
Thanks for all of the useful advice.
cjs
On 2007-12-11 17:22 +0900 (Tue), Robert Klemme wrote:
2007/12/11, Tim Hunter <TimHunter@nc.rr.com>:
--
Curt Sampson <cjs@starling-software.com> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-software.com