This is my first time on a mailing list, so go easy on me.
Hopefully a specific question with a simple yes/no answer:
From the docs for String#[offset,length]="blah" it seemed to imply that if
the length of the new fragment was the same as the length being replaced,
nothing would need to be re-allocated.
Yet looking through the source (briefly) it seems like no matter what, the
entire string gets built from scratch every time.
Can anyone confirm:
a) that I've interpreted things correctly and a new string is built every
time, regardless of the lengths matching?
b) If that is the case, is there in fact any way to replace a like-sized
fragment of a string that does indeed simply splat (technical term) the new
data over the old data?
ps. I know I can use String#setbyte on a byte-by-byte basis, but that would
be a pain if I wanted to replace a 5k fragment in the middle of a 20 MB string.
pps. Also happy to hear that I've completely misunderstood the
String#[num,num]='xxx' code.
> This is my first time on a mailing list, so go easy on me.
Welcome!
> Hopefully a specific question with a simple yes/no answer:
> From the docs for String#[offset,length]="blah" it seemed to imply that if
> the length of the new fragment was the same as the length being replaced,
> nothing would need to be re-allocated.
> Yet looking through the source (briefly) it seems like no matter what, the
> entire string gets built from scratch every time.
> Can anyone confirm:
> a) that I've interpreted things correctly and a new string is built every
> time, regardless of the lengths matching.
IIRC String uses copy-on-write internally, so a write has to create a new
internal byte array, at least if the byte array is shared with another
String. There is a flag STR_SHARED: https://github.com/ruby/ruby/blob/trunk/string.c
> b) If that is the case, is there in fact any way to replace a like sized
> fragment of a string that does indeed simply splat (technical term) the new
> data over the old data.
> ps. I know I can use String#setbyte on a byte by byte basis, but that would
> be a pain if wanted to replace a 5k fragment in the middle of a 20mb string.
Even that might create a new internal byte array because of copy-on-write
(see above).
> pps. Also happy to hear that I've completely misunderstood the
> String#[num,num]='xxx' code.
I don't think you have to worry about these internals. Why do you
think you need to?
Kind regards
robert
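For what it is worth, the behaviour that is visible from Ruby code can be checked without reading string.c. The snippet below is only a sketch: it shows that a same-length slice assignment keeps the same String object and that a dup'ed copy is unaffected. Whether the copy shares its byte array with the original until the first write is an MRI implementation detail that this code does not (and cannot) prove.

```ruby
# Sketch: observable behaviour of slice assignment on a String that has a copy.
original = ("\x00" * 16).b   # 16 zero bytes, binary (ASCII-8BIT) encoding
copy     = original.dup      # may share bytes internally (assumption about MRI)

id_before = original.object_id
original[4, 4] = "\x01\x02\x03\x04".b   # same-length replacement

same_object    = original.object_id == id_before  # still the same String object
copy_untouched = copy.bytes.all?(&:zero?)         # the copy did not see the write

puts "same object: #{same_object}"
puts "copy untouched: #{copy_untouched}"
puts "size after write: #{original.bytesize}"
```

If the bytes were shared, the write has to un-share them once; later writes to the same, now unshared, string would not need to.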
···
On Wed, Jun 22, 2016 at 2:04 PM, Sophie Wellow <sophie.wellow@gmail.com> wrote:
Looking in the 2.3.0 documentation (Class: String, Ruby 2.3.0), I only see
this: "If the replacement string is not the same length as the text it is
replacing, the string will be adjusted accordingly." To me, that just says
that the String will grow or shrink appropriately to match the replacement;
it says nothing about reallocations, which is presumably an implementation
detail.
To that end, 'ruby' (aka "MRI" or "CRuby") is only one implementation of
the Ruby language; maybe JRuby or Rubinius do something different under the
hood. (I'd bet quite strongly that JRuby does.)
That said, from what I can see in string.c in trunk
<https://github.com/ruby/ruby/blob/trunk/string.c#L4163> (i.e. Ruby 2.4), MRI
does appear to be reusing the same physical memory location for the string:
shuffling the tail to the left or right if necessary, then blitting the
replacement value. (I didn't track down what happens when the range
parameter is a regexp.)
Cheers
···
On 22 June 2016 at 22:04, Sophie Wellow <sophie.wellow@gmail.com> wrote:
--
Matthew Kerwin http://matthew.kerwin.net.au/
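The three cases described above (same length, shorter, longer) can be seen from the Ruby side with a small sketch. It only shows the resulting contents; whether MRI moves the tail in place or reallocates is internal and is not something this code demonstrates.

```ruby
# Sketch: results of String#[]= for same-length, shorter and longer replacements.
buf = "ABCDEFGH".b

buf[2, 3] = "xyz"     # same length: "CDE" is overwritten, the tail stays put
same = buf.dup

buf[2, 3] = "q"       # shorter: the tail "FGH" ends up three bytes after "AB"
shorter = buf.dup

buf[2, 1] = "12345"   # longer: the tail is pushed to the right
longer = buf.dup

p same      # "ABxyzFGH"
p shorter   # "ABqFGH"
p longer    # "AB12345FGH"
```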
That's the doc fragment indeed. I'd taken "If the replacement string is
not the same length as the text it is replacing, the string will be
adjusted accordingly." to imply that if the replacement string WAS the same
size then the string would not be "adjusted", but I guess I'd read too
much into that.
I can't get 2.4 to install yet... but I'll try it as soon as I can.
Thanks so much.
···
On Wed, Jun 22, 2016 at 2:31 PM, Matthew Kerwin <matthew@kerwin.net.au> wrote:
Thanks for the replies - the reason I am interested is that I'm amusing
myself by writing a simple little image drawing library - just for my use
as an exercise. I'm using strings to hold the binary image data. I was
brought up on C/C++ so I'm all too aware of the kind of stuff going on
under the hood and that speed won't be super... Having said that, it still
seems a bit heavy that it may not even be possible to change a single byte
in an 8 MB string without a complete copy / reallocation each time. If
there's a better basic class to use than String then I'd happily use that
instead.
···
On Wed, Jun 22, 2016 at 3:03 PM, Robert Klemme <shortcutter@googlemail.com> wrote:
I am not familiar enough with MRI's C code to give definitive answers,
and I lack the time to become so. Did you verify that what you claim
above is true under all circumstances? From the brief inspection that I
did yesterday, I believe that there are situations where the data
sitting inside the String instance is _not_ copied. This is a quite
obvious optimization for the case that String data is not shared with
another String instance, so I would be very surprised if that was not
implemented.
Apart from that, if you write an image manipulation library, an
alternative approach is to not (ab)use String for this but create
your own class with proper access and conversion methods. Then you can
optimize memory handling the way you need it to be. You can even
provide a similar set of methods that String offers (e.g. each_byte)
to make switching easier. Whether that approach is better in your case
I don't know. But this is an option that you could consider.
Kind regards
robert
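A minimal sketch of what such a class could look like follows. The name PixelBuffer and every method on it are hypothetical, chosen only for illustration; the storage is still a binary String underneath, which could later be swapped for something else without changing callers.

```ruby
# Hypothetical wrapper class; PixelBuffer and its methods are illustrative
# assumptions, not an existing library API.
class PixelBuffer
  BYTES_PER_PIXEL = 4

  attr_reader :width, :height

  def initialize(width, height)
    @width  = width
    @height = height
    # Binary (ASCII-8BIT) storage, so character indices are plain byte offsets.
    @data = ("\x00" * (width * height * BYTES_PER_PIXEL)).b
  end

  # Overwrite row y with a pre-built row of exactly width * 4 bytes.
  def write_row(y, row)
    row = row.b
    expected = @width * BYTES_PER_PIXEL
    raise ArgumentError, "row must be #{expected} bytes" unless row.bytesize == expected
    @data[y * expected, expected] = row
    self
  end

  # Return the four bytes of the pixel at (x, y) as an Array of Integers.
  def pixel(x, y)
    offset = (y * @width + x) * BYTES_PER_PIXEL
    @data.byteslice(offset, BYTES_PER_PIXEL).bytes
  end

  # Mirror a little of String's interface to make switching easier.
  def each_byte(&block)
    @data.each_byte(&block)
  end

  def bytesize
    @data.bytesize
  end
end

img = PixelBuffer.new(4, 2)
img.write_row(1, "\x01\x02\x03\x04" * 4)
p img.pixel(2, 1)   # [1, 2, 3, 4]
p img.pixel(0, 0)   # [0, 0, 0, 0]
p img.bytesize      # 32
```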
···
On Thu, Jun 23, 2016 at 1:11 AM, Sophie Wellow <sophie.wellow@gmail.com> wrote:
Thanks for that - will take it on board. I finally managed to make some
simple sample code reliably be "strange":
require 'benchmark'
class Bob
  SAMPLE = 4
  def initialize( w, h )
    @w = w
    @h = h
    @buffer = "\x00"*(@w*@h*SAMPLE)
  end
  def fill1
    line = "\x01\x02\x03\x04" * @w
    #line = [0x01020304].pack("L<")* @w
    @h.times do |y|
      @buffer[y*@w*SAMPLE,@w*SAMPLE] = line
    end
  end
  def fill2
    #line = "\x01\x02\x03\x04" * @w
    line = [0x01020304].pack("L<")* @w
    @h.times do |y|
      @buffer[y*@w*SAMPLE,@w*SAMPLE] = line
    end
  end
end
b = Bob.new( 2048, 1024 )
Benchmark.bm(10) do |x|
  x.report('fill1') { b.fill1 }
  x.report('fill2') { b.fill2 }
end
On Ruby 1.9.3:
                 user     system      total        real
fill1        0.000000   0.000000   0.000000 (  0.001082)
fill2        0.530000   0.000000   0.530000 (  0.535084)
On Ruby 2.3.1:
                 user     system      total        real
fill1        0.850000   0.000000   0.850000 (  0.843473)
fill2        0.770000   0.000000   0.770000 (  0.777828)
Notice that the only difference between fill1 and fill2 is how each builds
the replacement fragment - not the size of the fragment.
Notice also that my nice fast fill1 on 1.9.3 is now dead as a dodo on 2.3.1.
Whilst I'm not wanting to get bogged down in fine tuning, the difference
between 0.0 seconds and 0.8 seconds is a little concerning.
All very curious. Please don't wade through tonnes of code looking at
this, it's just a play thing, but any quick observations would be
interesting.
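One quick thing that may be worth checking (a guess, not an established cause of the timings above) is whether the two fragments really are the same kind of String. The sketch below prints the encoding and leading bytes of each; the variable names are made up for this example. A string literal takes the source file's encoding, while Array#pack with "L<" returns a binary string in little-endian byte order.

```ruby
# Sketch: compare the two ways of building the replacement fragment.
w = 4
line1 = "\x01\x02\x03\x04" * w        # literal: source encoding (often UTF-8)
line2 = [0x01020304].pack("L<") * w   # pack: ASCII-8BIT, least significant byte first

puts "line1: #{line1.encoding}, #{line1.bytesize} bytes, first 4 = #{line1.bytes.first(4).inspect}"
puts "line2: #{line2.encoding}, #{line2.bytesize} bytes, first 4 = #{line2.bytes.first(4).inspect}"
```

Both fragments have the same byte size, but the byte order differs (01 02 03 04 versus 04 03 02 01), and the encodings will normally differ as well.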
···
On Thu, Jun 23, 2016 at 8:32 AM, Robert Klemme <shortcutter@googlemail.com> wrote:
I believe that if we are seeing performance degradation and possibly
missing functionality, it would be worth raising the issue on the ruby-core
mailing list.
Core developers will be able to discuss it properly and address any issue
that may exist.
Contributions like yours are very welcome.
If it turns out not to be a bug but rather a desired feature, the issue
will be closed; otherwise, action will be taken to resolve it.
Feel free to interact directly with Ruby core for questions of this
level.
They are very friendly, and even Matz himself replies very often to questions
like yours.
Happy ruby coding,
Daniel
···
On Thursday, 23 June 2016, Sophie Wellow <sophie.wellow@gmail.com> wrote:
On 1.9.3, in your case, I get:
                 user     system      total        real
fill1        0.080000   0.000000   0.080000 (  0.073460)
fill2       61.350000   0.260000  61.610000 ( 61.977526)
On Thu, Jun 23, 2016 at 11:39 AM, Sophie Wellow <sophie.wellow@gmail.com> wrote:
You're asking for almost a Gig of memory there...
On Thu, Jun 23, 2016 at 11:29 AM, A Berger <aberger7890@gmail.com> wrote:
You do not have to change the encoding of the whole file. For your
case it should be sufficient to just force the encoding for the
strings that you use:
irb(main):001:0> s = "a"
=> "a"
irb(main):002:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s.force_encoding 'BINARY'
=> "a"
irb(main):004:0> s.encoding
=> #<Encoding:ASCII-8BIT>
Alternatively:
irb(main):005:0> s.force_encoding Encoding::BINARY
=> "a"
irb(main):006:0> s.encoding
=> #<Encoding:ASCII-8BIT>
Btw. can you please bottom post? Thank you!
Kind regards
robert
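Applied to an image-style buffer, the same idea might look like the sketch below. The sizes and variable names are made up for illustration. String#b (available since Ruby 2.0) returns a binary copy, and force_encoding relabels a string in place; with both the buffer and the fragment tagged as binary, character offsets are simply byte offsets.

```ruby
# Sketch: keep both the buffer and the replacement fragment binary.
w = 8        # pixels per row (illustrative value)
h = 4        # rows (illustrative value)
sample = 4   # bytes per pixel

buffer = ("\x00" * (w * h * sample)).b
line   = ("\x01\x02\x03\x04" * w).force_encoding(Encoding::BINARY)

# Overwrite each row with a fragment of exactly the same byte length.
h.times { |y| buffer[y * w * sample, w * sample] = line }

puts buffer.encoding
puts buffer.bytesize   # 128: the buffer did not grow or shrink
```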
···
On Thu, Jun 23, 2016 at 1:26 PM, Sophie Wellow <sophie.wellow@gmail.com> wrote:
I've only ever had occasional use for the # encoding ... fragment, and I
didn't even realise that BINARY was an option.