Question of reference and (sub)strings

Given that I only want to compute the offsets once, an
obvious solution
would be to construct an Array of String - each element
representing a
sub-string of the original... but this would double memory use. What
would be the best way to avoid duplicating the character
sequences and
causing run-time bloat?

I might be wrong - but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn't allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.

···

#####################################################################################
This email has been scanned by MailMarshal, an email content filter.
#####################################################################################

Daniel Sheppard wrote:

Given that I only want to compute the offsets once, an
obvious solution
would be to construct an Array of String - each element
representing a
sub-string of the original... but this would double memory use. What
would be the best way to avoid duplicating the character
sequences and
causing run-time bloat?

I might be wrong

You're not.

- but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn't allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.

Exactly. It seems this would be the simplest solution.

    robert

I asked about this a week or two ago, as part of my Extended Set Theory thing,
and the impression I got from the answers then was that sub /arrays/ are copy on
write but that sub /strings/ were not.

I wonder how one might confirm this, other than by asking matz or reading the
compiler and libraries ...

···

On Thu, 15 Dec 2005 14:50:59 +0900, "Daniel Sheppard" <daniels@pronto.com.au> wrote:

I might be wrong - but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn't allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.

--
Ron Jeffries
www.XProgramming.com
I'm giving the best advice I have. You get to decide if it's true for you.

Robert Klemme wrote:

I might be wrong
    

You're not.
  

- but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn't allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.
    

Exactly. It seems this would be the simplest solution.

That sounds like absolutely great news - a _very_ pleasant surprise. I couldn't have hoped for anything better. (Thanks!)

I assume that if I do something like:

    # Assume offsets is a pre-computed array of positive integer positions into the String originalstr.
    # with offsets[0]==0 and offsets[-1]==@originalstr.size
    @fields=Array.new (offsets.size-1)
    for i in 1..(offsets.size) do
      # I assume this next line is what is meant by a Ruby sub-string?
      @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]]
    end

... and, assuming that @fields is exposed only as a read-only attribute, that I can assume the memory it consumes to be independent of the length of originalstr and dependent only upon numfields?

While I've no reason to doubt this confirmed answer, by any chance can someone suggest a good way to demonstrate that this is the case without resorting to either using very large strings and looking at VM usage of the interpreter process... or resorting to reviewing the source to Ruby's implementation?

Ron Jeffries wrote:

I might be wrong - but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a new
block of memory isn't allocated to the new String, it references the
same block of memory as the original string - the allocation of a new
block of memory only occurs when one of the strings is modified.
    

I asked about this a week or two ago, as part of my Extended Set Theory thing,
and the impression I got from the answers then was that sub /arrays/ are copy on
write but that sub /strings/ were not.
  

Ah-ha... Hmmm - that suggestion "puts the cat among the pigeons" - as far as I can tell, I can't have an array of bytes represented in contiguous memory... (arrays are always arrays of arbitrarily typed objects - aren't they?) - so if copy on write is limited to arrays, it is back to the drawing board for me. I think arrays of objects would necessarily be copy on write as a consequence of the Ruby model of references... Conversely String values are something entirely different - while substrings could be done 'copy on write' - it would necessarily use an entirely separate mechanism. It is obvious that whole strings behave as if they are "copy on write."

I wonder how one might confirm this, other than by asking matz or reading the
compiler and libraries ...

I'm coming towards the opinion that I will need to "ask Matz" - though I hope to have exhausted all the other avenues before I resort to bugging him in person. I realise that I could (in principle) read the source code for the ruby interpreter - but, ideally, I'd also like to know how it is intended to work (in order that I can be confident that the details will not change in a future release.) I also presume it would take me quite a while to get to grips with the interpreter source.

Just the following should do:

···

On Fri, Dec 16, 2005 at 08:17:40AM +0900, Ron Jeffries wrote:

On Thu, 15 Dec 2005 14:50:59 +0900, "Daniel Sheppard" <daniels@pronto.com.au> > wrote:

>I might be wrong - but I'm pretty sure that substrings in ruby are
>created with copy-on-write. That is, when you take a substring, a new
>block of memory isn't allocated to the new String, it references the
>same block of memory as the original string - the allocation of a new
>block of memory only occurs when one of the strings is modified.

I asked about this a week or two ago, as part of my Extended Set Theory thing,
and the impression I got from the answers then was that sub /arrays/ are copy on
write but that sub /strings/ were not.

I wonder how one might confirm this, other than by asking matz or reading the
compiler and libraries ...

-----------------------------------------------------------------------------
VALUE
rb_str_substr(str, beg, len)
    VALUE str;
    long beg, len;
{
    VALUE str2;

    if (len < 0) return Qnil;
    if (beg > RSTRING(str)->len) return Qnil;
    if (beg < 0) {
  beg += RSTRING(str)->len;
  if (beg < 0) return Qnil;
    }
    if (beg + len > RSTRING(str)->len) {
  len = RSTRING(str)->len - beg;
    }
    if (len < 0) {
  len = 0;
    }
    if (len == 0) {
  str2 = rb_str_new5(str,0,0);
    }
    else if (len > sizeof(struct RString)/2 &&
  beg + len == RSTRING(str)->len && !FL_TEST(str, STR_ASSOC)) {
/* ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ */
  str2 = rb_str_new3(rb_str_new4(str));
  RSTRING(str2)->ptr += RSTRING(str2)->len - len;
  RSTRING(str2)->len = len;
    }
    else {
  str2 = rb_str_new5(str, RSTRING(str)->ptr+beg, len);
    }
    OBJ_INFECT(str2, str);

    return str2;
}
-----------------------------------------------------------------------------

... (optionally) ...

-----------------------------------------------------------------------------
static VALUE
str_new3(klass, str)
    VALUE klass, str;
{
    VALUE str2 = str_alloc(klass);

    RSTRING(str2)->len = RSTRING(str)->len;
    RSTRING(str2)->ptr = RSTRING(str)->ptr;
    RSTRING(str2)->aux.shared = str;
    FL_SET(str2, ELTS_SHARED);
    OBJ_INFECT(str2, str);

    return str2;
}

VALUE
rb_str_new3(str)
    VALUE str;
{
    return str_new3(rb_obj_class(str), str);
}

static VALUE
str_new4(klass, str)
    VALUE klass, str;
{
    VALUE str2 = str_alloc(klass);

    RSTRING(str2)->len = RSTRING(str)->len;
    RSTRING(str2)->ptr = RSTRING(str)->ptr;
    if (FL_TEST(str, ELTS_SHARED)) {
  FL_SET(str2, ELTS_SHARED);
  RSTRING(str2)->aux.shared = RSTRING(str)->aux.shared;
    }
    else {
  FL_SET(str, ELTS_SHARED);
  RSTRING(str)->aux.shared = str2;
    }

    return str2;
}

VALUE
rb_str_new4(orig)
    VALUE orig;
{
    VALUE klass, str;

    if (OBJ_FROZEN(orig)) return orig;
    klass = rb_obj_class(orig);
    if (FL_TEST(orig, ELTS_SHARED) && (str = RSTRING(orig)->aux.shared) && klass == RBASIC(str)->klass) {
  long ofs;
  ofs = RSTRING(str)->len - RSTRING(orig)->len;
  if (ofs > 0) {
      str = str_new3(klass, str);
      RSTRING(str)->ptr += ofs;
      RSTRING(str)->len -= ofs;
  }
    }
    else if (FL_TEST(orig, STR_ASSOC)) {
  str = str_new(klass, RSTRING(orig)->ptr, RSTRING(orig)->len);
    }
    else {
  str = str_new4(klass, orig);
    }
    OBJ_INFECT(str, orig);
    OBJ_FREEZE(str);
    return str;
}

VALUE
rb_str_new5(obj, ptr, len)
    VALUE obj;
    const char *ptr;
    long len;
{
    return str_new(rb_obj_class(obj), ptr, len);
}
-----------------------------------------------------------------------------

--
Mauricio Fernandez

Steve [RubyTalk] wrote:

Robert Klemme wrote:

I might be wrong

You're not.

- but I'm pretty sure that substrings in ruby are
created with copy-on-write. That is, when you take a substring, a
new block of memory isn't allocated to the new String, it
references the same block of memory as the original string - the
allocation of a new block of memory only occurs when one of the
strings is modified.

Exactly. It seems this would be the simplest solution.

That sounds like absolutely great news - a _very_ pleasant surprise. I
couldn't have hoped for anything better. (Thanks!)

I assume that if I do something like:

    # Assume offsets is a pre-computed array of positive integer
positions into the String originalstr.

Care to unveil a bit of the nature of the computation that yields those
indexes?

    # with offsets[0]==0 and offsets[-1]==@originalstr.size
    @fields=Array.new (offsets.size-1)
    for i in 1..(offsets.size) do
      # I assume this next line is what is meant by a Ruby sub-string?
      @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]]
    end

.. and, assuming that @fields is exposed only as a read-only
attribute, that I can assume the memory it consumes to be independent
of the length of originalstr and dependent only upon numfields?

You can help keeping this read only be freezing all strings involved.

While I've no reason to doubt this confirmed answer, by any chance can
someone suggest a good way to demonstrate that this is the case
without resorting to either using very large strings and looking at
VM usage of the interpreter process... or resorting to reviewing the
source to Ruby's implementation?

The only additional method of verification that comes to mind is to ask
Matz. :slight_smile:

Kind regards

    robert

....

Easy for you to say, Maruicio ... but whanging a little C code in there is out
of my current scope. ;->

I'd just like to know what's really going on, and am hoping, unfairly, that
someone out there knows or will figure it out.

Thanks,

···

On Fri, 16 Dec 2005 19:33:16 +0900, Mauricio Fernandez <mfp@acm.org> wrote:

ust the following should do:

-----------------------------------------------------------------------------
VALUE
rb_str_substr(str, beg, len)
   VALUE str;
   long beg, len;
{
   VALUE str2;
... and so on ...

--
Ron Jeffries
www.XProgramming.com
I'm giving the best advice I have. You get to decide if it's true for you.

I'm with you ... when you find out, please be sure to let me know!

···

On Fri, 16 Dec 2005 19:09:17 +0900, "Steve [RubyTalk]" <steve_rubytalk@shic.co.uk> wrote:

I'm coming towards the opinion that I will need to "ask Matz" - though I
hope to have exhausted all the other avenues before I resort to bugging
him in person.

--
Ron Jeffries
www.XProgramming.com
I'm giving the best advice I have. You get to decide if it's true for you.

Robert Klemme wrote:

    # Assume offsets is a pre-computed array of positive integer
positions into the String originalstr.
    

Care to unveil a bit of the nature of the computation that yields those
indexes?
  

It's not really relevant to the question I was asking - but I've no problem saying more. I've a domain specific (order-preserving and extensible) 'type-system' which is imposed over otherwise opaque data structures. Given an instance of a 'type-signature' and a pointer, it is possible to determine the number of bytes which represent each 'typed value' - and (significantly) list construction drops out as being the concatenation of the value representations and type-signatures. The type signatures range in complexity from the simplest constant 'N-bytes interpreted as a natural number' through sentinel encodings (Null terminated strings on steroids) and (in principle - if not frequently in practice) arbitrary computation ranging over named integer values occurring 'earlier' in the list.
At the moment I'm toying with the idea that I can memory-map the values (using a C-implemented module) and do the computations on the mapped values in Ruby - having presented opaque values and 'type-signatures' as String objects to Ruby. I expect that typical computations may involve matching regular expressions; doing arithmetic; computing various hashes and summations etc. At the moment I'm concentrating on establishing if Ruby is a suitable tool for the task at hand.

    # with offsets[0]==0 and offsets[-1]==@originalstr.size
    @fields=Array.new (offsets.size-1)
    for i in 1..(offsets.size) do
      # I assume this next line is what is meant by a Ruby sub-string?
      @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]]
    end

.. and, assuming that @fields is exposed only as a read-only
attribute, that I can assume the memory it consumes to be independent
of the length of originalstr and dependent only upon numfields?
    

You can help keeping this read only be freezing all strings involved.
  

Yes - that sounds a good idea to me.

While I've no reason to doubt this confirmed answer, by any chance can
someone suggest a good way to demonstrate that this is the case
without resorting to either using very large strings and looking at
VM usage of the interpreter process... or resorting to reviewing the
source to Ruby's implementation?
    

The only additional method of verification that comes to mind is to ask
Matz. :slight_smile:
  

Hmmm - a lack of profiling tools might prove something of a stumbling block... I'll need to have a careful think about that. Rather than wanting to check up on fellow Rubyists, I really want to periodically check that I make no invalid assumptions as I work forwards from this basis towards an implementation. I don't want to find out only after I think I've finished that a resource leak or extravagant resource demands will require a re-write before the software can be used against real data.

Steve

Steve [RubyTalk] wrote:

Robert Klemme wrote:

    # Assume offsets is a pre-computed array of positive integer
positions into the String originalstr.

Care to unveil a bit of the nature of the computation that yields
those indexes?

It's not really relevant to the question I was asking - but I've no
problem saying more. I've a domain specific (order-preserving and
extensible) 'type-system' which is imposed over otherwise opaque data
structures. Given an instance of a 'type-signature' and a pointer, it
is possible to determine the number of bytes which represent each
'typed value' - and (significantly) list construction drops out as
being the concatenation of the value representations and
type-signatures. The
type signatures range in complexity from the simplest constant
'N-bytes interpreted as a natural number' through sentinel encodings
(Null terminated strings on steroids) and (in principle - if not
frequently in practice) arbitrary computation ranging over named
integer values occurring 'earlier' in the list.
At the moment I'm toying with the idea that I can memory-map the
values (using a C-implemented module) and do the computations on the
mapped values in Ruby - having presented opaque values and
'type-signatures' as String objects to Ruby. I expect that typical
computations may involve matching regular expressions; doing
arithmetic; computing various hashes and summations etc. At the
moment I'm concentrating on establishing if Ruby is a suitable tool
for the task at hand.

Certainly. I don't know now complex your type calculations are. I'd
probably do something along the line of:

class SpecialData

  def initialize(s)
    @data = s.frozen? ? s : s.dup.freeze
    @meta_data = calculate_meta(@data)
  end

  def size() @meta_data.size end

  def get(field_index)
    @meta_data[field_index].extract(@data)
  end

private

  def calculate_meta(data)
    md =
    # your complex calculation here, sample:
    md << StringMeta.new(0..-1)
    md
  end

  class BaseMeta
    attr_accessor :range
    def initialize(range) @range = range end
    def base_extract(str) str[self.range].freeze end
  end

  class StringMeta < BaseMeta
    alias :extract :base_extract
  end

  class IntMeta < BaseMeta
    def extract(s) base_extract(s).unpack("i")[0] end
  end
end

    # with offsets[0]==0 and offsets[-1]==@originalstr.size
    @fields=Array.new (offsets.size-1)
    for i in 1..(offsets.size) do
      # I assume this next line is what is meant by a Ruby
      sub-string?
    @fields[i-1]=@originalstr[offsets[i-1]..offsets[i]] end

.. and, assuming that @fields is exposed only as a read-only
attribute, that I can assume the memory it consumes to be
independent of the length of originalstr and dependent only upon
numfields?

You can help keeping this read only be freezing all strings involved.

Yes - that sounds a good idea to me.

While I've no reason to doubt this confirmed answer, by any chance
can someone suggest a good way to demonstrate that this is the case
without resorting to either using very large strings and looking at
VM usage of the interpreter process... or resorting to reviewing the
source to Ruby's implementation?

The only additional method of verification that comes to mind is to
ask Matz. :slight_smile:

Hmmm - a lack of profiling tools might prove something of a stumbling
block... I'll need to have a careful think about that. Rather than
wanting to check up on fellow Rubyists, I really want to periodically
check that I make no invalid assumptions as I work forwards from this
basis towards an implementation. I don't want to find out only after
I think I've finished that a resource leak or extravagant resource
demands will require a re-write before the software can be used
against real data.

Then just look at the sources.

Kind regards

    robert