Symbols and frozen strings

I just had a thought.

One of the problems with using strings as hash keys is that every time you
refer to them, you create a throw-away garbage string:

    params["id"]
            ^
            +-- temporary string, needs to be garbage collected

In Rails you have HashWithIndifferentAccess, but this actually isn't any
better. Although you write params[:id], when executed the symbol is
converted to a string anyway.
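
A minimal sketch of why the symbol form saves nothing. The IndifferentHash class below is hypothetical and is not Rails's actual implementation; it only assumes the same idea of normalizing every key to a string before the lookup:

```ruby
# Hypothetical stand-in for an indifferent-access hash (not the Rails code).
class IndifferentHash
  def initialize
    @data = {}
  end

  def []=(key, value)
    @data[normalize(key)] = value
  end

  def [](key)
    @data[normalize(key)]
  end

  private

  # A symbol key is turned into a (new) string on every access.
  def normalize(key)
    key.is_a?(Symbol) ? key.to_s : key
  end
end

params = IndifferentHash.new
params["id"] = 42
by_symbol = params[:id]    # goes through to_s first
by_string = params["id"]
puts by_symbol == by_string
```

So writing params[:id] still pays for a string on each lookup; the symbol is only a spelling convenience.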

In a Rails-like scenario, using symbols as the real keys within the hash
doesn't work: the keys come from externally parsed data, which means (a)
they were strings originally, and (b) if you converted them to symbols you'd
risk a symbol exhaustion attack.

So I thought, wouldn't it be nice to have a half-way house: being able to
convert a symbol to a string, in such a way that you always got the same
(frozen) string object?

This turned out to be extremely easy:

class Symbol
  def fring
    @fring ||= to_s.freeze
  end
end

irb(main):006:0> :foo.fring
=> "foo"
irb(main):007:0> :foo.fring.object_id
=> -605512686
irb(main):008:0> :foo.fring.object_id
=> -605512686
irb(main):009:0> :bar.fring
=> "bar"
irb(main):010:0> :bar.fring.object_id
=> -605543036
irb(main):011:0> :bar.fring.object_id
=> -605543036
irb(main):012:0> :bar.fring << "x"
TypeError: can't modify frozen string
        from (irb):12:in `<<'
        from (irb):12

Is this a well-known approach, and/or does it exist in any extension
library?

I suppose that an instance variable lookup isn't necessarily faster than
always creating a temporary string with to_s and then garbage collecting it
at some point later in time, but it feels like it ought to be :)

However, since I've seen discussion about string modifiers like "..."u,
perhaps there's scope for adding in-language support, e.g.

    "..."f - frozen string, same object ID each time it's executed

In that case, it might be more convenient the other way round:

   "..." - frozen string literal, same object
   "..."m - mutable (unfrozen) string literal, new objects
   String.new("...") - another way of making a mutable string
   "...".dup - and another

That would break a lot of existing code, but it could be pragma-enabled.

Sorry if this ground has been covered before - it's hard to keep up with
ruby-talk :)

Regards,

Brian.

Hi,

At Thu, 6 Sep 2007 16:50:28 +0900,
Brian Candler wrote in [ruby-talk:267857]:

> So I thought, wouldn't it be nice to have a half-way house: being able to
> convert a symbol to a string, in such a way that you always got the same
> (frozen) string object?

Rather, Symbol#to_s should return frozen String?

> I suppose that an instance variable lookup isn't necessarily faster than
> always creating a temporary string with to_s and then garbage collecting it
> at some point later in time, but it feels like it ought to be :)
>
> However, since I've seen discussion about string modifiers like "..."u,
> perhaps there's scope for adding in-language support, e.g.
>
>     "..."f - frozen string, same object ID each time it's executed

What about "..."o like Regexp?

--
Nobu Nakada

> Rather, Symbol#to_s should return frozen String?

Yes, as long as it returns the same frozen string each time.

Hmm, this sounds like a good solution - it's technically not
backwards-compatible, but I doubt that much code does a Symbol#to_s and
later mutates it.

> What about "..."o like Regexp?

Sure, I don't mind about the actual syntax.

Of course, you don't even need to add 'o' to a Regexp in the case where it
doesn't contain any #{...} interpolation:

irb(main):001:0> RUBY_VERSION
=> "1.8.4"
irb(main):002:0> 3.times { puts /foo/.object_id }
-605554606
-605554606
-605554606
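
The same point as a script, as a small sketch. The variable names are illustrative only; it compares a plain literal, an interpolated literal, and an interpolated literal with the 'o' flag:

```ruby
word = "foo"

# A literal without interpolation is built once; each pass yields the same object.
plain = Array.new(3) { /foo/ }

# With interpolation, the Regexp is rebuilt each time the literal is evaluated...
dynamic = Array.new(3) { /#{word}/ }

# ...unless the 'o' flag asks for the interpolation to be performed only once.
once = Array.new(3) { /#{word}/o }

puts plain.map(&:object_id).uniq.size
puts dynamic.map(&:object_id).uniq.size
puts once.map(&:object_id).uniq.size
```

A hypothetical "..."o for strings would behave like the third case: evaluated once, same object afterwards.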

Regards,

Brian.

Brian Candler wrote:

> I just had a thought.
>
> One of the problems with using strings as hash keys is that every time you
> refer to them, you create a throw-away garbage string:
>
>     params["id"]
>             ^
>             +-- temporary string, needs to be garbage collected

Setting aside the question of freezing, why can't ruby share string data for all strings generated from the same symbol? And in that case you could do the following to avoid garbage:

      params[:id.to_s]

(Or ruby could even look up the literal "id" in the symbol table and do this for you.)

This code shows some of the cases in which ruby does and does not share string contents:

def show_vmsize
   GC.start
   puts `ps -o vsz #$$`[/\d+/]
end

s = "a"*1000
sym = s.to_sym

show_vmsize # 8712

# ruby apparently does not share storage for strings derived
# from the same symbol:

strs1 = (0..10_000).map do
   sym.to_s
end

show_vmsize # 18488

# ruby does share storage for string ops:

strs2 = (0..10_000).map do
   s[0..-1]
end

show_vmsize # 18616

strs3 = (0..10_000).map do
   s.dup
end

show_vmsize # 18616

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Absolutely.

Whether you need to care about this, though, depends on how often your code is building these throwaway strings, and on just how much you really need to neurotically performance tweak your code.

What I do to deal with this in code where I consider it important is to use constants that contain frozen strings.

Id = 'id'.freeze

params[Id]

Constant lookup isn't the fastest thing in Ruby, but it's faster than the combined load of creating the throwaway string object, and then garbage collecting it.
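
A short sketch of the constant approach, checking only object identity and making no timing claims. ID_KEY is an illustrative name, not something from the code being discussed:

```ruby
# One frozen string, created once and reused for every lookup.
ID_KEY = "id".freeze

params = { "id" => 42 }
value = params[ID_KEY]

# Referring to the constant repeatedly always yields the same object.
key_ids = Array.new(3) { ID_KEY.object_id }

puts value
puts key_ids.uniq.size
```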

Kirk Haines

On Thu, 6 Sep 2007, Brian Candler wrote:

> I just had a thought.
>
> One of the problems with using strings as hash keys is that every time you
> refer to them, you create a throw-away garbage string:
>
>    params["id"]
>            ^
>            +-- temporary string, needs to be garbage collected

I just had a thought.

<snip>

> However, since I've seen discussion about string modifiers like "..."u,
> perhaps there's scope for adding in-language support, e.g.
>
>     "..."f - frozen string, same object ID each time it's executed
>
> In that case, it might be more convenient the other way round:
>
>    "..." - frozen string literal, same object
>    "..."m - mutable (unfrozen) string literal, new objects
>    String.new("...") - another way of making a mutable string
>    "...".dup - and another

Rubinius has a compiler extension that detects code in the form of

  "name".static

Inside the quotes can be any string, and the static method call is removed,
but every time the code is run, the same String object is returned. This is
highly useful when using strings as hash keys, and avoids having to put them
in constants that must be looked up later.
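
For comparison, a rough run-time approximation in MRI. This is not the Rubinius mechanism, which works at compile time. The sketch assumes Ruby 2.5 or later, where String#-@ returns a frozen, deduplicated string; the method name lookup_key is made up for the example:

```ruby
# Assumes Ruby >= 2.5: -"..." yields a frozen, deduplicated string.
def lookup_key
  -"name"
end

first = lookup_key
second = lookup_key

puts first.equal?(second)   # same object on every call
puts first.frozen?
```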

> That would break a lot of existing code, but it could be pragma-enabled.
>
> Sorry if this ground has been covered before - it's hard to keep up with
> ruby-talk :)
>
> Regards,
>
> Brian.

- Evan Phoenix

On Sep 6, 12:50 am, Brian Candler <B.Cand...@pobox.com> wrote:

I've tried that. There are some places where it blows up Ruby. So
those would have to be rooted out first.

T.

On Sep 6, 5:10 am, Brian Candler <B.Cand...@pobox.com> wrote:

> > Rather, Symbol#to_s should return frozen String?
>
> Yes, as long as it returns the same frozen string each time.
>
> Hmm, this sounds like a good solution - it's technically not
> backwards-compatible, but I doubt that much code does a Symbol#to_s and
> later mutates it.

Joel VanderWerf wrote:

Brian Candler wrote:

>> I just had a thought.
>>
>> One of the problems with using strings as hash keys is that every time you
>> refer to them, you create a throw-away garbage string:
>>
>>     params["id"]
>>             ^
>>             +-- temporary string, needs to be garbage collected
>
> Setting aside the question of freezing, why can't ruby share string data for all strings generated from the same symbol? And in that case you could do the following to avoid garbage:
>
>      params[:id.to_s]

Sorry... _reduce_ garbage, not avoid it altogether, since there is still the T_STRING, even though the data is reused. It would help more for long strings than for short strings, because the T_DATA is smaller in proportion.

The idea of a literal for a unique frozen string would reduce garbage further, sharing the T_STRING as well as the data.
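
A small sketch of the distinction, seen from Ruby code. Whether dup shares the underlying buffer is an interpreter detail that this code cannot observe; what it can show is that dup always yields a new object, while handing out one frozen string yields no new objects at all:

```ruby
base = ("a" * 1000).freeze

# dup: a new String object each time (the character data may be shared internally).
copies = Array.new(3) { base.dup }

# Reusing the frozen string: no new objects, just references to the same one.
refs = Array.new(3) { base }

puts copies.map(&:object_id).uniq.size
puts refs.map(&:object_id).uniq.size
puts copies.first.frozen?   # dup does not carry over the frozen state
```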

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

> Setting aside the question of freezing, why can't ruby share string data
> for all strings generated from the same symbol?

Because it could generate unexpected aliasing. The normal, expected
behaviour is no aliasing:

irb(main):001:0> a = :foo.to_s
=> "foo"
irb(main):002:0> b = :foo.to_s
=> "foo"
irb(main):003:0> b << "bar"
=> "foobar"
irb(main):004:0> a
=> "foo"

That's why the string has to be frozen.
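
To make the hazard concrete, here is a sketch of what sharing one unfrozen string object would look like. Both caches are hypothetical helpers, not anything in Ruby itself:

```ruby
# Hypothetical cache that hands out ONE mutable string per symbol (unsafe).
mutable_cache = Hash.new { |cache, sym| cache[sym] = String.new(sym.to_s) }

a = mutable_cache[:foo]
b = mutable_cache[:foo]
b << "bar"    # mutates the shared object
puts a        # a changed as well, because a and b are the same object

# The same cache with frozen strings: sharing is safe, mutation is refused.
frozen_cache = Hash.new { |cache, sym| cache[sym] = sym.to_s.freeze }

error = begin
  frozen_cache[:foo] << "bar"
  nil
rescue StandardError => e   # TypeError on 1.8, FrozenError on current Rubies
  e
end
puts error.class
```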

Regards,

Brian.

Joel VanderWerf wrote:

> show_vmsize # 8712
>
> # ruby apparently does not share storage for strings derived
> # from the same symbol:
>
> strs1 = (0..10_000).map do
>   sym.to_s
> end
>
> show_vmsize # 18488
>
> # ruby does share storage for string ops:
>
> strs2 = (0..10_000).map do
>   s[0..-1]
> end

Hmm, we could use that property of strings...

   class Symbol
     alias _to_s to_s
     def to_s
       (@str || @str = _to_s)[0..-1]
     end
   end

Daniel

I always prefer less intrusive solutions. Why not do this:

SYMS = Hash.new {|h,sy| h[sy]=sy.to_s}

Then, wherever you need this, just do "SYMS[a_sym]" instead of
"a_sym.to_s". Added advantage: you can throw away or clear SYMS when
you know you do not need it any more, thus freeing up memory.
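
A sketch of how such a cache might be used. It adds a freeze, which is an assumption beyond the one-liner above, so that the shared strings cannot be mutated by accident. SYM_STRINGS is an illustrative name:

```ruby
# Maps each symbol to one frozen string, created on first use.
SYM_STRINGS = Hash.new { |cache, sym| cache[sym] = sym.to_s.freeze }

params = { "id" => 42 }

first = SYM_STRINGS[:id]
second = SYM_STRINGS[:id]
value = params[SYM_STRINGS[:id]]

puts first.equal?(second)   # the cached string is reused
puts value

# Dropping the cached strings when they are no longer needed.
SYM_STRINGS.clear
puts SYM_STRINGS.size
```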

Kind regards

robert

2007/9/6, Trans <transfire@gmail.com>:

On Sep 6, 5:10 am, Brian Candler <B.Cand...@pobox.com> wrote:
> > Rather, Symbol#to_s should return frozen String?
>
> Yes, as long as it returns the same frozen string each time.
>
> Hmm, this sounds like a good solution - it's technically not
> backwards-compatible, but I doubt that much code does a Symbol#to_s and
> later mutates it.

I've tried that. There are some places where it blows up Ruby. So
those would have to be rooted out first.

Joel VanderWerf wrote:

> Sorry... _reduce_ garbage, not avoid it altogether, since there is still the T_STRING, even though the data is reused. It would help more for long strings than for short strings, because the T_DATA is smaller in proportion.

Sorry again... I don't know where T_DATA came from. Should be T_STRING, the constant-size overhead for a string object. Will stop posting until caffeine hits.

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Brian Candler wrote:

>> Setting aside the question of freezing, why can't ruby share string data for all strings generated from the same symbol?
>
> Because it could generate unexpected aliasing. The normal, expected
> behaviour is no aliasing:
>
> irb(main):001:0> a = :foo.to_s
> => "foo"
> irb(main):002:0> b = :foo.to_s
> => "foo"
> irb(main):003:0> b << "bar"
> => "foobar"
> irb(main):004:0> a
> => "foo"

This was what I was thinking of:

irb(main):001:0> a = :foo.to_s
=> "foo"
irb(main):002:0> b = a.dup
=> "foo"
irb(main):003:0> b << "bar"
=> "foobar"
irb(main):004:0> a
=> "foo"

Internally, a and b use the same storage, but copy-on-write prevents aliasing.

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407