PSA: String memory use reduction techniques

Hopefully some of you are trying and enjoying Ruby 2.5 by now. I figured I’d write about some changes to C Ruby over the years which make it easier to reduce memory use.

String objects are often to blame for high memory usage in Ruby applications. High memory usage limits scalability and hurts performance by increasing memory traffic (GC overhead and general access times).

Frozen string literals have been proposed as the default in Ruby 3, but I remain against them for compatibility reasons. Meanwhile, Ruby has gained some transparent optimizations along with some syntactic improvements to help programmers reduce overhead further.

The String#-@ method was introduced back in Ruby 2.3 as syntactic sugar for making frozen strings more succinctly than String#freeze:

# https://bugs.ruby-lang.org/issues/11782
-"this string is frozen" # became equivalent to:
"this string is frozen".freeze # from Ruby 2.2 and earlier

Starting with Ruby 2.5, the same String#-@ method will deduplicate non-frozen strings:

# https://bugs.ruby-lang.org/issues/13077
original = -"this string is frozen"
dynamic = -%w(this string is frozen).join(' ')
# original.object_id == dynamic.object_id

Furthermore, writing -"literal" avoids allocation in the first place in 2.5, just like "literal".freeze since Ruby 2.1. So, if your code only needs to support Ruby 2.3+, you can start using String#-@ and your 2.5 users can benefit from more optimizations without relying on more fragile file-wide (or process-wide) frozen string literals.
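As a rough sketch of where this helps on 2.5+, consider strings that are built at runtime rather than written as literals (parsed field names, for instance). The `intern_field` helper and the sample input are made up for illustration; only String#-@ itself comes from Ruby.

```ruby
# Hypothetical helper: return the frozen, deduplicated copy of a
# field name that was built at runtime (String#-@ dedups on 2.5+).
def intern_field(raw)
  -(raw.strip.downcase)
end

lines = ["Content-Type: text/html", "content-type: text/plain"]
names = lines.map { |line| intern_field(line.split(":", 2).first) }

# Both names were built separately, yet they end up as one object.
same_object = names[0].equal?(names[1])
all_frozen  = names.all?(&:frozen?)
```

The parentheses are not strictly needed, since unary minus applies to the result of the whole method chain, but they make the intent easier to read.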

However, there are several places where you do not need to worry about allocations because the VM does it for you!

Hash keys

When given a non-frozen String as a hash key, Ruby transparently duplicates and freezes the key to avoid data corruption in case the original string is mutated [ruby-core:35410].
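A small sketch of what that looks like from the Ruby side (variable names are only for illustration): the hash ends up holding its own frozen key, so mutating the original string afterwards cannot corrupt the table.

```ruby
key = String.new("name")   # an explicitly mutable string
h = {}
h[key] = 1

stored = h.keys.first
stored_is_frozen = stored.frozen?       # the hash holds a frozen key
stored_is_copy   = !stored.equal?(key)  # and it is not our original object

key << "_changed"                       # mutating the original is harmless
lookup_after_mutation = h["name"]       # the entry is still reachable
```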

In the old days, frozen constants were used in some code bases (e.g. mongrel) to reduce overhead from common hash keys. This practice lives on in some places, but is no longer necessary for the majority of cases. In fact, unnecessarily referencing constants adds some memory overhead in the bytecode for inline caching.

Since Ruby 2.1, using a string literal with Hash#[] or Hash#[]=, or creating a hash literal, does not allocate new memory for keys.

In other words, there’s no benefit in writing any of the following:

foo = { "key".freeze => nil } # unnecessary freeze
foo["a".freeze] = true        # unnecessary freeze
foo["b".freeze]               # unnecessary freeze

They are equivalent to the following, in all versions of Ruby:

foo = { "key" => nil }
foo["a"] = true
foo["b"]

Note: this optimization does not apply to Hash subclasses.
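If you want to check this on your own Ruby build, something like the following can help. It is only a sketch: `count_new_strings` is a hypothetical helper, it relies on the C Ruby specific ObjectSpace.count_objects, and the exact numbers will vary between versions, so treat them as indicative.

```ruby
# Hypothetical helper: how many T_STRING objects appear while the
# block runs. GC is disabled during the count so nothing is reclaimed.
def count_new_strings
  GC.start      # finish any pending sweep first
  GC.disable
  before = ObjectSpace.count_objects[:T_STRING]
  yield
  ObjectSpace.count_objects[:T_STRING] - before
ensure
  GC.enable
end

foo = { "key" => nil }

with_literal = count_new_strings do
  1_000.times { foo["key"] }          # plain literal key, no .freeze
end

with_dynamic = count_new_strings do
  1_000.times { |i| foo["key#{i}"] }  # builds a new String every time
end

puts "literal key: #{with_literal} new strings"
puts "dynamic key: #{with_dynamic} new strings"
```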

Furthermore, starting with Ruby 2.5, all untainted Strings used as Hash keys are transparently deduplicated to the frozen copy as long as there’s an identical reference to it in the source code.

Unfortunately, this does not help with tainted strings which come from most parsers, yet. But since hardly anybody cares about tainting in keys or at all, I’ve proposed to have it removed in 2.6: https://bugs.ruby-lang.org/issues/14225

case/when statements

String literals in case/when clauses have been transparently frozen since Ruby 1.9.3, and deduplicated since Ruby 2.1: https://bugs.ruby-lang.org/issues/5000
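In practice this means plain literals in `when` clauses are fine as written, with no constants or .freeze calls. A trivial sketch (the `classify` method is hypothetical):

```ruby
# Hypothetical dispatcher: the literals in the when clauses need
# neither .freeze nor frozen constants.
def classify(verb)
  case verb
  when "GET", "HEAD" then :safe
  when "POST", "PUT", "DELETE" then :unsafe
  else :unknown
  end
end
```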

Semi-automatic memory management

(Perhaps a controversial topic)

String#clear has existed since Ruby 1.9.1 and immediately releases memory allocated from malloc(3). I use this to reduce memory pressure and improve locality when working with large buffers.
In the C source code of Ruby, you will also find many uses of rb_str_resize(str, 0) to clear buffers.
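A minimal sketch of the effect, using ObjectSpace.memsize_of from the objspace extension to peek at the string's footprint. The one-megabyte string is just a stand-in for a real I/O buffer, and the reported sizes are implementation details of C Ruby.

```ruby
require "objspace"

buf = "x" * 1_000_000                  # stand-in for a large I/O buffer
size_before = ObjectSpace.memsize_of(buf)

# ... use buf here (write it to a socket, checksum it, etc.) ...

buf.clear                              # drop the contents now, not at GC time
size_after = ObjectSpace.memsize_of(buf)

puts "before clear: #{size_before} bytes, after clear: #{size_after} bytes"
```

The String object itself still lives until the GC collects it; only the large backing storage goes away early.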

I don’t know if this can be improved for out-of-the-box Ruby users; and I don’t know how some Rubyists feel about uglifying code to reduce resource usage.

That’s all I can think of for now, thanks for reading.


Thank you.

Great stuff thanks Eric!

···

On 1/2/18 8:52 PM, Eric Wong wrote:

Hopefully some of you are trying and enjoying Ruby 2.5 by now.
I figured I'd write about some changes to C Ruby over the years
which make it easier to reduce memory use.

Semi-automatic memory management
--------------------------------

(Perhaps a controversial topic)

String#clear exists since Ruby 1.9.1 and immediately releases
memory allocated from malloc(3). I use this to reduce memory
pressure and improve locality when working with large buffers.
In the C source code of Ruby, you will also find many uses of
"rb_str_resize(str, 0)" to clear buffers.

I don't know if this can be improved for out-of-the-box Ruby
users; and I don't know how some Rubyists feel about uglifying
code to reduce resource usage.

Uglifying - gosh! This could be beautified by something like

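class String
  # Yield self to the block, then clear the buffer once the block
  # exits, even if it raises.
  def auto_clear
    result = yield self
    result.equal?(self) ? nil : result # avoid leaking the cleared string
  ensure
    clear
  end
end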

Then people can do this with Strings whose lifetime they know.

expr_returning_string.auto_clear do |str|
  puts "We got #{str}."
end

In fact, the pattern could be generalized. Not sure though how useful this is.

That's all I can think of for now, thanks for reading.

Thank you for the writeup!

Kind regards

robert

···

On Wed, Jan 3, 2018 at 2:52 AM, Eric Wong <e@80x24.org> wrote:


Seems a bit verbose compared to buf.clear... I'd rather go
farther and reduce object counts, too (not just object sizes);
but that requires API changes.

So I'd like to see stuff like readpartial with the second
(outbuf) arg used more. Maybe each_line/gets could gain an
`outbuf' arg, too. But then again, maybe not many people use
readpartial with the `outbuf' arg, either.
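For anyone who has not used the outbuf form, here is a small sketch. The pipe and the 4096-byte chunk size are arbitrary choices for the example; the point is that IO#readpartial fills the same String on every call instead of returning a new one.

```ruby
r, w = IO.pipe
w.write("a" * 10_000)
w.close

buf = String.new   # one buffer, reused for every read
total = 0
ids = []

begin
  loop do
    r.readpartial(4096, buf)   # fills buf in place
    total += buf.bytesize
    ids << buf.object_id       # always the same object
  end
rescue EOFError
  # end of stream reached
ensure
  r.close
end

buf.clear                      # release the storage once we are done
```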

Anyways, I've been working on a few things the past few days
in stdlib to make things better. At least #14315 and #14320
won't require end users to call String#clear themselves.

···

Robert Klemme <shortcutter@googlemail.com> wrote:

Then people can do this with Strings whose lifetime they know.

expr_returning_string.auto_clear do |str|
  puts "We got #{str}."
end

In fact, the pattern could be generalized. Not sure though how
useful this is.

Wow! Can we expect these optimizations to be released with 2.5.1?

These benchmarks look like very dramatic reductions for memory usage on widely used stdlib modules: care to speculate on any real-world impact, e.g. for rails applications?

I'm also wondering if there are more stdlib modules that could benefit from similar improvements to string buffers, for example csv, json, erb, yaml...

···

On 2018-01-05 15:11, Eric Wong wrote:

Anyways, I've been working on a few things the past few days
in stdlib to make things better. At least #14315 and #14320
won't require end users to call String#clear themselves.

Feature #14268: [PATCH] net/protocol: optimize large read case
Feature #14315: zlib: reduce garbage on gzip writes (deflate)
Feature #14319: [PATCH] zlib: reduce garbage on Zlib::GzipReader#readpartial
Feature #14320: [PATCH] open-uri: clear string after buffering

Wow! Can we expect these optimizations to be released with 2.5.1?

Unlikely, but zlib is split off and you might be able to upgrade
it independently.

These benchmarks look like very dramatic reductions for memory usage on
widely used stdlib modules: care to speculate on any real-world impact, e.g.
for rails applications?

Sorry, tough to say; it depends on the large strings you're
dealing with. Zlib::GzipWriter is used by Rack::Deflate so
that one might have the biggest impact if you're deflating
in Rack (rather than the reverse proxy).

I've been sprinkling String#clear and reusing strings in my own
code for years, now; but I'm not sure how well it'd be received
on a larger scale. Starting with 2.5, Net::HTTP users can
safely #clear the buffer yielded by read_body; at least.

These uses of String#clear feel like whack-a-mole, though;
and everything in the code path needs to be written in a
memory-aware way to see the big benefits in those tickets.

I'm also wondering if there are more stdlib modules that could benefit from
similar improvements to string buffers, for example csv, json, erb, yaml...

For large buffers, I think clearing the result of File.read
after loading, and clearing the rendered/dumped results
after writing on the user's side should help.
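Roughly, on the user's side that could look like the sketch below. JSON and a temporary file are only stand-ins here; the same idea applies to any parser or renderer that hands back one big String.

```ruby
require "json"
require "tempfile"

raw = data = dumped = nil

Tempfile.create("payload") do |f|
  f.write(JSON.generate({ "items" => (1..1000).to_a }))
  f.flush

  raw  = File.read(f.path)   # whole file in one String
  data = JSON.parse(raw)     # the parsed structure no longer needs raw
  raw.clear                  # so release the source buffer early

  dumped = JSON.generate(data)
  File.write(f.path, dumped)
  dumped.clear               # same idea for the rendered output
end
```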

I would also like to see more/better streaming interfaces in
Ruby; things like File.read are scary in the wrong hands and
I prefer the stream-everything mentality of awk/sed in shell
programming.

yaml and json might benefit from small string reductions in
Feature #14225 (untaint hash key strings), too.

···

Andrew Vit <andrew@avit.ca> wrote:

On 2018-01-05 15:11, Eric Wong wrote:

> These benchmarks look like very dramatic reductions for memory usage on
> widely used stdlib modules: care to speculate on any real-world impact, e.g.
> for rails applications?

Fwiw, I would remain pessimistic about real-world visibility.
Bug #13085 (io.c io_fwrite creates garbage) was big for me last
year, but it was disappointing that nobody else seemed to notice
the regression.

I wrote:

and everything in the code path needs to be written in a
memory-aware way to see the big benefits in those tickets.

So yeah, I think there needs to be a giant shift in mentality
throughout the Ruby world for big improvements to be seen.

I am proud of you guys for not top-posting or (at least most of
you) for not using HTML :)

···

Andrew Vit <andrew@avit.ca> wrote:

So yeah, I think there needs to be a giant shift in mentality
throughout the Ruby world for big improvements to be seen.

Right, I don't think many rubyists commonly use strings as buffers, and the general trend is towards immutable objects inspired by functional style as well. I think these optimizations can still help if gems/libraries can make use of them beneath application code.

I am proud of you guys for not top-posting or (at least most of
you) for not using HTML :)

Haha, it helps to use an old-school mail client!

···

On 2018-01-06 14:17, Eric Wong wrote:

I noticed a big bunch of areas that can use some - love, is anyone
working on it?

https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L343-L358

also the file should have frozen_string_literal: true

···

On Tue, Jan 9, 2018 at 4:44 AM, Andrew Vit <andrew@avit.ca> wrote:

On 2018-01-06 14:17, Eric Wong wrote:

So yeah, I think there needs to be a giant shift in mentality
throughout the Ruby world for big improvements to be seen.

Right, I don't think many rubyists commonly use strings as buffers, and the
general trend is towards immutable objects inspired by functional style as
well. I think these optimizations can still help if gems/libraries can make
use of them beneath application code.

I am proud of you guys for not top-posting or (at least most of
you) for not using HTML :)

Haha, it helps to use an old-school mail client!


I noticed a big bunch of areas that can use some - love, is anyone
working on it?

https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L343-L358

Not at the moment, patches welcome.

also the file should have frozen_string_literal: true

There needs to be a lot more tests written to avoid breakage.
We tried it in the stdlib in a few places and there was a lot
of breakage; so I think we should start with String#-@, first.

···

Sam Saffron <sam.saffron@gmail.com> wrote:

Thanks for sharing it! ;)

···

On 2018-01-03 02:52, Eric Wong wrote:

That's all I can think of for now, thanks for reading.


When given a non-frozen String as a hash key, Ruby transparently
duplicates and freezes the key to avoid data corruption
in case the original string is mutated [ruby-core:35410].

Sort of, I mean it works exactly as designed but has some subtle
issues that may lead to surprises.

x = {}
x["#{1}"] = 1
x["2"] = 1

y = {}
y["#{1}"] = 1
y["2"] = 1

puts x.keys.map{|k| "#{k} #{k.object_id}"}
puts y.keys.map{|k| "#{k} #{k.object_id}"}

1 70113575731380
2 70113575732160
1 70113575731280
2 70113575732160

So here we can see that the string "1" was frozen twice and not de-duped.

Hence:

x[-"#{1}"] = 1

Is the optimal thing to do in the current implementation.

I mention this because I saw similar cases in the common open uri file.
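To make the effect of the unary minus concrete, here is a small sketch. The local `n` is only there to force the key to be built at runtime; whether the version without the minus also dedups depends on the Ruby version, so only the String#-@ form is shown.

```ruby
n = 1
x = {}
y = {}

x[-"#{n}"] = 1   # dynamic key, deduplicated before it reaches the hash
y[-"#{n}"] = 1

# Both hashes now hold the very same frozen String as their key.
shared = x.keys.first.equal?(y.keys.first)
frozen = x.keys.first.frozen?
```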

Yes, that sucks. I kinda wish we could live with the
small slowdown in bm_so_k_nucleotide.rb for
Misc #9188 (r43870 make benchmark/bm_so_k_nucleotide.rb slow)
and not reverted the change that deduped all hash keys.

Might be worth investigating again, now that our hash table
is faster.

···

Sam Saffron <sam.saffron@gmail.com> wrote:

Hence:

x[-"#{1}"] = 1

Is the optimal thing to do in the current implementation.