PSA: String memory use reduction techniques


(Eric Wong) #1

Hopefully some of you are trying and enjoying Ruby 2.5 by now.
I figured I'd write about some changes to C Ruby over the years
which make it easier to reduce memory use.

String objects are often to blame for high memory usage in
Ruby applications. High memory usage limits scalability and
hurts performance by increasing memory traffic (GC overhead and
general access times).

Frozen string literals have been proposed as the default in
Ruby 3 but I remain against them for compatibility.
Meanwhile, Ruby has gained some transparent optimizations
along with some syntactic improvements to help programmers
reduce overheads further.

The String#-@ method was introduced back in Ruby 2.3 as
syntactic sugar for making frozen strings more succinctly
than String#freeze:

  # https://bugs.ruby-lang.org/issues/11782
  -"this string is frozen" # became equivalent to:
  "this string is frozen".freeze # from Ruby 2.2 and earlier

Starting with Ruby 2.5, the same String#-@ method will
deduplicate non-frozen strings:

  # https://bugs.ruby-lang.org/issues/13077
  original = -"this string is frozen"
  dynamic = -%w(this string is frozen).join(' ')
  # original.object_id == dynamic.object_id

Furthermore, writing -"literal" avoids allocation in the first
place in 2.5, just like "literal".freeze since Ruby 2.1.
So, if your code only needs to support Ruby 2.3+, you can
start using String#-@ and your 2.5 users can benefit from
more optimizations without relying on more fragile file-wide
(or process-wide) frozen string literals.
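
As a rough illustration of the deduplication described above, the
sketch below interns repeated field names the way a parser might
produce them. The `rows` data and the variable names are made up for
the example, and it assumes Ruby 2.5 or later:

```ruby
# Hypothetical input: the same field name arrives several times, each time
# as a freshly allocated, unfrozen String (as it would from a parser).
rows = 3.times.map { |i| [String.new("host"), "server-#{i}"] }

# Without deduplication, every row holds its own copy of "host".
raw_ids = rows.map { |name, _| name.object_id }.uniq.size

# String#-@ returns a frozen, deduplicated String, so all rows can share
# a single object (assumes Ruby 2.5+).
interned = rows.map { |name, value| [-name, value] }
interned_ids = interned.map { |name, _| name.object_id }.uniq.size

puts "distinct name objects before: #{raw_ids}, after: #{interned_ids}"
```

The unfrozen originals become garbage once nothing references them, so
the saving only materializes if the interned copies are what you keep.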

However, there are several places where you do not need
to worry about allocations because the VM does it for you!

Hash keys
---------

When given a non-frozen String as a hash key, Ruby transparently
duplicates and freezes the key to avoid data corruption
in case the original string is mutated [ruby-core:35410].
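
A small sketch of that behaviour, using made-up variable names: the
hash stores its own frozen copy of an unfrozen key, so mutating the
original afterwards does not affect lookups.

```ruby
key = String.new("name")   # a dynamically allocated, unfrozen String
h = {}
h[key] = 1

stored = h.keys.first
copied = !stored.equal?(key)   # the hash holds a different object
frozen = stored.frozen?        # and that object is frozen

key << "_changed"              # mutating the original is still allowed
still_found = (h["name"] == 1) # the stored key is unaffected

puts "copied=#{copied} frozen=#{frozen} still_found=#{still_found}"
```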

In the old days, frozen constants were used in some code bases
(e.g. mongrel) to reduce overhead from common hash keys. This
practice lives on in some places, but is no longer necessary for
the majority of cases. In fact, unnecessarily referencing
constants adds some memory overhead in the bytecode for inline
caching.

Since Ruby 2.1, using a string literal with Hash#[] or Hash#[]=,
or as a key in a hash literal, does not allocate new memory for
the key.

In other words, there's no benefit in writing any of the
following:

  foo = { "key".freeze => nil } # unnecessary freeze
  foo["a".freeze] = true # unnecessary freeze
  foo["b".freeze] # unnecessary freeze

They are equivalent to the following, in all versions of Ruby:

  foo = { "key" => nil }
  foo["a"] = true
  foo["b"]

Note: this optimization does not apply to Hash subclasses.
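
One way to check claims like this on your own Ruby is to count object
allocations around a block. The `allocations` helper below is a
hypothetical name, not part of Ruby; it relies on
GC.stat(:total_allocated_objects), which is specific to C Ruby, and the
exact counts may differ between versions:

```ruby
# Hypothetical helper: number of objects allocated while the block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

foo = { "key" => nil }

# Lookup with a string literal versus a String built at run time.
literal = allocations { foo["key"] }
dynamic = allocations { foo[String.new("key")] }

puts "literal lookup: #{literal} allocations, dynamic lookup: #{dynamic}"
```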

Furthermore, starting with Ruby 2.5, all untainted Strings used
as Hash keys are transparently deduplicated to the frozen copy as
long as there's an identical reference to it in the source code.

Unfortunately, this does not help with tainted strings which
come from most parsers, yet. But since hardly anybody cares
about tainting in keys or at all, I've proposed to have it
removed in 2.6:

  https://bugs.ruby-lang.org/issues/14225

case/when statements
--------------------

String literals in case/when clauses have been transparently
frozen since Ruby 1.9.3, and deduplicated since Ruby 2.1:

  https://bugs.ruby-lang.org/issues/5000

Semi-automatic memory management
--------------------------------

(Perhaps a controversial topic)

String#clear has existed since Ruby 1.9.1 and immediately releases
memory allocated from malloc(3). I use this to reduce memory
pressure and improve locality when working with large buffers.
In the C source code of Ruby, you will also find many uses of
"rb_str_resize(str, 0)" to clear buffers.
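
A minimal sketch of the effect, assuming C Ruby and its objspace
extension. ObjectSpace.memsize_of only reports an estimate, so the
numbers are indicative rather than exact:

```ruby
require "objspace"

buf = "x" * 1_000_000                  # a large, heap-allocated buffer
before = ObjectSpace.memsize_of(buf)   # roughly the buffer size in bytes

buf.clear                              # release the buffer right away
after = ObjectSpace.memsize_of(buf)    # only the small object itself remains

puts "memsize before clear: #{before}, after clear: #{after}"
```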

I don't know if this can be improved for out-of-the-box Ruby
users; and I don't know how some Rubyists feel about uglifying
code to reduce resource usage.

That's all I can think of for now, thanks for reading.

Footnotes:

* [ruby-core:35410]
  http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/35410
  https://public-inbox.org/ruby-core/?q=core:35410


(Mugurel Chirica) #2

Thank you.


(Tom Copeland) #3

Great stuff, thanks Eric!

On 1/2/18 8:52 PM, Eric Wong wrote:

> Hopefully some of you are trying and enjoying Ruby 2.5 by now.
> I figured I'd write about some changes to C Ruby over the years
> which make it easier to reduce memory use.


(Robert K.) #4

> Semi-automatic memory management
> --------------------------------
>
> (Perhaps a controversial topic)
>
> String#clear has existed since Ruby 1.9.1 and immediately releases
> memory allocated from malloc(3). I use this to reduce memory
> pressure and improve locality when working with large buffers.
> In the C source code of Ruby, you will also find many uses of
> "rb_str_resize(str, 0)" to clear buffers.
>
> I don't know if this can be improved for out-of-the-box Ruby
> users; and I don't know how some Rubyists feel about uglifying
> code to reduce resource usage.

Uglifying - gosh! This could be beautified by something like

class String
  # Yields the string to the block, then clears it, even if the block raises.
  def auto_clear
    result = yield self
    result.equal?(self) ? nil : result # avoid leaking the cleared string
  ensure
    clear
  end
end

Then people can do this with Strings whose lifetime they know.

expr_returning_string.auto_clear do |str|
  puts "We got #{str}."
end

In fact, the pattern could be generalized. Not sure though how useful this is.

> That's all I can think of for now, thanks for reading.

Thank you for the writeup!

Kind regards

robert


On Wed, Jan 3, 2018 at 2:52 AM, Eric Wong <e@80x24.org> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/


(Eric Wong) #5

Seems a bit verbose compared to buf.clear... I'd rather go
farther and reduce object counts, too (not just object sizes);
but that requires API changes.

So I'd like to see stuff like readpartial with the second
(outbuf) arg used more. Maybe each_line/gets could gain an
`outbuf' arg, too. But then again, maybe not many people use
readpartial with the `outbuf' arg, either.
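
For readers who have not used it, here is a rough sketch of
IO#readpartial with the `outbuf' arg. The pipe, the 4096-byte chunk
size, and the variable names are just assumptions for the example; the
point is that one String is reused for every read instead of
allocating a new one each time:

```ruby
reader, writer = IO.pipe
writer.write("a" * 10_000)
writer.close

buf = String.new(capacity: 4096)  # one buffer, reused for every read
total = 0
reused = true

begin
  loop do
    chunk = reader.readpartial(4096, buf)  # fills buf in place
    reused &&= chunk.equal?(buf)           # the return value is buf itself
    total += chunk.bytesize
  end
rescue EOFError
  # end of stream reached
ensure
  reader.close
end

buf.clear  # release the buffer once we are done with it
puts "read #{total} bytes, same buffer every time: #{reused}"
```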

Anyways, I've been working on a few things the past few days
in stdlib to make things better. At least #14315 and #14320
won't require end users to call String#clear themselves.

https://bugs.ruby-lang.org/issues/14268
https://bugs.ruby-lang.org/issues/14315
https://bugs.ruby-lang.org/issues/14319
https://bugs.ruby-lang.org/issues/14320

Robert Klemme <shortcutter@googlemail.com> wrote:

> Then people can do this with Strings whose lifetime they know.
>
> expr_returning_string.auto_clear do |str|
>   puts "We got #{str}."
> end
>
> In fact, the pattern could be generalized. Not sure though how
> useful this is.


(Andrew Vit) #6

Wow! Can we expect these optimizations to be released with 2.5.1?

These benchmarks look like very dramatic reductions for memory usage on widely used stdlib modules: care to speculate on any real-world impact, e.g. for rails applications?

I'm also wondering if there are more stdlib modules that could benefit from similar improvements to string buffers, for example csv, json, erb, yaml...

On 2018-01-05 15:11, Eric Wong wrote:

> Anyways, I've been working on a few things the past few days
> in stdlib to make things better. At least #14315 and #14320
> won't require end users to call String#clear themselves.
>
> https://bugs.ruby-lang.org/issues/14268
> https://bugs.ruby-lang.org/issues/14315
> https://bugs.ruby-lang.org/issues/14319
> https://bugs.ruby-lang.org/issues/14320


(Eric Wong) #7

> Anyways, I've been working on a few things the past few days
> in stdlib to make things better. At least #14315 and #14320
> won't require end users to call String#clear themselves.
>
> https://bugs.ruby-lang.org/issues/14268
> https://bugs.ruby-lang.org/issues/14315
> https://bugs.ruby-lang.org/issues/14319
> https://bugs.ruby-lang.org/issues/14320

> Wow! Can we expect these optimizations to be released with 2.5.1?

Unlikely, but zlib is split off and you might be able to upgrade
it independently.

> These benchmarks look like very dramatic reductions for memory usage on
> widely used stdlib modules: care to speculate on any real-world impact, e.g.
> for rails applications?

Sorry, tough to say; it depends on the large strings you're
dealing with. Zlib::GzipWriter is used by Rack::Deflate so
that one might have the biggest impact if you're deflating
in Rack (rather than the reverse proxy).

I've been sprinkling String#clear and reusing strings in my own
code for years, now; but I'm not sure how well it'd be received
on a larger scale. Starting with 2.5, Net::HTTP users can
safely #clear the buffer yielded by read_body; at least.

These uses of String#clear feel like whack-a-mole, though;
and everything in the code path needs to be written in a
memory-aware way to see the big benefits in those tickets.

> I'm also wondering if there are more stdlib modules that could benefit from
> similar improvements to string buffers, for example csv, json, erb, yaml...

For large buffers, I think clearing the result of File.read
after loading, and clearing the rendered/dumped results
after writing on the user's side should help.
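
A rough sketch of what that might look like on the user's side. The
temporary file and its JSON payload are made up for the example:

```ruby
require "json"
require "tempfile"

data = nil
raw = nil

Tempfile.create("payload") do |f|
  # Hypothetical payload, written only so the example has something to load.
  f.write(JSON.generate("items" => (1..1000).to_a))
  f.flush

  raw = File.read(f.path)   # the whole file as one large String
  data = JSON.parse(raw)    # the parsed structure is what we keep
  raw.clear                 # the source text is no longer needed
end

puts "parsed #{data["items"].size} items, raw buffer now #{raw.bytesize} bytes"
```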

I would also like to see more/better streaming interfaces in
Ruby; things like File.read are scary in the wrong hands and
I prefer the stream-everything mentality of awk/sed in shell
programming.
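
As one example of the streaming style, File.foreach handles a file one
line at a time instead of loading it whole. The temporary file and
the statistics gathered here are only for illustration:

```ruby
require "tempfile"

line_count = 0
longest = 0

Tempfile.create("log") do |f|
  1000.times { |i| f.puts("line #{i}") }
  f.flush

  # Only one line is held in memory at a time, unlike File.read.
  File.foreach(f.path) do |line|
    line_count += 1
    longest = line.bytesize if line.bytesize > longest
  end
end

puts "#{line_count} lines, longest is #{longest} bytes"
```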

yaml and json might benefit from small string reductions in
https://bugs.ruby-lang.org/issues/14225 , too.


Andrew Vit <andrew@avit.ca> wrote:

On 2018-01-05 15:11, Eric Wong wrote:


(Eric Wong) #8

> These benchmarks look like very dramatic reductions for memory usage on
> widely used stdlib modules: care to speculate on any real-world impact, e.g.
> for rails applications?

Fwiw, I would remain pessimistic about real-world visibility.
https://bugs.ruby-lang.org/issues/13085 was big for me last
year, but it was disappointing that nobody else seemed to notice
the regression.

I wrote:

> and everything in the code path needs to be written in a
> memory-aware way to see the big benefits in those tickets.

So yeah, I think there needs to be a giant shift in mentality
throughout the Ruby world for big improvements to be seen.

I am proud of you guys for not top-posting or (at least most of
you) for not using HTML :)


Andrew Vit <andrew@avit.ca> wrote:


(Andrew Vit) #9

> So yeah, I think there needs to be a giant shift in mentality
> throughout the Ruby world for big improvements to be seen.

Right, I don't think many rubyists commonly use strings as buffers, and the general trend is towards immutable objects inspired by functional style as well. I think these optimizations can still help if gems/libraries can make use of them beneath application code.

> I am proud of you guys for not top-posting or (at least most of
> you) for not using HTML :)

Haha, it helps to use an old-school mail client!


On 2018-01-06 14:17, Eric Wong wrote:


#10

I noticed a bunch of areas that could use some String#-@ love. Is
anyone working on it?

also the file should have frozen_string_literal: true

On Tue, Jan 9, 2018 at 4:44 AM, Andrew Vit <andrew@avit.ca> wrote:

> On 2018-01-06 14:17, Eric Wong wrote:
>
> > So yeah, I think there needs to be a giant shift in mentality
> > throughout the Ruby world for big improvements to be seen.
>
> Right, I don't think many rubyists commonly use strings as buffers, and the
> general trend is towards immutable objects inspired by functional style as
> well. I think these optimizations can still help if gems/libraries can make
> use of them beneath application code.
>
> > I am proud of you guys for not top-posting or (at least most of
> > you) for not using HTML :)
>
> Haha, it helps to use an old-school mail client!



(Eric Wong) #11

> I noticed a bunch of areas that could use some String#-@ love. Is
> anyone working on it?

https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb#L343-L358

Not at the moment, patches welcome.

> also the file should have frozen_string_literal: true

There needs to be a lot more tests written to avoid breakage.
We tried it in the stdlib in a few places and there was a lot
of breakage; so I think we should start with String#-@, first.


Sam Saffron <sam.saffron@gmail.com> wrote:


(ammartinez) #12

Thanks for sharing it! ;)

On 2018-01-03 02:52, Eric Wong wrote:

> That's all I can think of for now, thanks for reading.

--
Ana María Martínez Gómez - ammartinez@suse.de | ammartinez@suse.com
BuildService Engineer
SUSE Linux GmbH, Maxfeldstr. 5, D-90409 Nürnberg
Tel: +49-911-74053-0; Fax: +49-911-7417755; https://www.suse.com/
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard,
   Graham Norton, HRB 21284 (AG Nürnberg)


#13

> When given a non-frozen String as a hash key, Ruby transparently
> duplicates and freezes the key to avoid data corruption
> in case the original string is mutated [ruby-core:35410].

Sort of, I mean it works exactly as designed but has some subtle
issues that may lead to surprises.

x = {}
x["#{1}"] = 1
x["2"] = 1

y = {}
y["#{1}"] = 1
y["2"] = 1

puts x.keys.map{|k| "#{k} #{k.object_id}"}
puts y.keys.map{|k| "#{k} #{k.object_id}"}

1 70113575731380
2 70113575732160
1 70113575731280
2 70113575732160

So here we can see that the string "1" was frozen twice and not de-duped.

Hence:

x[-"#{1}"] = 1

Is the optimal thing to do in the current implementation.
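
A quick way to confirm that the String#-@ form shares one key object
across hashes (assuming Ruby 2.5 or later). How the un-prefixed form
behaves depends on the Ruby version, so this sketch only checks the
prefixed one:

```ruby
x = {}
y = {}

# Each "#{1}" builds a new String at run time; String#-@ maps both to the
# same frozen, deduplicated object before it is stored as a key.
x[-"#{1}"] = 1
y[-"#{1}"] = 1

x_key = x.keys.first
y_key = y.keys.first
shared = x_key.equal?(y_key)

puts "keys frozen: #{x_key.frozen? && y_key.frozen?}, same object: #{shared}"
```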

I mention this because I saw similar cases in the common open uri file.


(Eric Wong) #14

Yes, that sucks. I kinda wish we could have lived with the
small slowdown in bm_so_k_nucleotide.rb for
<https://bugs.ruby-lang.org/issues/9188>
and not reverted the change that deduped all hash keys.

Might be worth investigating again, now that our hash table
is faster.

Sam Saffron <sam.saffron@gmail.com> wrote:

> Hence:
>
> x[-"#{1}"] = 1
>
> Is the optimal thing to do in the current implementation.