A memcached-like server in Ruby - feasible?

Hi group,

I'm running a very high-load website done in Rails.

The number and duration of queries per-page is killing us. So we're
thinking of using a caching layer like memcached. Except we'd like
something more sophisticated than memcached.

Allow me to explain.

memcached is like an object, with a very limited API: basically
#get_value_by_key and #set_value_by_key.

One thing we need, that isn't supported by memcached, is to be able to
store a large set of very large objects, and then retrieve only a few
of them by certain parameters. For example, we may want to store 100K
Foo instances, and retrieve only the first 20 - sorted by their
#created_on attribute - whose #bar attribute equals 23.

We could store all those 100K Foo instances normally on the memcached
server, and let the Rails process retrieve them on each request. Then
the process could perform the filtering itself. The problem is that this is
very suboptimal, because we'd have to transfer a lot of data to each
process on each request, and very little of that data is actually
needed after the processing. I.e. we would pass 100K large objects,
while the process only really needs 20 of them.

Ideally, we could call:

memcached_improved.fetch_newest(:attributes => { :bar => 23 }, :limit => 20)

and have the memcached_improved server filter and return only the
required 20 objects by itself.
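
Conceptually, the filtering the server would do for that call is tiny -
something like the following sketch, where @store is assumed to be an
in-memory Hash of the cached Foo instances, keyed by id:

  # Hypothetical server-side implementation of fetch_newest: pick the
  # objects matching the requested attributes, newest first.
  def fetch_newest(attributes, limit)
    @store.values.
      select { |obj| attributes.all? { |name, value| obj.send(name) == value } }.
      sort_by { |obj| obj.created_on }.
      reverse.
      first(limit)
  end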

Now the question is:

How expensive would it be to write memcached_improved?

On the surface, this might seem easy to do with something like
Daemons[1] in Ruby (as most of our programmers are Rubyists). Just
write a simple class, have it run a TCP server and respond to
requests. Yet I'm sure it's not that simple, otherwise memcached would
have been trivial to write. There are probably stability issues for
multiple concurrent clients, multiple simultaneous read/write requests
(race conditions etc.) and heavy loads.

So, what do you think:

1) How would you approach the development of memcached_improved?

2) Is this task doable in Ruby? Or maybe only a Ruby + X combination
(X probably being C)?

3) How much time / effort / people / expertise should such a task
require? Is it feasible for a smallish team (~4 programmers) to put
together as a side-project over a couple of weeks?

Thanks,
-Tom

···

--
[1] http://daemons.rubyforge.org/

Tom Machinski wrote:

Hi group,

I'm running a very high-load website done in Rails.

The number and duration of queries per-page is killing us. So we're
thinking of using a caching layer like memcached. Except we'd like
something more sophisticated than memcached.

Allow me to explain.

memcached is like an object, with a very limited API: basically
#get_value_by_key and #set_value_by_key.

One thing we need, that isn't supported by memcached, is to be able to
store a large set of very large objects, and then retrieve only a few
of them by certain parameters. For example, we may want to store 100K
Foo instances, and retrieve only the first 20 - sorted by their
#created_on attribute - whose #bar attribute equals 23.
  
It looks like the job a database would do for you. Retrieving 20 large
objects with such conditions should be a piece of cake for any properly
tuned database. Did you try this with PostgreSQL or MySQL with indexes
on created_on and bar? How much memory did you give your database to
play with? If the size of the objects is so bad that it takes too much
time to extract them from the DB (or the traffic is too much for the DB
to use its own disk cache efficiently), you could retrieve only the ids
in the first pass with hand-crafted SQL and then fetch the whole objects
from memcache (and only go to the DB if memcache doesn't have the object
you are looking for).
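
A rough sketch of that two-pass idea from the Rails side (CACHE here is
assumed to be your memcache-client connection, and the "Foo:<id>" key
scheme is made up for illustration):

  # First pass: hand-crafted SQL returns only the 20 ids we actually need.
  ids = Foo.connection.select_values(
    "SELECT id FROM foos WHERE bar = 23 ORDER BY created_on DESC LIMIT 20"
  )

  # Second pass: fetch the full objects from memcached, falling back to the
  # DB (and repopulating the cache) only for ids that aren't cached yet.
  keys   = ids.map { |id| "Foo:#{id}" }
  cached = CACHE.get_multi(*keys)

  foos = ids.map do |id|
    cached["Foo:#{id}"] || begin
      foo = Foo.find(id)
      CACHE.set("Foo:#{id}", foo)
      foo
    end
  end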

Lionel.

<snip>

i'm reading this as

   - need query
   - need readonly
   - need sorting
   - need fast
   - need server

and thinking: how isn't this a readonly slave database? i think that mysql can either do this with a readonly slave *or* it cannot be done with modest resources.

my 2cts.

a @ http://codeforpeople.com/

···

On Oct 27, 2007, at 4:31 PM, Tom Machinski wrote:

Hi group,

I'm running a very high-load website done in Rails.

<snip>

--
it is not enough to be compassionate. you must act.
h.h. the 14th dalai lama

I created a very simple DRb-based cache server some time ago for one of my
projects; essentially to share objects across web nodes, but also to cache
objects (persistent and temporary): http://boogaloo.rubyforge.org/

I don't recommend you use this project; I haven't used it myself for quite a
while and it has a number of issues I haven't addressed. You may find it a
helpful basis implementation if you should decide to go the pure Ruby
route.

Regards
Ian

···

On 27/10/2007, Tom Machinski <tom.machinski@gmail.com> wrote:

Hi group,

I'm running a very high-load website done in Rails.

<snip>

On Sun, Oct 28, 2007 at 07:31:30AM +0900 Tom Machinski mentioned:

2) Is this task doable in Ruby? Or maybe only a Ruby + X combination
(X probably being C)?

I believe you can achieve high efficiency in server design by using an
event-driven design. There are some event libraries available for Ruby,
e.g. EventMachine. In this case the scalability of the server should
be comparable with the C version.

Threads will have a huge overhead with many clients.

BTW, the original memcached uses an event-driven design too, IIRC.
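
A minimal sketch of that style with EventMachine (the line protocol, port
and module name here are invented; a real server would need to buffer
partial lines and speak something memcached-like):

  require 'rubygems'
  require 'eventmachine'

  # Toy line-based get/set protocol over a single in-process Hash.
  module ToyCacheServer
    STORE = {}

    # called by EventMachine whenever a chunk of data arrives on the socket
    def receive_data(data)
      data.each_line do |line|
        command, key, value = line.strip.split(' ', 3)
        case command
        when 'get' then send_data("#{STORE[key]}\r\n")
        when 'set' then STORE[key] = value; send_data("STORED\r\n")
        else send_data("ERROR\r\n")
        end
      end
    end
  end

  EventMachine.run do
    EventMachine.start_server('0.0.0.0', 11222, ToyCacheServer)
  end

Because every request is handled in the single reactor thread, the sketch
needs no locking at all.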

···

--
Stanislav Sedov
ST4096-RIPE

What I would try is using a slave to replicate just the tables you
need (actually just the indexes, if that were possible) and memcached to
keep copies of all those objects. I've been using memcached for years
and I can swear by it. But keeping indexes in memcached is not easy or
reliable to do, and MySQL would do a better job. So then you would
query the slave DB for the conditions you need, but only to return the
ids, and then you would ask memcached for those objects. I've been
doing something similar in my CMS and it has worked great for me. Here
is an article that might explain better where I'm coming from [1]. And
if MySQL clusters make you feel a little dizzy, simple slave
replication and mysql-proxy [2] might help out too.
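
The "memcached keeps copies of all those objects" half of that can be as
simple as a write-through callback (CACHE and the key scheme here are
assumptions; use whatever your read path expects):

  class Foo < ActiveRecord::Base
    after_save :update_cache

    private

    # Push the fresh copy into memcached whenever the row changes, so the
    # ids-from-the-slave + objects-from-memcached read path stays current.
    def update_cache
      CACHE.set("Foo:#{id}", self)
    rescue MemCache::MemCacheError
      # a cache hiccup shouldn't break the write path
    end
  end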

Hope it helps,

Adrian Madrid

[1] http://blog.methodmissing.com/2007/4/24/partially-bypass-activerecord-instantiation-when-using-memcached/
[2] http://forge.mysql.com/wiki/MySQL_Proxy

···

On Oct 27, 4:31 pm, Tom Machinski <tom.machin...@gmail.com> wrote:

Hi group,

I'm running a very high-load website done in Rails.

<snip>

why don't you try Gemstone or other object-oriented databases?

besides, memcache isn't THAT much faster than a database; it's faster because it can store objects in memory, but if you need queries it loses all its advantages.

greets

It looks like the job a database would do for you. Retrieving 20 large
objects with such conditions should be a piece of cake for any properly
tuned database. Did you try this with PostgreSQL or MySQL with indexes
on created_on and bar?

Yes, I'm using MySQL 5, and all query columns are indexed.

How much memory did you give your database to
play with?

Not sure right now, I'll ask my admin and reply.

If the size of the objects is so bad that it takes too much time
to extract them from the DB (or the traffic is too much for the DB to use
its own disk cache efficiently), you could retrieve only the ids in the
first pass with hand-crafted SQL and then fetch the whole objects from
memcache (and only go to the DB if memcache doesn't have the object you
are looking for).

Might be a good idea.

Lionel.

Long term, my goal is to minimize the number of queries that hit the
database. Some of the queries are more complex than the relatively
simple example I've given here, and I don't think I could optimize
them much beyond 0.01 secs per query.

I was hoping that memcached_improved would alleviate some of the pains
associated with database scaling, e.g. building a replicating cluster,
etc. - basically what memcached does for you, except that, as demonstrated,
memcached by itself seems insufficient for our needs.

Thanks,
-Tom

···

On 10/28/07, Lionel Bouton <lionel-subscription@bouton.name> wrote:

ara.t.howard wrote:

Hi group,

I'm running a very high-load website done in Rails.

<snip>

i'm reading this as

  - need query
  - need readonly
  - need sorting
  - need fast
  - need server

and thinking: how isn't this a readonly slave database? i think that mysql can either do this with a readonly slave *or* it cannot be done with modest resources.

my 2cts.

Add "large set of very large (binary?) objects". So ... yes, at least *one* database/server. This is exactly the sort of thing you *can* throw hardware at. I guess I'd pick PostgreSQL over MySQL for something like that, but unless you're a billionaire, I'd be doing it from disk and not from RAM. RAM-based "databases" look really attractive on paper, but they tend to look better than they really are for a lot of reasons:

1. *Good* RAM -- the kind that doesn't fall over in a ragged heap when challenged with "memtest86" -- is not inexpensive. Let's say the objects are "very large" -- how about a typical CD length of 700 MB? OK ... too big -- how about a three minute video highly compressed. How big are those puppies? Let's assume a megabyte. 100K of those is 100 GB. Wanna price 100 GB of *good* RAM? Even with compression, it doesn't take much stuff to fill up a 160 GB iPod, right?

2. A good RDBMS design / query planner is amazingly intelligent, and you can give it hints. It might take you a couple of weeks to build your indexes but your queries will run fast afterwards.

3. RAID 10 is your friend. Mirroring preserves your data when a disk dies, and striping makes it come into RAM quickly.

4. Enterprise-grade SANs have lots of buffering built in. And for that stuff, you don't have to be a billionaire -- just a plain old millionaire.

"Premature optimization is the root of all evil?" Bullshit! :slight_smile:

···

On Oct 27, 2007, at 4:31 PM, Tom Machinski wrote:

The problem is that for a perfectly normalized database, those queries
are *heavy*.

We're using straight, direct SQL (no ActiveRecord calls) there, and
several DBAs have already looked into our query strategy. Bottom line
is that each query on the normalized database is non-trivial, and they
can't reduce it to less than 0.2 secs / query. As we have 5+ of these
queries per page, we'd need one MySQL server for every
request-per-second we want to serve. As we need at least 50 reqs/sec,
we'd need 50 MySQL servers (and probably something similar in terms of
web servers). We can't afford that.

We can only improve the queries' TTC by replicating data inside the
database, i.e. de-normalizing it with internal caching at the table
level (basically, that amounts to replicating certain columns from
table `bars` into table `foos`, thus saving some very heavy JOINs).
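
As a purely hypothetical illustration of that kind of table-level
de-normalization (the column names are invented), a migration could copy
the hot column from `bars` onto `foos` so the query can skip the JOIN:

  class DenormalizeBarsOntoFoos < ActiveRecord::Migration
    def self.up
      add_column :foos, :bar_name, :string

      # backfill from the source table in one pass (MySQL multi-table UPDATE)
      execute "UPDATE foos, bars SET foos.bar_name = bars.name WHERE foos.bar_id = bars.id"

      add_index :foos, :bar_name
    end

    def self.down
      remove_index :foos, :bar_name
      remove_column :foos, :bar_name
    end
  end

The price, of course, is that the application then has to keep
foos.bar_name in sync whenever `bars` changes.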

But if we're already de-normalizing, caching and replicating data, we
might as well create another layer of de-normalized, processed data
between the database and the Rails servers. That way, we would need
fewer MySQL servers, serve requests faster (as the layer would hold
the data in an already processed state), and save much of the
replication / clustering overhead.

-Tom

···

On 10/28/07, ara.t.howard <ara.t.howard@gmail.com> wrote:

i'm reading this as

   - need query
   - need readonly
   - need sorting
   - need fast
   - need server

and thinking: how isn't this a readonly slave database? i think that
mysql can either do this with a readonly slave *or* it cannot be done
with modest resources.

Thanks, Ian!

Would you mind sharing - here, or by linking a blog / text, or
privately if you prefer - some information about these issues?

I'm asking for two reasons:

1) To learn about possible pitfalls / complications / costs involved
in taking the pure Ruby route.

2) We may decide to adopt your project and try to address those issues
to use the patched Boogaloo in production.

Thanks,
-Tom

···

On 10/28/07, Ian Leitch <port001@gmail.com> wrote:

I don't recommend you use this project; I haven't used it myself for quite a
while and it has a number of issues I haven't addressed. You may find it a
helpful basis implementation if you should decide to go the pure Ruby
route.

I believe you can achieve high efficiency in server design by using an
event-driven design. There are some event libraries available for Ruby,
e.g. EventMachine. In this case the scalability of the server should
be comparable with the C version.

Threads will have a huge overhead with many clients.

BTW, the original memcached uses an event-driven design too, IIRC.

Yes, memcached (including latest) uses libevent.

I'm not completely sure whether a production-grade server of this sort
is feasible in Ruby. Many people, both here and elsewhere, seem to
think it should be done in C for better stability / efficiency /
resource consumption.

Thanks,
-Tom

···

On 10/28/07, Stanislav Sedov <stas@freebsd.org> wrote:

--
Stanislav Sedov
ST4096-RIPE

Marcin Raczkowski wrote:

why don't you try Gemstone or other object-oriented databases?

besides, memcache isn't THAT much faster than a database; it's faster because it can store objects in memory, but if you need queries it loses all its advantages.

greets

here's a simple object-oriented DB i wrote ages ago - just throw out the state machine stuff and have phun. searching is done in ruby and is really easy. it took me one day, if i remember correctly, so that should be an indication of the time you'd need to make a simple ODB

  # Module responsible for handling request information
  #
  # Information status:
  # - :waiting (waiting for server response)
  # - :progress (server reported progress)
  # - :results (server returned results)
  # - :error (as above, but an error)
  # - :timeout (request timed out)
  # - :collect (to be garbage collected; right now for debugging purposes)
  module Informations
   # default time to live of message (used when expire is set to :ttl)
   attr_accessor :default_ttl
   # default timeout - time betwen ANY actions sent by server
   attr_accessor :default_timeout
   # use garbage collecting?
   attr_accessor :gc

   def init(ttl=30, timeout=30, gc=true)
     @gc = gc
     @default_ttl = ttl
     @default_timeout = timeout
     @informations={}
   end

   # creates new informations about request, id is request id,
   # hash should contain additional informations (the'll be merged)
   def new_info(id, hash)
     #hash=hash.dup
     #hash.delete(:data)
     info={}
     info[:id]=id
     info[:status]=:waiting
     info[:timeout]=@default_timeout
     info[:last_action]=info[:start]=Time.now
     info[:expire]=:new
     info[:ttl]=@default_ttl
     info.merge! hash

     @informations[id] = info
   end

   # information state machine
   # checks message status - and takes care of checking state transitions
   # if transition is wrong it's ignored (no exception is rised!!)
   #
   # new info is returned
   def change_status(info, state)
     case info[:status]
       when :waiting, :progress
         if [:progress, :results, :error, :timeout].include? state
           info[:status]=state
           info[:stop]=Time.now unless state == :progress
           info[:last_action]=Time.now
         end
       when :results, :error, :timeout
         if state == :collect
           info[:status]=state
           info[:last_action]=Time.now
         end
     end
     info
   end

   # checks if the message timed out
   def timeout?(info)
     change_status(info, :timeout) if ([:waiting, :progress].include? info[:status]) && (Time.now > info[:last_action] + info[:timeout])
   end

   # finds information with id
   #
   # takes care of marking msg, as timed out/ to be collected
   # returns info
   def find(id)
     self.timeout?(@informations[id])

     begin
       info = @informations[id].dup

       #return nil if info[:state]==:collect # don't return expired infos
       if info[:expire]==:first
         @gc ? change_status(@informations[id], :collect) : @informations.delete(id)
       end
        if (info[:expire]==:ttl) && (Time.now > info[:last_action] + info[:ttl]) # expired only once the TTL has elapsed
         @gc ? change_status(@informations[id], :collect) : @informations.delete(id)
       end

     rescue Exception
       info=nil
     end

     #info[:last_action]=Time.now preventing expire ?
     info
   end

   # finds all message matching criteria block
   # or checks if :server_id and :name provided in hash match
   #
   # block should return true if information should be returned
   #
   # Examples:
   # find_all({:name=>"actions", :server_id=>"121516136171356151"})
   # find_all() {|i| i[:last_action] > Time.now-60 }
   # returns all informations that state changed minute or less ago
   # find_all() {|i| i[:status]==:error}
   # returns all messages that returned errors
   # gc! if find_all() {|i| i[:status]==:collect}.size > 1000
   # clears old messages when there's more then 1000 of them
   def find_all(hash={})
     res = []
     @informations.each_pair { |k,v|
       if block_given?
         res << self.find(k) if yield(v.dup)
       else
         catch(:no_match) {
           # add more here!!
           [:server_id, :name].each { |x|
             throw(:no_match) if hash[x] && hash[x] != v[x]
           }
           res << self.find(k)
         }
       end
     }
     res.empty? ? nil : res
   end

   # clears all messages marked for collection
   def gc!
     @informations.each_pair { |k,v|
       @informations.delete(k) if v[:status]==:collect
     }
   end
  end

"Tom Machinski" <tom.machinski@gmail.com> writes:

Long term, my goal is to minimize the number of queries that hit the
database. Some of the queries are more complex than the relatively
simple example I've given here, and I don't think I could optimize
them much beyond 0.01 secs per query.

I was hoping that memcached_improved would alleviate some of the pains
associated with database scaling, e.g. building a replicating cluster,
etc. - basically what memcached does for you, except that, as demonstrated,
memcached by itself seems insufficient for our needs.

The other thing you can play with is using sqlite as the local (one
per app server) cache engine.
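
That could look roughly like this with the sqlite3-ruby bindings (the
schema, file location and column names are just placeholders), giving each
app server a small, queryable local cache it can rebuild at will:

  require 'rubygems'
  require 'sqlite3'

  # One small database file per app server, holding the denormalized,
  # already-processed rows the pages actually need.
  db = SQLite3::Database.new('/tmp/local_cache.db')

  db.execute <<-SQL
    CREATE TABLE IF NOT EXISTS foo_cache (
      id         INTEGER PRIMARY KEY,
      bar        INTEGER,
      created_on TEXT,
      payload    BLOB
    )
  SQL
  db.execute 'CREATE INDEX IF NOT EXISTS idx_foo_cache_bar ON foo_cache (bar, created_on)'

  # The hot query then runs locally, with no network round trip:
  rows = db.execute(
    'SELECT payload FROM foo_cache WHERE bar = ? ORDER BY created_on DESC LIMIT 20', 23
  )
  foos = rows.map { |row| Marshal.load(row.first) }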

YS.

http://drawohara.tumblr.com/post/17471102

couldn't resist...

cheers.

a @ http://codeforpeople.com/

···

On Oct 28, 2007, at 12:48 AM, M. Edward (Ed) Borasky wrote:

it doesn't take much stuff to fill up a 160 GB iPod, right?

--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

A few weeks ago, I had 2 1GB RAM modules, which were fine with running
memtest86 over the weekend. But I could not get gcc 4.1.1 to compile
itself while they were present. The error message even hinted at
using defective hardware. After exchanging them, it worked. So
nowadays, I prefer compiling gcc to memtest86.

s.

···

On Sun, 28 Oct 2007 15:48:11 +0900 "M. Edward (Ed) Borasky" <znmeb@cesmail.net> wrote:

1. *Good* RAM -- the kind that doesn't fall over in a ragged heap when
challenged with "memtest86" -- is not inexpensive.

Tom Machinski wrote:

···

On 10/28/07, ara.t.howard <ara.t.howard@gmail.com> wrote:

with modest resources.

The problem is that for a perfectly normalized database, those queries
are *heavy*.

We're using straight, direct SQL (no ActiveRecord calls) there, and
several DBAs have already looked into our query strategy. Bottom line
is that each query on the normalized database is non-trivial, and they
can't reduce it to less than 0.2 secs / query.

Try enabling the MySQL query cache. For many applications even a few MB
can work wonders.
--
Posted via http://www.ruby-forum.com/.

It's close to a year since I last worked on Boogaloo, so please forgive my
poor memory.

* A thread pool is used to ensure (to some very basic degree) that the
server doesn't get overloaded quite so easily. The problem with this is that
DRb handles each connection in a new thread; if you had a long-running
operation and many new connection attempts being made, an internal DRb limit
may be hit and connections may be dropped. I never looked into this; you may
want to remove the thread pool entirely.

* The cache implementations aren't thread safe. I left this out deliberately
to avoid a performance hit from high lock contention, though in hindsight
I'd call this a premature optimization. I'd recommend you make this thread
safe before using it in production; race conditions are going to cause you
much more of a headache than scalability issues. If you ever reach the point
where lock contention is an issue, you could probably split the namespaces
into chunks and protect each with separate locks (there's a rough sketch of
this after the list).

* You mentioned you want to store and retrieve a large number of objects. I
only ever used a small number of objects, or single objects, with Boogaloo.
You may want to do some quick and dirty tests to see how much Ruby's
marshaling costs for large numbers of objects.
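
A quick and dirty way to measure that marshaling cost (the object count and
shape here are placeholders):

  require 'benchmark'

  # Marshal.dump/load round-trips roughly model what DRb has to do for
  # every bulk response.
  objects = Array.new(10_000) { |i| { :id => i, :bar => 23, :payload => 'x' * 500 } }

  Benchmark.bm(6) do |bm|
    dumped = nil
    bm.report('dump:') { dumped = Marshal.dump(objects) }
    bm.report('load:') { Marshal.load(dumped) }
  end

And a rough sketch of the lock-striping idea from the thread-safety point
above - hash each key onto one of a fixed set of Mutexes so unrelated keys
don't contend (the class and method names are made up):

  require 'thread'

  # Striped locking: N mutexes guard N slices of the key space.
  class StripedCache
    def initialize(stripes = 16)
      @locks = Array.new(stripes) { Mutex.new }
      @data  = Array.new(stripes) { {} }
    end

    def get(key)
      stripe = key.hash % @locks.size
      @locks[stripe].synchronize { @data[stripe][key] }
    end

    def set(key, value)
      stripe = key.hash % @locks.size
      @locks[stripe].synchronize { @data[stripe][key] = value }
    end
  end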

You'll no doubt find other issues if you decide to use Boogaloo as a basis.
As I said, I've not used it in over a year, and for someone that has been
using Ruby for less than two, I wouldn't like to put Boogaloo forward as an
example of what I'm currently capable of ;-)

···

On 28/10/2007, Tom Machinski <tom.machinski@gmail.com> wrote:

On 10/28/07, Ian Leitch <port001@gmail.com> wrote:
> I don't recommend you use this project; I haven't used it myself for quite a
> while and it has a number of issues I haven't addressed. You may find it a
> helpful basis implementation if you should decide to go the pure Ruby
> route.

Thanks, Ian!

Would you mind sharing - here, or by linking a blog / text, or
privately if you prefer - some information about these issues?

I'm asking for two reasons:

1) To learn about possible pitfalls / complications / costs involved
in taking the pure Ruby route.

2) We may decide to adopt your project and try to address those issues
to use the patched Boogaloo in production.

Thanks,
-Tom

Add "large set of very large (binary?) objects". So ... yes, at least
*one* database/server. This is exactly the sort of thing you *can* throw
hardware at. I guess I'd pick PostgreSQL over MySQL for something like
that, but unless you're a billionaire, I'd be doing it from disk and not
from RAM. RAM-based "databases" look really attractive on paper, but
they tend to look better than they really are for a lot of reasons:

1. *Good* RAM -- the kind that doesn't fall over in a ragged heap when
challenged with "memtest86" -- is not inexpensive. Let's say the objects
are "very large" -- how about a typical CD length of 700 MB? OK ... too
big -- how about a three minute video highly compressed. How big are
those puppies? Let's assume a megabyte. 100K of those is 100 GB. Wanna
price 100 GB of *good* RAM? Even with compression, it doesn't take much
stuff to fill up a 160 GB iPod, right?

I might have given you a somewhat inflated impression of how large
our data-set is :-)

We have about 100K objects, occupying ~500KB per object. So all in
all, the total weight of our dataset is no more than 500MBs. We might
grow to maybe twice that in the next 2 years. But that's it.

So it's very feasible to keep the entire data-set in *good* RAM for a
reasonable cost.

2. A good RDBMS design / query planner is amazingly intelligent, and you
can give it hints. It might take you a couple of weeks to build your
indexes but your queries will run fast afterwards.

Good point. Unfortunately, MySQL 5 doesn't appear to be able to take
hints. We've analyzed our queries and there are some strategies we
could definitely improve with manual hinting, but alas we'd need to
switch to an RDBMS that supports those.

3. RAID 10 is your friend. Mirroring preserves your data when a disk
dies, and striping makes it come into RAM quickly.

4. Enterprise-grade SANs have lots of buffering built in. And for that
stuff, you don't have to be a billionaire -- just a plain old millionaire.

We had some bad experience with a poor SAN setup, though we might have
been victims of improper installation.

Thanks,
-Tom

···

On 10/28/07, M. Edward (Ed) Borasky <znmeb@cesmail.net> wrote:

Tom Machinski wrote:

···

<snip>

As we have 5+ of these
queries per page, we'd need one MySQL server for every
request-per-second we want to serve. As we need at least 50 reqs/sec,
we'd need 50 MySQL servers (and probably something similar in terms of
web servers). We can't afford that.

<snip>

-Tom

MapReduce and Starfish?