Beyond YAML? (scaling)

Okay, so I didn't get to it this weekend, but it is an interesting
project. Can you explain a bit more about the data requirements? I have
some questions inline.

Hi,

Jeremy Hinegardner wrote:
>
>If you want to describe your data needs a bit, and what operations you
>need to operate on it, I'll be happy to play around with a Ruby/SQLite3
>program and see what pops out.

I've created a small tolerance DSL, and coupled with the Monte Carlo
Method[1] and the Pearson Correlation Coefficient[2], I'm performing
sensitivity analysis[2] on some of the simulation codes used for our
Orion vehicle[3]. In other words, jiggle the inputs, and see how
sensitive the outputs are and which inputs are the most influential.

The current system[5] works, and after the YAML->Marshal migration,
it scales well enough for now. The trouble is that the entire architecture
is wrong if I want to monitor the Monte Carlo statistics to see
whether I can stop sampling, i.e., whether the statistics have converged.

The current system consists of the following steps:

1) Prepare a "sufficiently large" number of cases, each with random
    variations of the input parameters per the tolerance DSL markup.
    Save all these input variables and all their samples for step 5.

So the DSL generates a 'large' number of cases listing the input
parameters and their associated values. Is the list of input parameters
static across all cases, or per set of cases?

That is, for a given "experiment" you have the same set of parameters,
just with a "large" number of different values to put in those
parameters.

2) Run all the cases.
3) Collect all the samples of all the outputs of interest.

I'm also assuming that the output(s) for a given "experiment" would be
consistent in their parameters?

4) Compute a running history of the output statistics to see
    if they have converged, i.e., whether the "sufficiently large"
    guess was correct -- typically a wasteful number of around 3,000.
    If not, start at step 1 again with a bigger number of cases.

So right now you are doing say for a given experiment f:

    inputs : i1,i2,i3,...,in
    outputs : o1,o2,o3,...,om

run f(i1,i2,i3,...,in) -> [o1,...,om] where the values for i1,...,in are
"jiggled". And you have around 3,000 different sets of inputs.

5) Compute normalized Pearson correlation coefficients for the
    outputs and see which inputs they are most sensitive to by
    using the data collected in steps 1 and 3.
6) Lobby for experiments to nail down these "tall pole" uncertainties.

This system is plagued by the question of what counts as "sufficiently large".
The next generation system would do steps 1 through 3 in small
batches, and at the end of each batch, check for the statistical
convergence of step 4. If convergence has been reached, shutdown
the Monte Carlo process, declare victory, and proceed with steps
5 and 6.
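The batch-and-check loop above can be sketched as follows; the sampler, the
simulation stand-in, the batch size, and the convergence tolerance are all
hypothetical values for illustration, not the actual DSL or codes:

```ruby
# Hypothetical sampler: jiggle an input per a tolerance like 1.05 +/- 0.2
# (uniform jiggling is an assumption; the real DSL may differ).
def sample_input
  1.05 + (2.0 * rand - 1.0) * 0.2
end

# Stand-in for running one case through a real simulation code.
def run_case(x)
  x * x
end

batch_size    = 50
outputs       = []
previous_mean = nil

loop do
  # Step 1-3 in a small batch: generate inputs, run cases, collect outputs.
  batch_size.times { outputs << run_case(sample_input) }

  # Step 4 check: has the running mean settled between batches?
  mean = outputs.inject(0.0) { |s, o| s + o } / outputs.size
  break if previous_mean && (mean - previous_mean).abs < 1.0e-3
  previous_mean = mean
end

puts "converged after #{outputs.size} cases"
```

In the real system the convergence test would presumably watch more than the
mean, but the control flow -- batch, check, stop early -- is the same.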

[...]

[5] The current system consists of 5 Ruby codes at ~40 lines each
plus some equally tiny library routines.

Would you be willing to share these? I'm not sure if what I'm assuming
about your problem is correct or not, but it intrigues me and I'd like
to fiddle with the general problem :-). I'd be happy to talk offlist
too.

enjoy,

-jeremy

···

On Sat, May 05, 2007 at 11:20:05AM +0900, Bil Kleb wrote:

--

Jeremy Hinegardner jeremy@hinegardner.org

Brian Candler wrote:

samples[tag] ||= []
samples[tag] << sample

And you can probably combine:

(samples[tag] ||= []) << sample
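For reference, the combined idiom in action -- a minimal runnable sketch with
made-up tags and sample values:

```ruby
samples = {}
data = [['F_x', 1.23], ['F_x', 1.12], ['q_r', 1.34e+9]]

data.each do |tag, sample|
  # Create the array on first sight of a tag, then append to it.
  (samples[tag] ||= []) << sample
end

p samples   # => {"F_x"=>[1.23, 1.12], "q_r"=>[1340000000.0]}
```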

Thanks; I figured the magic ||= operator was somehow
involved, but I still don't have that thing memorized...
yet more aggressive incompetence. :-)

Thanks again,

···

--
Bil Kleb
http://fun3d.larc.nasa.gov

William James wrote:
>
> Would making a copy of the hash use too much
> memory or time?

I don't know, but that's surely another way out of
the Marshal-hash-proc trap...

Thanks,
--
Bil Kleb
http://fun3d.larc.nasa.gov

I dunno how safe it is to rely on this behavior, but calling Hash#default=
strips the proc and makes the hash marshallable:

h = Hash.new { |h,k| h[k] = [] }

=> {}

h["a"]

=> []

h

=> {"a"=>[]}

h["b"] = 7

=> 7

h

=> {"a"=>[], "b"=>7}

h.default = nil

=> nil

h

=> {"a"=>[], "b"=>7}

Marshal.dump h

=> "\004\b{\a\"\006a[\000\"\006bi\f"

Of course this means you won't get your default proc back when you load the
hash and mutate it. However:

db = Marshal.load( from_disk )
delta = Hash.new { |h, k| h[k] = if db.has_key? k then db[k] else [] end }
delta["foo"] = "add a new value"
delta["bar"] << "update something in the original"
... more of the same ...
to_disk = Marshal.dump( db.merge!( delta ) )

···

On 5/4/07, Bil Kleb <Bil.Kleb@nasa.gov> wrote:

Matt Lawrence wrote:

You're building an Orion? Please tell me it's not true!

No; I should have been more specific..."our Orion vehicle /design/". :-)

Later,

···

--
Bil Kleb
http://fun3d.larc.nasa.gov

Jeremy Hinegardner wrote:

So the DSL generates a 'large' number of cases listing the input
parameters and their associated values. Is the list of input parameters
static across all cases, or per set of cases?

The input parameter names are static across all cases, but
for each case, the parameter value will vary randomly according
to the tolerance DSL, e.g., 1.05+/-0.2. Currently, I have all
these as a hash of arrays, e.g.,

  { 'F_x' => [ 1.23, 1.12, 0.92, 1.01, ... ],
    'q_r' => [ 1.34e+9, 3.89e+8, 8.98e+8, 5.23e+9, ... ], ... }

where 1.23 is the sample for input parameter 'F_x' for the
first case, 1.12 is the sample for the second case, etc.,
and 1.34e+9 is the sample for input parameter 'q_r' for
the first case, and so forth.
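A sketch of how such a hash of arrays could be generated; the nominal and
tolerance pairs are made-up values, and uniform jiggling is an assumption
standing in for whatever distribution the tolerance DSL actually specifies:

```ruby
# Hypothetical tolerances: parameter name => [nominal, +/- tolerance]
tolerances = { 'F_x' => [1.05, 0.2], 'q_r' => [1.0e+9, 5.0e+8] }
n_cases    = 4

inputs = {}
tolerances.each do |name, (nominal, tol)|
  # One jiggled sample per case, kept in case order.
  inputs[name] = Array.new(n_cases) { nominal + (2.0 * rand - 1.0) * tol }
end

# inputs now maps each parameter to one sample per case, e.g.
# { 'F_x' => [1.23, 1.12, ...], 'q_r' => [1.34e+9, ...], ... }
```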

That is, for a given "experiment" you have the same set of parameters,
just with a "large" number of different values to put in those
parameters.

Yes, if I understand you correctly.

2) Run all the cases.
3) Collect all the samples of all the outputs of interest.

I'm also assuming that the output(s) for a given "experiment" would be
consistent in their parameters?

Yes, the output parameter hash has the same structure as the input
hash, although it typically has fewer parameters. The number
and sequence of values (realizations) for each output parameter,
however, corresponds exactly to the array of samples for each input
parameter. For example, the outputs hash may look like,

  { 'heating' => [ 75.23, 76.54, ... ],
    'stag_pr' => [ 102.13, 108.02, ... ], ... }

Here, the 2nd realization of the output 'stag_pr', 108.02, corresponds
to the 2nd case and is associated with the 2nd entries in the 'F_x'
and 'q_r' arrays, 1.12 and 3.89e+8, respectively.
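That per-case correspondence is exactly what a Pearson coefficient needs: any
input array can be correlated directly against any output array. A from-scratch
sketch, with sample values invented to match the examples above:

```ruby
# Pearson correlation coefficient between two equal-length sample arrays:
#   r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
def pearson(xs, ys)
  n   = xs.size.to_f
  mx  = xs.inject(0.0) { |s, v| s + v } / n
  my  = ys.inject(0.0) { |s, v| s + v } / n
  cov = xs.zip(ys).inject(0.0) { |s, (x, y)| s + (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.inject(0.0) { |s, v| s + (v - mx)**2 })
  sy  = Math.sqrt(ys.inject(0.0) { |s, v| s + (v - my)**2 })
  cov / (sx * sy)
end

f_x     = [1.23, 1.12, 0.92, 1.01]     # input samples, one per case
heating = [75.23, 76.54, 71.10, 73.80] # output realizations, same case order

puts pearson(f_x, heating)   # strongly positive for this invented data
```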

So right now you are doing say for a given experiment f:

    inputs : i1,i2,i3,...,in
    outputs : o1,o2,o3,...,om

run f(i1,i2,i3,...,in) -> [o1,...,om] where the values for i1,...,in are
"jiggled". And you have around 3,000 different sets of inputs.

Yes, where 'inputs' and 'outputs' are vectors m and k long,
respectively; so you have a matrix of values, e.g.,

  input1 : i1_1, i1_2, i1_3, ..., i1_n
  input2 : i2_1, i2_2, i2_3, ..., i2_n
    .        .     .     .          .
    .      [ m x n matrix ]         .
    .        .     .     .          .
  inputm : im_1, im_2, im_3, ..., im_n

  output1 : o1_1, o1_2, o1_3, ..., o1_n
  output2 : o2_1, o2_2, o2_3, ..., o2_n
    .         .     .     .          .
    .       [ k x n matrix ]         .
    .         .     .     .          .
  outputk : ok_1, ok_2, ok_3, ..., ok_n

Would you be willing to share these?

/I/ am willing, but unfortunately I'm also mired in red tape.

I'm not sure if what I'm assuming
about your problem is correct or not, but it intrigues me and I'd like
to fiddle with the general problem :-).

The more I explain it, the more I learn about it; so thanks
for the interest.

Regards,

···

--
Bil Kleb
http://fun3d.larc.nasa.gov

Bill, maybe you want to have a look at JSON
http://json.rubyforge.org/

I do not have time right now to benchmark the reading, but the writing
gives some spectacular results, look at this:

517/17 > cat test-out.rb && ruby test-out.rb
# vim: sts=2 sw=2 expandtab nu tw=0:

require 'yaml'
require 'rubygems'
require 'json'
require 'benchmark'

@hash = Hash[*(1..100).map{|l| "k_%03d" % l}.zip([*1..100]).flatten]

Benchmark.bmbm do |bench|
  bench.report( "yaml" ) { 50.times{ @hash.to_yaml } }
  bench.report( "json" ) { 50.times{ @hash.to_json } }
end

Rehearsal ----------------------------------------
yaml 0.630000 0.030000 0.660000 ( 0.748123)
json 0.020000 0.000000 0.020000 ( 0.079732)
------------------------------- total: 0.680000sec

           user system total real
yaml 0.590000 0.000000 0.590000 ( 0.754097)
json 0.020000 0.000000 0.020000 ( 0.018363)

Looks promising, n'est-ce pas?

Maybe you want to investigate that a little bit more, JSON is of
course very readable, look e.g at this:

irb(main):002:0> require 'rubygems'
=> true
irb(main):003:0> require 'json'
=> true
irb(main):004:0> {:a => [*42..84]}.to_json
=> "{\"a\":[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]}"
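Round-tripping is just as simple; note that parsed object keys come back as
strings, so string keys (as in the hashes above) survive unchanged:

```ruby
require 'rubygems'
require 'json'

data = { 'F_x' => [1.23, 1.12, 0.92], 'q_r' => [1.34e+9, 3.89e+8] }

# Serialize to a JSON string and parse it straight back.
back = JSON.parse(data.to_json)

p back['F_x']   # => [1.23, 1.12, 0.92]
```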

HTH
Robert

···

On 5/7/07, Bil Kleb <Bil.Kleb@nasa.gov> wrote:

--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Here, the 2nd realization of the output 'stag_pr', 108.02, corresponds
to the 2nd case and is associated with the 2nd entries in the 'F_x'
and 'q_r' arrays, 1.12 and 3.89e+8, respectively.

Yeah, that's what explains it best for me.

>I'm not sure if what I'm assuming
>about your problem is correct or not, but it intrigues me and I'd like
>to fiddle with the general problem :-).

The more I explain it, the more I learn about it; so thanks
for the interest.

I haven't forgotten about this; I have a couple of ways to manage it
with SQLite, but how you want to deal with the data after running all the
cases could influence the direction.

You said you want to test for convergence after cases are run?
Basically you want to do some calculations using the inputs and outputs
after each case (or N cases) and save those calculations off to the
side until they reach some error limit, etc.? For this, do you want to
just record "After case N my running calculations (f,g,h) over the cases
run so far are x,y,z"?

That is something like:

    Case   Running calc f   Running calc g   Running calc h
       1              1.0              2.0              3.0
      10              2.0              4.0              9.0
     ...
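Running calculations like these can be kept incrementally -- a suggestion on
my part, not something from the thread -- with Welford's online algorithm, so
earlier cases never need re-scanning; the sample values below are invented:

```ruby
# Welford's online algorithm: update mean and variance one sample at a
# time without storing or re-reading previous samples.
class RunningStats
  attr_reader :n, :mean

  def initialize
    @n, @mean, @m2 = 0, 0.0, 0.0
  end

  def add(x)
    @n    += 1
    delta  = x - @mean
    @mean += delta / @n            # new running mean
    @m2   += delta * (x - @mean)   # accumulate sum of squared deviations
  end

  def variance                     # sample variance (n - 1 in denominator)
    @n > 1 ? @m2 / (@n - 1) : 0.0
  end
end

stats = RunningStats.new
[75.23, 76.54, 71.10, 73.80].each { |o| stats.add(o) }
puts "n=#{stats.n} mean=#{stats.mean} var=#{stats.variance}"
```

After each batch of cases, the convergence check then only has to compare the
current mean/variance against the previous batch's values.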

Also, do you do any comparison between experiments? That is, for one
scenario that you have a few thousand cases for with your inputs that
are jiggled; would you do anything with those results in relation to
some other scenario? Or are all scenario/experiments isolated?

enjoy,

-jeremy

···

On Mon, May 07, 2007 at 08:15:05PM +0900, Bil Kleb wrote:

--

Jeremy Hinegardner jeremy@hinegardner.org

Robert Dober wrote:

          user system total real
yaml 0.590000 0.000000 0.590000 ( 0.754097)
json 0.020000 0.000000 0.020000 ( 0.018363)

Looks promising, n'est-ce pas?

Neat. Thanks for the data. For now though, Marshal
is an adequate alternative.

Regards,

···

--
Bil Kleb
http://fun3d.larc.nasa.gov