Marshal formats

For certain Unicode operations I need to store and read a fairly large
amount of data. Currently I'm using Marshal.dump to generate the data,
but I've got the impression that the Marshal format differs not only
between Ruby versions, but also between platforms.

I've tried to use YAML and Ruby source files to store the data, but
this results in very large files and reading them takes forever.

I've also considered install-time generation of the data, but that has
some practical problems for distribution.

Is there another marshal format I've missed or is there a better way to
do this?

Manfred Stienstra wrote:

> For certain Unicode operations I need to store and read a fairly large
> amount of data. Currently I'm using Marshal.dump to generate the data,
> but I've got the impression that the Marshal format isn't just
> different on different Ruby versions, but also on the various
> platforms.

Marshal format should be identical across platforms. (Otherwise, drb wouldn't be very useful.)

The marshal code has some degree of backwards compatibility:

$ ri Marshal | cat
--------------------------------------------------------- Class: Marshal
      The marshaling library converts collections of Ruby objects into a
      byte stream, allowing them to be stored outside the currently
      active script. This data may subsequently be read and the original
      objects reconstituted. Marshaled data has major and minor version
      numbers stored along with the object information. In normal use,
      marshaling can only load data written with the same major version
      number and an equal or lower minor version number. If Ruby's
      ``verbose'' flag is set (normally using -d, -v, -w, or --verbose)
      the major and minor numbers must match exactly. Marshal versioning
      is independent of Ruby's version numbers. You can extract the
      version by reading the first two bytes of marshaled data.

          str = Marshal.dump("thing")
          RUBY_VERSION #=> "1.8.0"
          str[0] #=> 4
          str[1] #=> 8

      Some objects cannot be dumped: if the objects to be dumped include
      bindings, procedure or method objects, instances of class IO, or
      singleton objects, a TypeError will be raised. If your class has
      special serialization needs (for example, if you want to serialize
      in some specific format), or if it contains objects that would
      otherwise not be serializable, you can implement your own
      serialization strategy by defining two methods, _dump and _load:
      The instance method _dump should return a String object containing
      all the information necessary to reconstitute objects of this
      class and all referenced objects up to a maximum depth given as an
      integer parameter (a value of -1 implies that you should disable
      depth checking). The class method _load should take a String and
      return an object of this class.
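To make the _dump/_load part concrete, here is a minimal sketch. The class CodepointTable and its contents are made up for illustration and are not from the original post. It packs the codepoints as 32-bit big-endian integers, so the payload does not depend on the byte order of the machine that wrote it.

```ruby
# Hypothetical class, for illustration only.
class CodepointTable
  attr_reader :codepoints

  def initialize(codepoints)
    @codepoints = codepoints
  end

  # Called by Marshal.dump. The depth argument is ignored because the
  # payload is a flat list of integers.
  def _dump(_depth)
    @codepoints.pack("N*") # 32-bit unsigned, big-endian
  end

  # Called by Marshal.load with the String that _dump returned.
  def self._load(data)
    new(data.unpack("N*"))
  end
end

table = CodepointTable.new([0x41, 0xE9, 0x1F600])
copy  = Marshal.load(Marshal.dump(table))
puts copy.codepoints == table.codepoints
```

Whether this ends up smaller or faster than letting Marshal dump a plain Array is something you would have to measure on your own data.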

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Joel VanderWerf wrote:

> Marshal format should be identical across platforms. (Otherwise, drb
> wouldn't be very useful.)

Oops, yes, I guess you're right. Just remember to open the Marshal file
in binary mode on Windows (:
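
For the archives, a rough sketch of what that looks like. The file name and the sample data are placeholders, not my real tables. The "wb" and "rb" modes switch off newline translation on Windows, and the first two bytes of the file are the Marshal major and minor version mentioned in the ri output. On Ruby 1.9 and later str[0] returns a one-character String instead of an Integer, so the sketch reads the header with String#bytes.

```ruby
require "tmpdir"

# Placeholder data: one decomposition mapping, keyed by codepoint.
data = { 0x00E9 => [0x0065, 0x0301] }

header = nil
loaded = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "tables.dump") # hypothetical file name

  # Write in binary mode so no newline translation happens on Windows.
  File.open(path, "wb") { |f| Marshal.dump(data, f) }

  File.open(path, "rb") do |f|
    # The first two bytes are the Marshal major and minor version.
    header = f.read(2).bytes
    major, minor = header
    unless major == Marshal::MAJOR_VERSION && minor <= Marshal::MINOR_VERSION
      raise TypeError, "unsupported Marshal format #{major}.#{minor}"
    end

    f.rewind # Marshal.load expects to see the version bytes itself
    loaded = Marshal.load(f)
  end
end

puts loaded == data
```

Marshal.load performs its own version check, so the explicit one is only there to give a clearer error message before loading.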