Hello,
I have some big files with a lot of "unsigned int" (4-byte) numbers, and I
want to read from and write to these files.
Currently, I found this to write:
myfile << [mynum].pack("i")
and to read:
mynum = myfile.read(4).unpack("i").first
I wonder if there is something faster/simpler that avoids converting the
number into an array, then into a string, just to serialize it.
Thank you.
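A side note on the directive itself: 'i' packs a signed, native-endian int, while the values here are unsigned. The fixed-width unsigned directives 'V' (32-bit little-endian) and 'N' (32-bit big-endian) also pin down the byte order, which keeps the files portable. A minimal sketch of the same write/read using 'V':
myfile << [mynum].pack('V')                # write mynum as an unsigned 32-bit little-endian value
mynum = myfile.read(4).unpack('V').first   # read it back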
I wrote a function to do this which seems slightly faster, but could
perhaps stand some optimization:
def pack_int32(n)
  str = ' ' * 4                 # 4-byte buffer (Ruby 1.8 String#[]= accepts an integer byte value)
  str[3] = (n >> 24) & 0xff     # most significant byte last: little-endian
  str[2] = (n >> 16) & 0xff
  str[1] = (n >> 8) & 0xff
  str[0] = n & 0xff
  str
end
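For reading those values back, a matching helper (a sketch, not from the original post, assuming the same little-endian byte order as pack_int32) could be:
def unpack_int32(str)
  b0, b1, b2, b3 = str.unpack('C4')   # 'C4' = four unsigned 8-bit bytes
  b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
end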
Here are the benchmark results vs the other methods mentioned:
                   user     system      total        real
.pack(i):      6.234000   0.235000   6.469000 (  6.500000)
pack_int32:    5.719000   0.015000   5.734000 (  5.734000)
Marshal.dump:  6.594000   0.219000   6.813000 (  6.813000)
I included Marshal.dump for completeness, but agree that it doesn't
appear to be meant for this sort of thing. Here's the source to run
the benchmark:
require 'benchmark'

number = 2_000_000
n = 1_000_000

Benchmark.bm(12) do |x|
  x.report('.pack(i):')     { n.times do; [number].pack('i');   end }
  x.report('pack_int32:')   { n.times do; pack_int32(number);   end }
  x.report('Marshal.dump:') { n.times do; Marshal.dump(number); end }
end
Adam
···
Do you have to deal with each number individually? Maybe you could
build up an array of numbers and then pack them all at once:
arr = []
while work_to_do do
  mynum = generate_next_number
  arr << mynum
end
myfile.write arr.pack('i*')
That way you aren't creating a new array for each number.
Similarly, for reading the file:
data = file.read
num_array = data.unpack('i*')
The '*' in (un)pack means to process the rest of the data in the same
way.
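Since the files are big, slurping everything with file.read can use a lot of memory. One possible variation (a sketch, not from the post; the chunk size and filename are arbitrary) is to read and unpack in fixed-size chunks:
File.open('example.bin', 'rb') do |f|
  # 64 KB chunks; 65536 is a multiple of 4, so each chunk holds whole integers
  while chunk = f.read(65536)
    chunk.unpack('i*').each do |num|
      # process num here
    end
  end
end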
···
irb(main):001:0> f=open('test','w')
=> #<File:test>
irb(main):002:0> f<<[65535].pack('i')
=> #<File:test>
irb(main):003:0> f.tell
=> 4
irb(main):004:0> f<<[720850].pack('i')
=> #<File:test>
irb(main):005:0> f.tell
=> 9
The integer 720850 takes 5 bytes in my file, but it should take only 4
bytes! How can I fix this? Thanks!
Using only the number 2_000_000 seems to skew the results. I see your
results with your test, but if I change it slightly to use a variety
of integers, I get more balanced results:
require 'benchmark'
MAX = 2**30
n = 1_000_000
nums = (0..n).map{ (rand*MAX).to_i }
              user     system      total        real
pack(i):  5.687000   0.125000   5.812000 (  5.875000)
pack32:   5.141000   0.016000   5.157000 (  5.188000)
Dump:     6.000000   0.078000   6.078000 (  6.141000)
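The report block itself isn't shown above; presumably it loops over nums instead of packing one constant, along these lines (the exact loop is an assumption, labels taken from the results above):
Benchmark.bm(12) do |x|
  # assumed shape of the modified benchmark: pack each of the varied integers
  x.report('pack(i):') { nums.each { |num| [num].pack('i') } }
  x.report('pack32:')  { nums.each { |num| pack_int32(num) } }
  x.report('Dump:')    { nums.each { |num| Marshal.dump(num) } }
end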
···
irb
irb(main):001:0> x = [720850].pack('i')
=> "\322\377\n\000"
irb(main):002:0> x.length
=> 4
So the integer 720850 is clearly packed into 4 bytes as requested. Why does
it occupy 5 bytes in the file? Look at the "\n" at position 2: the third
byte is a newline character, and on Windows, in text-mode files, Ruby turns
each newline into CRLF, i.e. 2 bytes. Since your file holds binary data, you
don't want a text-mode file, so you must open the file with the "b" flag in
addition to "w":
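For example (a sketch; 'test' is the same throwaway filename used above):
f = open('test', 'wb')     # "b" = binary mode, so no CRLF translation
f << [720850].pack('i')
f.tell                     # => 4, as expected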