In my experience, the fastest way to access files (by far) is mmap.
I've written some C extensions to Ruby that use mmap to read files, and
they are dramatically faster than the versions that I wrote using
standard, buffered IO. Everything I have written is for a specific
purpose (e.g., a state machine to read character-delimited files). Is
there any generic Ruby extension that provides access to mmap from
within Ruby?
Following is some code that I copied out of one of my C extensions to
give you a quick idea of what it takes to read using mmap:
#include <fcntl.h>     /* open, read/write flags */
#include <errno.h>     /* error numbers */
#include <unistd.h>    /* close function */
#include <sys/types.h> /* typedefs */
#include <sys/stat.h>  /* stat structures */
#include <sys/mman.h>  /* mmap */
…snip…
/* the csv is passed as a parameter in the function I yanked this from */
size_t len;
char *buff;
char *path = RSTRING(csv)->ptr; /* RSTRING_PTR(csv) on Ruby 1.9 and later */
int fd = open(path, O_RDONLY, 0);
struct stat buffStat;
/* error control code */
if (fd < 0) {
    if (errno == EMFILE || errno == ENFILE) {
        rb_gc(); /* may release unused file descriptors; then retry */
        fd = open(path, O_RDONLY, 0);
    }
    if (fd < 0) rb_sys_fail(path);
}
if (stat(path, &buffStat) < 0) {
    if (errno == EMFILE || errno == ENFILE) rb_gc();
    if (stat(path, &buffStat) < 0) {
        close(fd);
        rb_sys_fail(path);
    }
}
/* here we actually do the work */
len = (size_t)buffStat.st_size;
/* mmap requires MAP_SHARED or MAP_PRIVATE; a flags value of 0 is invalid */
buff = (char *)mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd); /* close the file, mmap doesn't need it open */
if ((void *)buff == MAP_FAILED) rb_sys_fail(path); /* one last check for validity */
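For anyone who wants to try the mapping step outside of a Ruby extension, here is a minimal standalone sketch of the same open/stat/mmap sequence. The name map_file and its out-parameters are made up for illustration and are not from my extension. It uses fstat on the open descriptor rather than stat on the path, and MAP_PRIVATE is an assumption about the sharing mode (either MAP_PRIVATE or MAP_SHARED should do for read-only access).

```c
#include <assert.h>    /* assert */
#include <errno.h>     /* errno */
#include <fcntl.h>     /* open, O_RDONLY */
#include <stddef.h>    /* size_t */
#include <sys/mman.h>  /* mmap, munmap, MAP_FAILED */
#include <sys/stat.h>  /* fstat, struct stat */
#include <unistd.h>    /* close */

/* Hypothetical helper: map a whole file read-only.
 * Returns 0 on success and -1 on failure (errno is set).
 * An empty file is reported as success with *buf_out == NULL and
 * *len_out == 0, because mmap rejects a zero-length mapping. */
int map_file(const char *path, const char **buf_out, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd, saved;

    assert(path != NULL && buf_out != NULL && len_out != NULL);
    *buf_out = NULL;
    *len_out = 0;

    fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    if (fstat(fd, &st) < 0) {
        saved = errno;      /* keep the fstat error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    if (st.st_size == 0) {  /* nothing to map */
        close(fd);
        return 0;
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    saved = errno;
    close(fd);              /* the mapping stays valid after close */
    if (p == MAP_FAILED) {
        errno = saved;
        return -1;
    }

    *buf_out = (const char *)p;
    *len_out = (size_t)st.st_size;
    return 0;
}
```

When you are done with the buffer, release it with munmap((void *)buf, len).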
From here, you just proceed to rifle through the file at a blazing
speed; e.g., something like:
while (position < len) {
    …
    if (rb_block_given_p() && buff[position] == '\n') {
        rb_yield(…);
    }
    position++; /* advance after the test so buff[len] is never read */
}
etc.
Pass the line terminator as a parameter in the function, and yield the
accumulated result in between lines.
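The scan described above, with the terminator passed in and each line handed off as it is found, might look roughly like this in plain C. The names each_line, line_fn, and add_length are hypothetical, and this is a sketch rather than the code from my extension. It uses memchr to find each terminator instead of testing one byte at a time.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* size_t */
#include <string.h>  /* memchr */

/* Hypothetical callback type: receives one line (without its terminator,
 * and NOT NUL-terminated), its length, and a caller-supplied context. */
typedef void (*line_fn)(const char *line, size_t len, void *ctx);

/* Walk buf[0..len) and call fn once per line, splitting on term.
 * A final line without a trailing terminator is still reported.
 * Returns the number of lines seen. */
size_t each_line(const char *buf, size_t len, char term,
                 line_fn fn, void *ctx)
{
    size_t start = 0, count = 0;

    assert(fn != NULL && (buf != NULL || len == 0));

    while (start < len) {
        const char *line = buf + start;
        const char *end = memchr(line, term, len - start);
        size_t n = end ? (size_t)(end - line) : len - start;

        fn(line, n, ctx);
        count++;
        start += n + 1;  /* skip past the terminator */
    }
    return count;
}

/* Example callback: adds each line's length to a size_t total. */
void add_length(const char *line, size_t len, void *ctx)
{
    (void)line;
    *(size_t *)ctx += len;
}
```

Inside a Ruby extension, the callback is where the yield would go, e.g. something along the lines of rb_yield(rb_str_new(line, len)).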
The only problem here is that mmap is part of the POSIX standard, so
it will work fine under Unix, but I have no idea how well (or even
how, for that matter) it will work on Windows.
I hope this was useful,
Best,
Dave
On Thursday, April 10, 2003, at 12:03 PM, Jim Freeze wrote:
Hello:
We’re having a little shoot out here at work with Ruby, Perl and Tcl.
So far, Ruby kicked on a recursive Fibonacci(sp?) sequence with
Perl about 50% slower and Tcl 10x slower.
Next we’re looking at IO. So far, Perl is about as fast
as cp and Ruby is 50% slower and consumes over twice the
CPU (see table below):
        user     sys      elapsed   CPU
ruby    60.07u   21.32s   1:31.62   88.8%
cp       0.01u    6.64s   0:53.34   12.4%
perl    16.79u    7.66s   0:58.76   41.6%
Oh, the Tcl results? The code is still being written.
The Ruby code is below. I know there have been multiple
posts on ruby-talk about this with discussions about
sysread and read, but not being an IO expert, I have
only been able to follow these at a high level.
Could some expert look at the code below and tell me
what could be done to speed it up, possibly
using #read or #sysread? Or, if someone has some Ruby C
code, that would be cool.
#------ rw.rb
file = ARGV.shift
File.open(file + ".out", "w") { |of|
  File.open(file).each { |line|
    # do processing here
    of.print line
  }
}
#-------
Thanks
–
Jim Freeze
Different all twisty a of in maze are you, passages little.
David King Landrith
(w) 617.227.4469x213
(h) 617.696.7133
One useless man is a disgrace, two
are called a law firm, and three or more
become a congress – John Adams
public key available upon request