Ruby to approximate 'file'?

Is there a file-type detection script available for Ruby, similar to
the unix ‘file’ program? At a minimum, is there a way in Ruby to
reliably tell the difference between a binary file format and a text
file format?

I’m working on a script which will run ‘tidy’ recursively against all
files in a directory structure – but it should only run on files
which are not binary. (I use a Windows-based CMS, City Desk, that
doesn’t do XHTML; I’m going to use Tidy to convert the non-XHTML
parts to XHTML during the publish process.)

I don’t want to make the detection extension-based, as there’s
already a wide variety of possible extensions that could be used
(most of them outside of my own application, but I plan on making
this publicly available, possibly with a Ruby-ized Tidy) – from htm
and html to cfm and asp to xml and css.

Ideally, I’d be able to come up with a MIME-type for the file and
create callbacks based on the MIME-type and/or extension (I may wish
to make an HTML-fragment which calls Tidy with the show-body-only
option set).
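
A rough sketch of what such a callback table could look like. Every name here (MIME_BY_EXTENSION, on_mime, dispatch) is made up for illustration, the extension map is only a stand-in for real MIME detection, and the handlers just return a description of what they would do instead of calling Tidy:

```ruby
# Hypothetical sketch: none of these names come from an existing library.
# Stand-in for real detection: map a file extension to a MIME type.
MIME_BY_EXTENSION = {
  ".htm"  => "text/html",
  ".html" => "text/html",
  ".xml"  => "application/xml",
  ".css"  => "text/css",
}.freeze

# MIME type => callback.
HANDLERS = {}

# Register a block to run for files of the given MIME type.
def on_mime(type, &handler)
  HANDLERS[type] = handler
end

# Look up the MIME type by extension and run the matching handler.
# Returns nil when the extension or the MIME type has no entry.
def dispatch(path)
  type = MIME_BY_EXTENSION[File.extname(path).downcase]
  handler = HANDLERS[type]
  handler && handler.call(path)
end

# Placeholder callbacks; a real one would invoke Tidy here.
on_mime("text/html") { |path| "tidy #{path}" }
on_mime("text/css")  { |path| "skip #{path}" }
```

An HTML-fragment type could get its own entry and its own callback in the same table.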

I can get my local version working easily – it’s predictable. I’m
just looking at doing this in a “larger” way…

-austin
– Austin Ziegler, austin@halostatue.ca on 2002.10.05 at 01.36.45

> Is there a file-type detection script available for Ruby, similar to
> the unix ‘file’ program? At a minimum, is there a way in Ruby to
> reliably tell the difference between a binary file format and a text
> file format?

Here’s something in Ruby that I wrote to do that. It’s based on the
algorithm that Perl uses to determine “textness” and “binariness”:

module FileTest

  private

  def self.isText(block)
    return (block.count("^ -~", "^\b\f\t\r\n") < (block.size / 3.0) &&
            block.count("\x00") < 1)
  end

  public

  # The textfile? and binaryfile? methods are not inverses of each
  # other. Both return false if the item is not a file, or if the
  # item is a zero-length file. The "textness" or "binariness" of
  # an item can only be determined if it's a file that contains at
  # least one byte.

  def self.textfile?(item)
    size = self.size(item)
    blksize = File.stat(item).blksize
    if size < 1 then
      return false
    end
    begin
      open(item) { |file|
        block = file.read(blksize < size ? blksize : size)
        return self.isText(block)
      }
    rescue
      return false
    end
  end

  def self.binaryfile?(item)
    size = self.size(item)
    blksize = File.stat(item).blksize
    if size < 1 then
      return false
    end
    begin
      open(item) { |file|
        block = file.read(blksize < size ? blksize : size)
        return !self.isText(block)
      }
    rescue
      return false
    end
  end
end

[ … ]


Lloyd Zusman
ljz@asfast.com
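
For the recursive case in the original question, the same check could be combined with the standard find library along these lines. This is only a sketch: text_block? repeats the heuristic from isText above, while each_text_file and the 4096-byte sample size are assumptions of mine, not part of the posted code.

```ruby
require "find"

# Same heuristic as isText above: a block looks like text when fewer than
# a third of its bytes are non-printable and it contains no NUL byte.
def text_block?(block)
  block.count("^ -~", "^\b\f\t\r\n") < (block.size / 3.0) &&
    block.count("\x00") < 1
end

# Hypothetical helper: yield every regular, non-empty file under root
# whose first chunk passes text_block?. Files are read in binary mode so
# the bytes are examined as-is.
def each_text_file(root, chunk = 4096)
  Find.find(root) do |path|
    next unless File.file?(path) && File.size(path) > 0
    head = File.open(path, "rb") { |f| f.read(chunk) }
    yield path if text_block?(head)
  end
end

# Example use: list the files that would be handed to tidy.
# each_text_file("site") { |path| puts path }
```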

Sat, 5 Oct 2002 15:38:40 +0900, Lloyd Zusman ljz@asfast.com wrote:

> # The textfile? and binaryfile? methods are not inverses of each
> # other. Both return false if the item is not a file, or if the
> # item is a zero-length file.

In Perl both are true for an empty file.


__("< Marcin Kowalczyk
__/ qrczak@knm.org.pl
^^ Blog człowieka poczciwego.
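
To make the difference concrete, here is a small sketch of the Perl convention, where an empty file counts as both text and binary. The method name perl_style_flags and the 4096-byte sample are my own assumptions; the one-third and NUL rules are the ones from the posted code.

```ruby
# Returns a [text, binary] pair of booleans for the file at path.
# Following the Perl convention described above, an empty file is
# reported as both text and binary.
def perl_style_flags(path)
  return [true, true] if File.zero?(path)

  # Sample the start of the file in binary mode.
  block = File.open(path, "rb") { |f| f.read(4096) }

  # Bytes that are neither printable ASCII nor common whitespace/control.
  odd = block.count("^ -~", "^\b\f\t\r\n")
  text = odd < block.size / 3.0 && !block.include?("\x00")
  [text, !text]
end
```

For non-empty files the two flags are always opposites; only the empty case differs from the version that returns false for both.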

Austin Ziegler austin@halostatue.ca wrote in message news:20021005055015.CIXT20369.tomts5-srv.bellnexxia.net@hogwarts

> Is there a file-type detection script available for Ruby, similar to
> the unix ‘file’ program? At a minimum, is there a way in Ruby to
> reliably tell the difference between a binary file format and a text
> file format?

There’s a native win32 port of ‘file’, if that helps:

martin

“Lloyd Zusman” ljz@asfast.com wrote in message
news:m28z1d4cgj.fsf@asfast.com

[…code snip]

def self.textfile?(item)
  size = self.size(item)
  blksize = File.stat(item).blksize
                            ^^^^^^^
  if size < 1 then
    return false
  end

[…code snip]

File.stat(item).blksize returns 0 in ruby 1.7.2 (2002-07-02) [i386-mswin32],
so I modified it to:

 blksize = File.stat(item).blksize
 blksize = 2048 if (blksize == 0)

Hope I did the right thing …
Thanks,

– shanko
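
The fallback above could also be pulled into a small helper so both methods share it. A sketch only: sample_size, sample_size_for and DEFAULT_CHUNK are hypothetical names, and the nil check is there because File::Stat#blksize is documented to return nil on platforms that do not provide the value.

```ruby
# Same fallback value as in the message above.
DEFAULT_CHUNK = 2048

# Decide how many bytes to sample, given the reported block size (which
# may be nil or 0 where the platform does not expose it) and the file size.
def sample_size(blksize, size, default = DEFAULT_CHUNK)
  blksize = default if blksize.nil? || blksize <= 0
  [blksize, size].min
end

# Convenience wrapper that takes the values from File.stat.
def sample_size_for(path)
  stat = File.stat(path)
  sample_size(stat.blksize, stat.size)
end
```

With this, file.read(sample_size_for(item)) would replace the blksize < size ? blksize : size expression.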

Thanks. That’ll be a good start to make sure that I don’t run tidy on
my JPGs (:

-austin
– Austin Ziegler, austin@halostatue.ca on 2002.10.05 at 11.59.53
