Test if file is binary?

Hi,

How can I test whether a file is binary or not?

There is nothing like File.binary?:
NoMethodError: undefined method `binary?' for File:Class

Any ideas, or libraries available?

Regards, Gilbert

What do you need to achieve with this is_binary? method?
All files are just collections of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.

If I really needed it, I'd probably use a heuristic based on the
distribution of byte values across an initial portion of the file.
Something like this:

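class File
  # Heuristic: sample the first 1024 bytes and compare the share of
  # control bytes and high-bit bytes against printable ASCII.
  def self.binary?(name)
    ascii = control = binary = 0

    # read returns nil for an empty file, hence the fallback to "".
    sample = File.open(name, "rb") {|io| io.read(1024)} || ""
    sample.each_byte do |bt|
      case bt
        when 0...32
          control += 1
        when 32...128
          ascii += 1
        else
          binary += 1
      end
    end

    control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
  end
end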

Kind regards

robert

gem install ptools
require 'ptools'
File.binary?(file)

Regards,

Dan

Hi,

-----Original Message-----
From: dima [mailto:dejan.dimic@gmail.com]
Sent: Tuesday, August 21, 2007 8:50 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

> What do you need to achieve with this is_binary? method?
> All files are just collections of bytes, so from one perspective they
> are all binary. We interpret them as suits our needs.

For example, this information is needed to decide whether
CVS should handle a given file / file extension as binary or ASCII.

Regards, Gilbert

Hi,

-----Original Message-----
From: Robert Klemme [mailto:shortcutter@googlemail.com]
Sent: Tuesday, August 21, 2007 9:05 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

Nice :) Thanks!!

Regards, Gilbert

* Robert Klemme <shortcutter@googlemail.com> (09:04) wrote:

> If I'd really need it I'd probably do a heuristic based on
> distribution of byte values across an initial portion of the file.

That only shows how many non-ASCII characters are used. It won't
recognise Russian script in UTF-8 as text, or uuencoded data as binary.

What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
is text if it's logically organized in short lines, and EOL characters
are used only for ending lines.

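class File
  # Text means short lines, with CR appearing only as part of CRLF.
  def self.binary?(name)
    cr, len, mlen = false, 0, 0
    # read returns nil for an empty file, hence the fallback to "".
    sample = File.open(name, "rb") {|io| io.read(1024)} || ""
    sample.each_byte do |bt|
      # A CR that is not followed by LF does not end a line: binary.
      return true if cr and bt != 10
      cr = false
      case bt
        when 13
          cr = true
        when 10
          mlen = len if len > mlen
          len = 0
        else
          len += 1
      end
    end
    mlen > 1000
  end
end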

I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the limit of the tool that processes the file is
what matters here. That tool is the right place to do the check anyway.

mfg, simon .... l

Don't forget the possibility that a file is encoded in UTF-16 or
UTF-32. To recognize such textual data you need an extra recognition
step in front of the rest.
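
One possible shape for such a first step, as a minimal sketch only: look for a byte order mark (BOM) before running any byte-counting heuristic. The names BOMS and bom_encoding are made up for this illustration and are not part of Ruby's File API. A BOM is optional, so UTF-16 or UTF-32 text without one is not caught by this check.

```ruby
# Hypothetical helper: guesses a Unicode encoding from a byte order
# mark (BOM) at the very start of the file.
BOMS = [
  # UTF-32 is listed before UTF-16 because the UTF-32LE BOM begins
  # with the same two bytes as the UTF-16LE BOM.
  ["\x00\x00\xFE\xFF".b, "UTF-32BE"],
  ["\xFF\xFE\x00\x00".b, "UTF-32LE"],
  ["\xEF\xBB\xBF".b,     "UTF-8"],
  ["\xFE\xFF".b,         "UTF-16BE"],
  ["\xFF\xFE".b,         "UTF-16LE"],
].freeze

# Returns the encoding name as a String, or nil when no BOM is found.
def bom_encoding(name)
  # read returns nil for an empty file, hence the fallback.
  head = File.open(name, "rb") { |io| io.read(4) } || "".b
  match = BOMS.find { |bom, _| head.start_with?(bom) }
  match && match[1]
end
```

A caller could treat any non-nil result as text and fall through to one of the heuristics in this thread otherwise. Note the ambiguity: UTF-16LE text that starts with a BOM followed by U+0000 looks like UTF-32LE to this check.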

Wolfgang WoNáDo

One simple approach is this:

class File
   def is_binary?
     ascii = 0
     total = 0
     self.read(1024).each_byte{|c| total += 1; ascii +=1 if c >= 128 or c == 0}
     ascii.to_f / total.to_f > 0.33 ? true : false
   end
end

You can tweak the 0.33 value if you like. Probably better (i.e. more robust) ways out there though.

Alex Gutteridge

Bioinformatics Center
Kyoto University

Hi,

I'm impressed by the solutions of Alex and Robert. Anyway I
suppose in most cases a test on one single null character
will suffice. Something like this:

   class File
     def binary?
       while (b=f.read(256)) do
         return true if b[ "\0"]
       end
     end
   end

Still, I recommend first considering whether you are going to read
the file later anyway. In that case you can abort reading as soon as
the file fails a more sophisticated file-type check.

Dividing files into "text" and "binary" is the archetypal
misdesign in the operating system you use. (Is there
anything designed well (besides Outlook, of course)?) The
distinction doesn't refer to the file's _contents_ but to how
the file is _treated_ when it is being read or written. In
"rb"/"wb" modes files are left as they are; in "r"/"w"
modes Windows programmers get line ends "\r\n" translated
into "\n", which disturbs file positions and string lengths.
I think the only purpose of this is to deter programmers
from doing anything a non-Microsoft way.
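
The effect on string lengths can be made visible on any platform with a small sketch. The helper name read_both_ways is invented for this illustration. It uses mode "rt", which asks Ruby for newline translation explicitly; on Unix-like systems plain "r" leaves line endings alone, so "rt" stands in here for what "r" does on Windows.

```ruby
# Hypothetical helper: reads the same file once in binary mode and
# once in text mode, so the two results can be compared.
def read_both_ways(path)
  raw  = File.open(path, "rb") { |io| io.read }  # bytes exactly as stored
  text = File.open(path, "rt") { |io| io.read }  # "\r\n" folded into "\n"
  [raw, text]
end

# Example use (assumes a file written with DOS line endings):
#   File.binwrite("crlf-demo.txt", "a\r\nb\r\n")
#   raw, text = read_both_ways("crlf-demo.txt")
#   raw.bytesize and text.bytesize then differ by one per line ending.
```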

Bertram

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

Simon Krahnke wrote:

> I chose 1000 as the maximum line length, to fit whole paragraphs in one
> line. But of course the limit of the tool that processes the file is
> what matters here. That tool is the right place to do the check anyway.

That's why ClearCase (on Windows) claimed my pure-ASCII XML file was
non-text (and refused to check it in). One line exceeded 8000 characters.

This is on my personal list of 'bad practices', but it may be
appropriate to others.

My 0.02EUR

Stefan

Sorry for the duplicate! Robert is too fast for me.

Alex Gutteridge

Bioinformatics Center
Kyoto University

Hi,

   class File
     def binary?
       while (b=f.read(256)) do
         return true if b[ "\0"]
       end
     end
   end

This is a blunder, of course. Some better ones:

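  # Variant 1: scan the file in 256-byte chunks for a NUL byte.
  def File.binary? name
    open name, "rb" do |f|
      while (b=f.read(256)) do
        return true if b[ "\0"]
      end
    end
    false
  end

  # Variant 2: the same test, byte by byte.
  def File.binary? name
    open name, "rb" do |f|
      f.each_byte { |x|
        x.nonzero? or return true
      }
    end
    false
  end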

Just to be correct.

Bertram

* Stefan Mahlitz <stefan@mahlitz-net.de> (22:40) wrote:

> That's why clearcase (on windows) claimed my pure-ascii xml-file was
> non-text (and did refuse to check it in). One line exceeded 8000 characters.

You can't seriously treat a file with lines longer than 8000 characters
as line oriented. It's far from being readable by a human. You declare
that file as application/xml.

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

> This is on my personal list of 'bad practices', but it may be
> appropriate to others.

I think it's bad practice to declare something with huge lines as text.

mfg, simon .... l

It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic. Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from "file"). :)
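
Such a comparison could be sketched roughly as follows. The method names distribution_binary?, ratio_binary? and disagreements are invented for this illustration; the thresholds are the ones posted in this thread, and both heuristics are rewritten as plain methods on a byte sample rather than as File methods.

```ruby
# The byte-distribution heuristic: control and high-bit bytes are
# weighed against printable ASCII.
def distribution_binary?(sample)
  ascii = control = binary = 0
  sample.each_byte do |bt|
    case bt
    when 0...32   then control += 1
    when 32...128 then ascii += 1
    else               binary += 1
    end
  end
  control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
end

# The ratio heuristic: share of NUL and high-bit bytes in the sample.
def ratio_binary?(sample)
  return false if sample.empty?
  suspicious = sample.each_byte.count { |c| c >= 128 || c == 0 }
  suspicious.to_f / sample.bytesize > 0.33
end

# Returns the paths on which the two heuristics give different answers.
def disagreements(paths)
  paths.select do |path|
    sample = File.open(path, "rb") { |io| io.read(1024) } || "".b
    distribution_binary?(sample) != ratio_binary?(sample)
  end
end
```

Something like disagreements(Dir.glob("**/*").select { |p| File.file?(p) }) would then list the files worth a closer look. One case that already splits the two: a file of very short lines has many newline bytes, which the distribution heuristic counts as control characters.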

Btw, you should get rid of the ternary operator - it's totally
superfluous because there is no point in converting a boolean value
into a boolean value. :)

Kind regards

robert

Simon Krahnke wrote:

> You can't seriously treat a file with lines longer than 8000 characters
> as line oriented. It's far from being readable by a human. You declare
> that file as application/xml.

Maybe this was a bad example. You are right, the XML file would be best
treated by ClearCase as application/xml or text/xml. This did not work
(and I was bitten by this recently, so this strange behaviour was fresh
when I read your email).

But I cannot see the problem with text files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters, why
shouldn't this be text?

Why do you think it is not readable?

> One small change in that line will produce a patch of more than 8000
> characters. And if that change is at the end of the line the diff tool
> will have to use 4 pages of memory for the compare.

Sorry, I fail to see your point. Are we really judging whether a file is
text by how many memory pages a diff will take or how many characters a
patch has?

I couldn't find a definition of text except that text means absence of
binary data. This is weak, so I would follow your definition: a text
file is a file which can be read by a human.

> I think it's bad practice to declare something with huge lines as text.

Well, I disagree.

But to get (slightly at least) on topic again: if I had to detect
whether a file is text, I would go with a combination of Robert Klemme's
and Bertram Scharpf's solutions.
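
A combination along those lines might look like the sketch below. It is one reading of the idea, not code posted by either author: the name probably_binary? and the sample_size parameter are invented here, the NUL test is applied to the sample only (the posted version scans the whole file), and the two ratios are the ones from the byte-distribution heuristic.

```ruby
# Hypothetical combination: any NUL byte means binary; otherwise fall
# back to the ratios of control and high-bit bytes to printable ASCII.
def probably_binary?(name, sample_size = 1024)
  # read returns nil for an empty file, hence the fallback.
  sample = File.open(name, "rb") { |io| io.read(sample_size) } || "".b
  return false if sample.empty?
  return true if sample.include?("\0")   # the NUL test

  control = sample.each_byte.count { |b| b < 32 }
  high    = sample.each_byte.count { |b| b >= 128 }
  ascii   = sample.bytesize - control - high
  # Float division: with no printable bytes at all the ratio is
  # Infinity, so such a sample is reported as binary.
  control.to_f / ascii > 0.1 || high.to_f / ascii > 0.05
end
```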

Stefan

/*
It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic.
*/

You mean it should be something like this?

class File
   def self.is_binary?(name)
     ascii = total = 0
     File.open(name, "rb") { |io| io.read(1024) }.each_byte do |c|
       total += 1
       ascii += 1 if c >= 128 or c == 0
     end
     ascii.to_f / total.to_f > 0.33
   end
end

/*
Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from "file"). :)
*/

Is there an existing standard for what is considered a binary file,
meaning a rule like: check the first block of a file and

- if control characters (ASCII 0-32) and "high ASCII" (> 128) make up
  more than 30 %, it's considered a binary file, otherwise a text file

- if no control characters (ASCII 0-32) or "high ASCII" (> 128) are
  found at all, it's always considered a text file

??
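
For illustration, the rule as asked could be written down like this. It is a sketch of the question, not of any standard: the names WHITESPACE, suspicious_share and binary_by_share? are invented, the 30 % threshold comes from the question, and tab, LF and CR are deliberately not counted as control characters (an assumption added here, since nearly every text file contains them).

```ruby
# Control bytes that are normal in text files: tab, LF, CR.
WHITESPACE = [9, 10, 13].freeze

# Share of control and high-bit bytes in the sample, between 0.0 and 1.0.
def suspicious_share(sample)
  return 0.0 if sample.empty?
  odd = sample.each_byte.count do |b|
    (b < 32 && !WHITESPACE.include?(b)) || b >= 128
  end
  odd.to_f / sample.bytesize
end

# Binary if more than `threshold` of the first block looks suspicious.
# A share of exactly 0 is therefore always reported as text.
def binary_by_share?(name, threshold = 0.30)
  sample = File.open(name, "rb") { |io| io.read(1024) } || "".b
  suspicious_share(sample) > threshold
end
```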

Regards, Gilbert

-----Original Message-----
From: Robert Klemme [mailto:shortcutter@googlemail.com]
Sent: Tuesday, August 21, 2007 9:41 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

* Stefan Mahlitz <stefan@mahlitz-net.de> (09:25) wrote:

> Maybe this was a bad example. You are right, the xml-file would be best
> treated by clearcase as application/xml or text/xml. This did not work
> (and I was bitten by this recently - so this strange behaviour was fresh
> when I read your email).

Note that Subversion would just treat the file as binary and process it
with its binary diff.

> But I cannot see the problem with text-files containing long lines. If I
> write a single paragraph with more than 1000 or 8000 characters - why
> shouldn't this be text?

If that's really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it's 100 lines!)

> Why do you think it is not readable?

I think that an XML file that has huge lines is unreadable, since a
human wouldn't recognize any structure when all the elements are on a
single line.

>> One small change in that line will produce a patch of more than 8000
>> characters. And if that change is at the end of the line the diff tool
>> will have to use 4 pages of memory for the compare.
>
> Sorry, I fail to see your point.

That's another point. Or actually two. The hard one is the limitation
of the software used: it may have a maximum line length.

> Are we really judging whether a file is text by how much memory pages
> a diff will take or how many characters a patch has?

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text usually is well suited, as is anything else that is line oriented
and where typical changes affect only one or a few neighboring lines.

mfg, simon .... l

What's the heuristic in Subversion?

-- fxn

# Is there an existing standard what is considered as a binary file,

If you're on a *nix (non-Windows) box, you should use the OS's file command and just wrap it in Ruby:

irb(main):022:0> def is_bin(f)
irb(main):023:1> %x(file #{f}) !~ /text/
irb(main):024:1> end
=> nil
irb(main):025:0> is_bin "test.rb"
=> false
irb(main):026:0> is_bin "test.txt"
=> false
irb(main):027:0> is_bin "/usr/local/bin/dnscache"
=> true
irb(main):028:0> is_bin "/bin/ps"
=> true
irb(main):029:0> def is_text(f)
irb(main):030:1> %x(file #{f}) =~ /text/
irb(main):031:1> end
=> nil
irb(main):032:0> is_text "test.rb"
=> 27
irb(main):033:0> is_text "test.txt"
=> 16
irb(main):034:0> is_text "/usr/local/bin/dnscache"
=> nil
irb(main):035:0> is_text "/bin/ps"
=> nil
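
A slightly more defensive variant of the same idea, as a sketch: the helper names file_description and text_file? are invented here. It passes the path as a separate argument instead of interpolating it into a shell string, so spaces or shell metacharacters in file names are harmless, and it uses file's --brief option so that the file name itself is not part of the output being matched (in the session above, a binary named "context.bin" would match /text/ through its name alone).

```ruby
# Asks file(1) for a description of the path without going through a
# shell. Returns nil when the file command is not installed.
def file_description(path)
  IO.popen(["file", "--brief", "--", path], &:read).to_s.strip
rescue Errno::ENOENT
  nil
end

# true/false according to file(1), or nil when file(1) is unavailable.
def text_file?(path)
  desc = file_description(path)
  desc && desc.match?(/text/)
end
```

This still depends on the wording of file's output, so it inherits whatever that tool considers "text" on the system at hand.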

kind regards -botp

···

From: Rebhan, Gilbert [mailto:Gilbert.Rebhan@huk-coburg.de]