Test if file is binary?

Yep. But I'd leave the "is_" out - that's handled by the "?" already.

Cheers

robert

···

2007/8/21, Rebhan, Gilbert <Gilbert.Rebhan@huk-coburg.de>:

-----Original Message-----
From: Robert Klemme [mailto:shortcutter@googlemail.com]
Sent: Tuesday, August 21, 2007 9:41 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

2007/8/21, Alex Gutteridge <alexg@kuicr.kyoto-u.ac.jp>:
> Sorry for the duplicate! Robert is too fast for me.

/*
It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic.
*/

You mean it should be something like this?

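class File
  # Heuristic: treat the file as binary if more than a third of the
  # first 1024 bytes are NUL or outside the 7-bit ASCII range.
  def self.is_binary?(name)
    sample = File.open(name, "rb") { |io| io.read(1024) }
    # io.read returns nil for an empty file; assumption: empty is not binary
    return false if sample.nil? || sample.empty?

    non_ascii = sample.each_byte.count { |c| c >= 128 || c == 0 }
    non_ascii.to_f / sample.bytesize > 0.33
  end
end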

It also disables newline normalization (which may or may not be an issue in that case).

-- fxn

···

On Aug 22, 2007, at 1:35 PM, Simon Krahnke wrote:

* Stefan Mahlitz <stefan@mahlitz-net.de> (09:25) wrote:

Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).

Note that Subversion would just treat the file as binary and process it
with its binary diff.

I didn't know this. Thanks for the info.

But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn't this be text?

If that's really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it's 100 lines!)

Agreed. But it is still text - which was the point I tried to make.

Why do you think it is not readable?

I think that an XML file that has huge lines is unreadable since a
human wouldn't recognize any structure, when all the elements are on a
single line.

My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable - so I completely agree with you that 8000 chars of
xml-data in a single line is far from being readable by a human. Anyway
- xml is meant to be processed by machines.

But even this case I would classify as text (I'm changing my earlier
definition slightly) if it does not contain binary data. The xml in a
file is semantics. And I assume the question text or binary refers to
syntax.

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

Sorry, I fail to see your point.

That's another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.

If I understand the original poster correctly he wants to
programmatically detect whether a file is "binary or text". My point was
that he shouldn't restrict his program artificially - but this depends on
context.

Are we really judging whether a file is text by how many memory pages
a diff will take or how many characters a patch has?

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text is usually well suited, as is anything else that is line oriented
and where typical changes affect only one or a few neighboring lines.

These are things I'm normally not concerned about, that's why I couldn't
follow that subject change.

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

Aka 'use the right tool for the job' + 'There is no single answer to
this question'?
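
For the diff-oriented reading of the question, a check on line length might look like the sketch below. The method name diff_friendly? and the default limit of 1000 bytes are hypothetical, picked only because that figure came up in this thread.

```ruby
# Sketch: treat a file as suitable for a line-based diff only if no line
# in it exceeds max_len bytes. Name and default limit are assumptions.
def diff_friendly?(name, max_len = 1000)
  File.open(name, "rb") do |io|
    # each_line with a limit yields at most max_len + 1 bytes per chunk,
    # so one huge line never has to be held in memory as a whole.
    io.each_line("\n", max_len + 1) do |chunk|
      return false if chunk.chomp.bytesize > max_len
    end
  end
  true
end
```

Whether 1000, 8000 or some other limit is right depends on the tool that consumes the file afterwards.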

Stefan

/*
What's the heuristic in Subversion?
*/

The Subversion FAQ
http://subversion.tigris.org/faq.html#binary-files has:
" ...
if any of the bytes are zero, or if more than 15% are not ASCII printing
characters,
then Subversion calls the file binary. This heuristic might be improved
in the future, however."
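
A minimal Ruby sketch of the rule quoted above. The method names, the 1024-byte sample size, and the exact set of bytes counted as "ASCII printing characters" (0x20-0x7E plus tab, LF, FF and CR) are assumptions, not taken from the Subversion source.

```ruby
# Sketch of the FAQ rule: binary if any byte is zero, or if more than
# 15% of the sampled bytes are not ASCII printing characters.
# Assumption: "printing" means 0x20..0x7E plus tab, LF, FF and CR.
PRINTABLE = ((0x20..0x7E).to_a + [0x09, 0x0A, 0x0C, 0x0D]).freeze

def svn_style_binary?(data)
  return false if data.nil? || data.empty?  # assumption: empty counts as text
  bytes = data.bytes
  return true if bytes.include?(0)
  non_printing = bytes.count { |b| !PRINTABLE.include?(b) }
  non_printing.to_f / bytes.size > 0.15
end

# Hypothetical file wrapper: only looks at the first sample_size bytes.
def svn_style_binary_file?(name, sample_size = 1024)
  svn_style_binary?(File.open(name, "rb") { |io| io.read(sample_size) })
end
```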

Regards, Gilbert

···

-----Original Message-----
From: Xavier Noria [mailto:fxn@hashref.com]
Sent: Tuesday, August 21, 2007 10:25 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

On Aug 21, 2007, at 10:21 AM, Rebhan, Gilbert wrote:

Is there an existing standard for what is considered a binary file,
meaning a rule like: check the first block of a file and

- if control characters (ASCII 0-32) and "high ASCII" (> 128) make up
more than 30 %, it's considered a binary file, otherwise a text file

- if no control characters (ASCII 0-32 and > 128) are found at all,
it's always considered a text file

??

* Xavier Noria <fxn@hashref.com> (15:24) wrote:

Note that Subversion would just treat the file as binary and process
it with its binary diff.

It also disables newline normalization (which may or may not be an
issue in that case).

Which is configurable for text files, too.

mfg, simon .... end of off topic

* Stefan Mahlitz <stefan@mahlitz-net.de> (20:46) wrote:

My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable

Well, there are a lot of XML files that I find readable, including many
that I or my software wrote.

Of course there are perversions like XMI and Microsoft's new formats.

- so I completely agree with you that 8000 chars of xml-data in a
single line is far from being readable by a human.

And thus it's binary and not text.

Anyway - xml is meant to be processed by machines.

It's meant to be read by an XML parser, which a regular diff isn't. So
only special cases are well suited for diff, and other special cases are
human readable.

But even this case I would classify as text (I'm changing my earlier
definition slightly) if it does not contain binary data.

I would say it's text if, when interpreted as text/plain, it's human
readable. Otherwise it's binary. That is, binary = for machines only.

If I understand the original poster correctly he wants to
programmatically detect whether a file is "binary or text". My point was
that he shouldn't restrict his program artificially - but this depends on
context.

Yes, in the original post he didn't say, for what purpose. If it's for
diffing the line structure is what matters.

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

The other approaches are good for deciding if the file contains text in
Latin-based scripts. That's only a small subset of text, and they will
happily classify base64 as text.
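
The base64 point is easy to check. The sketch below uses a string-based variant of the byte-ratio test from earlier in this thread (the name mostly_non_ascii? is made up for the example) and feeds it the same bytes raw and base64-encoded.

```ruby
# String-based variant of the byte-ratio heuristic from this thread:
# "binary" if more than a third of the bytes are NUL or >= 128.
# The method name is hypothetical.
def mostly_non_ascii?(data)
  return false if data.empty?
  suspicious = data.each_byte.count { |c| c >= 128 || c == 0 }
  suspicious.to_f / data.bytesize > 0.33
end

raw     = (0..255).to_a.pack("C*") * 4   # every byte value, clearly not text
encoded = [raw].pack("m")                # the same data as base64

puts mostly_non_ascii?(raw)       # prints true
puts mostly_non_ascii?(encoded)   # prints false
```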

Aka 'use the right tool for the job' + 'There is no single answer to
this question'?

Yes. Probably the best approach would be to use file(1).
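
Calling file(1) from Ruby could look like the sketch below. It assumes a file implementation that supports the --brief and --mime-encoding options and reports the encoding as "binary" for non-text data. The method name binary_by_file? is hypothetical, and it returns nil when no answer is available.

```ruby
# Sketch: ask file(1) for the character encoding of a file.
# Assumptions: `file` is on the PATH, accepts --brief and --mime-encoding,
# and prints "binary" for data it does not recognize as text.
def binary_by_file?(name)
  return nil unless File.file?(name)
  cmd = ["file", "--brief", "--mime-encoding", name]
  out = IO.popen(cmd, err: File::NULL, &:read)
  return nil unless $?.success?   # e.g. options not supported
  out.strip == "binary"
rescue Errno::ENOENT
  nil   # file(1) is not installed
end
```

This delegates the decision to the magic database shipped with file(1), so the result may differ between systems.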

mfg, simon .... l