Determining if a file is binary or text

Hi all,

I need to search text files for a given expression and flag a warning/
error if that expression does not exist. I'm going to search a large
number of files using the Linux "find" command, so I won't know if
they are binary or text.

I realize that this can be OS-dependent and can be tricky to
determine. I was going to use the Linux "file" command which works
well in providing human-readable information about the file; however,
due to a variety of possible file types, I cannot easily determine the
file type without specifying every single possible text file format to
consider. For example, the "file" command can produce the following
(all of which are ASCII):

ASCII text
XML document text
Lisp/Scheme program text
...

Is there an easy way to do this in Ruby? After looking around quite a
bit, I thought about looking at a few first lines of the file and
matching against this regular expression:

# Character class:
# [:print:] Any printable character, including space
line.match(/^[[:print:]]+$/)

Which I believe could work. Any comments?

Thanks,
-James

This question is not well defined.

Think about UTF8 and ISO-8859-1...

Basically, stop and think what you *mean* by "binary or text". Once you've
articulated that more clearly, you may well have a much better notion of what
you mean.

Would you be expecting to not see this message in a "binary" file? If so,
why are they different? What about binary files makes them not need the
message (or what about text files makes them not need it...)? If you mean
"executables", you might approximate decently by checking the execute
permission bit...

-s

···

On 2009-09-18, James Masters <james.d.masters@gmail.com> wrote:

I need to search text files for a given expression and flag a warning/
error if that expression does not exist. I'm going to search a large
number of files using the Linux "find" command, so I won't know if
they are binary or text.

--
Copyright 2009, all wrongs reversed. Peter Seebach / usenet-nospam@seebs.net
| Seebs.Net <-- lawsuits, religion, and funny pictures
Fair game (Scientology) - Wikipedia <-- get educated!

Evening James.

...

For example, the "file" command can produce the following

(all of which are ASCII):

ASCII text
XML document text
Lisp/Scheme program text

What about file -i which returns the MIME type instead of "human readable"
format. That should limit the choices it will return or at least give you
something you can work with.
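
A rough sketch of how this suggestion might be driven from Ruby, assuming the `file` utility is on the PATH and accepts `--brief` and `--mime-type` (the usual Linux implementation does). The helper names `mime_type_of` and `text_mime?` are made up for illustration, and treating only `text/*` as text is an assumption: `file` reports some textual formats, JSON for example, under `application/*`.

```ruby
require 'open3'

# Ask file(1) for the MIME type of +path+, e.g. "text/plain".
# Returns nil if the command exits with a failure status.
def mime_type_of(path)
  out, status = Open3.capture2('file', '--brief', '--mime-type', path)
  status.success? ? out.strip : nil
end

# Assumption: anything reported as text/* counts as text.
def text_mime?(mime)
  !mime.nil? && mime.start_with?('text/')
end
```

Something like `text_mime?(mime_type_of(path))` would then replace the long list of human-readable descriptions with a single prefix test.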

John

···

On Fri, Sep 18, 2009 at 4:15 PM, James Masters <james.d.masters@gmail.com> wrote:

Just using a single "+" seems too unsafe to me: you need only three matching bytes which does not seem too unlikely even for binary files.

Some more random thoughts: if you use Ruby to determine file types, you might as well use Find.find to find all the files, which removes the dependency on an external program.
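
For illustration, a minimal sketch of walking a tree with Find.find instead of the external "find" command. The method name `regular_files` is hypothetical; it simply keeps the entries that are regular files.

```ruby
require 'find'

# Collect every regular file under +root+, skipping directories
# and other non-file entries.
def regular_files(root)
  Find.find(root).select { |path| File.file?(path) }
end
```

Each returned path can then be handed to whatever text-or-binary test is chosen.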

A completely different approach would be to define classes of bytes and do statistics on the first n bytes from the file, e.g.

32-127, \r, \n, \t printable
0-31 without \n, \t, \r, 128-255 non printable

Then determine based on ratio of occurrences. Of course, that approach can also be tricky...
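
A sketch of the statistical idea, using the two byte classes listed above. The sample size and the 30% threshold are hypothetical values chosen only for the example, and all of the names are illustrative.

```ruby
# Byte classes from the post: 32-127 plus \t, \n, \r count as printable;
# everything else counts as non-printable.
PRINTABLE = ((32..127).to_a + [9, 10, 13]).freeze

# Fraction of non-printable bytes in +bytes+ (a String); 0.0 when empty.
def non_printable_ratio(bytes)
  return 0.0 if bytes.empty?
  bad = bytes.each_byte.count { |b| !PRINTABLE.include?(b) }
  bad.to_f / bytes.bytesize
end

# SAMPLE_SIZE and THRESHOLD are hypothetical values, not from the thread.
SAMPLE_SIZE = 4096
THRESHOLD   = 0.30

# Classify +path+ by the ratio found in its first SAMPLE_SIZE bytes.
def probably_binary?(path)
  sample = File.binread(path, SAMPLE_SIZE).to_s
  non_printable_ratio(sample) > THRESHOLD
end
```

Tuning the threshold against a handful of known files is probably the only way to find a value that suits a given project.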

Kind regards

  robert

···

On 19.09.2009 01:14, James Masters wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

By convention, source and object files use standardized file-type
extensions, which should help you weed out files to ignore.

As a starting point ask the developers what file-type extensions
they're using. As a second check, run something like the following
commands at the top of the path you'll be checking:

     find . | xargs -n1 basename | egrep '\.\w+$' | awk -F. '{print $NF}' | sort -u

to give you a list of possible extensions, then check those too.
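
If the survey is done from Ruby rather than the shell, the same list might be built roughly like this. The method name `extensions` is made up for the example; it takes a list of paths and relies on File.extname, which returns only the last extension.

```ruby
# Return the sorted, unique extensions (without the dot) found in +paths+.
# Files without an extension, and dotfiles such as .bashrc, are skipped.
def extensions(paths)
  paths.map { |p| File.extname(p).sub(/\A\./, '') }
       .reject(&:empty?)
       .uniq
       .sort
end
```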

Use "file" and "file -i" to do a best-guess once you've narrowed your
possibilities. Both use "magic" files which define where file should
look inside a target file to determine what type it is. They are
fallible though and you can get false positives. Do a "man magic" from
the command-line on your Linux box for more info.

Also, be careful about assuming that only binary files contain \x00
bytes or high-order (non-ASCII) bytes. Old text files that have
migrated from other systems could have them, as could files where
someone ALT+fat-fingered on the keypad, or a source file from a
non-English-speaking country where the developer used variable names
in their native language. You just never know what you'll find in
those pesky source files.

James Masters wrote:

Hi all,

I need to search text files for a given expression and flag a warning/
error if that expression does not exist. I'm going to search a large
number of files using the Linux "find" command, so I won't know if
they are binary or text.

require 'ptools'

File.binary?(your_file)

Regards,

Dan

How about a file that contains any single byte character (0-255) that
you cannot find a key for on a standard US keyboard (English)? The
[:print:] regular expression character set comprises the range of
characters 32-126, which is what I believe that I need, but I wanted
to see if there are better ways to accomplish this.

Basically I'm trying to search for the presence of a header in source
code files (which may have various extensions or no extensions at
all). The source code files are mixed with executable and non-
executable "binary" files (data files; not something that you can
read). I don't want to flag the non-source code files as not having a
header. The scope of this problem is small so I don't need to worry
about any character sets, etc.

I realize that this can be a complicated problem to solve, but there
are solutions to it. For example, the Linux "file" command is a
robust solution but does not meet my needs for the previously stated
reason. I also know that SVN can automatically detect binary files as
well.

Hopefully this helps clear things up...

Thanks,
-James

···

On Sep 18, 5:12 pm, Seebs <usenet-nos...@seebs.net> wrote:

Basically, stop and think what you *mean* by "binary or text". Once you've
articulated that more clearly, you may well have a much better notion of what
you mean.

Hi John - that's a good idea - I looked over the "file" command
options over and over again today and somehow I missed this.

···

On Sep 18, 8:47 pm, John W Higgins <wish...@gmail.com> wrote:

What about file -i which returns the MIME type instead of "human readable"
format. That should limit the choices it will return or at least give you
something you can work with.

Robert Klemme wrote:

Just using a single "+" seems too unsafe to me: you need only three matching bytes which does not seem too unlikely even for binary files.

Some more random thoughts: if you use Ruby to determine file types, you might as well use Find.find to find all the files, which removes the dependency on an external program.

A completely different approach would be to define classes of bytes and do statistics on the first n bytes from the file, e.g.

32-127, \r, \n, \t printable
0-31 without \n, \t, \r, 128-255 non printable

I have a problem with treating 128-255 as non-printable. A lot of these characters are printable and can be part of text; I use the Alt-0xxx keys in PageMaker a lot, for example. The other problem with saying a file is not a text file is determining what is meant by a text file. Is it strictly a file with only ASCII text, like a log file, or does it include formatted text, like a word processor file? Word processing and spreadsheet files contain many characters that are considered non-printable but display as text in the correct program.

How about a file that contains any single byte character (0-255) that
you cannot find a key for on a standard US keyboard (English)? The
[:print:] regular expression character set comprises the range of
characters 32-126, which is what I believe that I need, but I wanted
to see if there are better ways to accomplish this.

Well, you probably also want tabs and newlines. :-)

I would think that [:print:] might also, in some locales, get you things
like accented letters. Whether or not you want this is harder to say.

Basically I'm trying to search for the presence of a header in source
code files (which may have various extensions or no extensions at
all). The source code files are mixed with executable and non-
executable "binary" files (data files; not something that you can
read). I don't want to flag the non-source code files as not having a
header. The scope of this problem is small so I don't need to worry
about any character sets, etc.

I thought that until I found a dozen Makefiles with copyright symbols
embedded in them. :-P

I'd say as a first approximation, just check for NUL bytes. I'm pretty
sure that the vast majority of binary files will contain at least one,
and the vast majority of text files will contain none.
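
A minimal sketch of the NUL-byte first approximation. Reading only the first 8 KiB is an assumption made for the example (a file whose first NUL appears later would be treated as text), and both method names are illustrative.

```ruby
# True if +bytes+ (a String) contains at least one NUL (0x00) byte.
def contains_nul?(bytes)
  bytes.each_byte.any?(&:zero?)
end

# Assumption: inspecting only the first +sample_size+ bytes is enough.
# File.binread returns nil for an empty file, hence the to_s.
def binary_by_nul?(path, sample_size = 8192)
  contains_nul?(File.binread(path, sample_size).to_s)
end
```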

-s

FWIW Subversion flags binaries automatically.

If svn does that, I guess there's gonna be some heuristics that work
reasonably well in practice.

Robert Klemme wrote:

Just using a single "+" seems too unsafe to me: you need only three matching bytes which does not seem too unlikely even for binary files.

Some more random thoughts: if you use Ruby to determine file types, you might as well use Find.find to find all the files, which removes the dependency on an external program.

A completely different approach would be to define classes of bytes and do statistics on the first n bytes from the file, e.g.

32-127, \r, \n, \t printable
0-31 without \n, \t, \r, 128-255 non printable

I have a problem with treating 128-255 as non-printable. A lot of these characters are printable and can be part of text; I use the Alt-0xxx keys in PageMaker a lot, for example.

That was just an example. Of course you can use a different classification (for example, adding a third category for 128-255). I assume those characters are comparatively rare in text files, so the general approach would still work.

The other problem with saying a file is not a text file is determining what is meant by a text file. Is it strictly a file with only ASCII text, like a log file, or does it include formatted text, like a word processor file? Word processing and spreadsheet files contain many characters that are considered non-printable but display as text in the correct program.

I fully agree: the difficult part is deciding what a text file is. Once that has been clarified, the algorithm for checking should become much more obvious.

Cheers

  robert

···

On 20.09.2009 00:03, Michael W. Ryder wrote:

Well, you probably also want tabs and newlines. :-)

Ah, good point... :-)

I thought that until I found a dozen Makefiles with copyright symbols
embedded in them. :-P

I'd say as a first approximation, just check for NUL bytes. I'm pretty
sure that the vast majority of binary files will contain at least one,
and the vast majority of text files will contain none.

Yeah, this is another idea that I had also considered... I'm just not
sure if all of the binary files that I'm dealing with have NUL bytes
though. But that might just be good enough.

Fortunately, I'm working with a small team of individuals who will be
authoring the files, so I do have some control over the type of text
that I'm looking for. So I might try [:print:], \n, \t, and maybe \r
(just in case) and then fall back on the NUL idea as a Plan B.

Thanks again,
-James

···

On Sep 18, 7:54 pm, Seebs <usenet-nos...@seebs.net> wrote:

I agree - this is what it comes down to. BTW, I tried the following
on my project (using Find.find to get the tree) on the first 40
"lines" (which I know can theoretically be very short or very long in
a "binary" file) and it seems to work for what I'm doing. This also
works well for me because I can do this check while the file is still
open to check for the presence of a header:

line.match(/^[[:print:]\t\n\r]+$/)
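
A sketch of how the line check and the header check might share one pass over the first 40 lines. Two things here are assumptions: the `HEADER` pattern is a placeholder for whatever header text is really being searched for, and the explicit byte range 0x20-0x7E stands in for `[[:print:]]` so that the result does not depend on locale or encoding (this works when the file is opened in binary mode). The method name `check_header` is illustrative.

```ruby
# Printable ASCII (0x20-0x7E) plus tab, CR, and LF. The explicit range
# stands in for [[:print:]] so the result does not depend on encoding.
TEXT_LINE = /\A[\x20-\x7E\t\r\n]*\z/

# HEADER is a hypothetical pattern; substitute the real header text.
HEADER = /Copyright/

# Inspect at most +max_lines+ lines of +source+ (anything that responds
# to each_line, such as a File opened in binary mode or a String).
# Returns :binary, :ok, or :missing_header.
def check_header(source, max_lines = 40)
  lines = source.each_line.first(max_lines)
  return :binary unless lines.all? { |line| line.match?(TEXT_LINE) }
  lines.any? { |line| line.match?(HEADER) } ? :ok : :missing_header
end
```

With a real file this could be called as `File.open(path, 'rb') { |f| check_header(f) }`, so files classified as `:binary` are never flagged for a missing header.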

But probably a better approach would be to use the ratio of printable
characters to those that are traditionally non-printable, in the event
that some "non-printable" characters are present in a text file. This
is what SVN does (I found the link in a post from Xavier on a "ptools"
website when I Googled it):

http://subversion.tigris.org/faq.html#binary-files

And it also appears to be what File.binary? is doing in ptools (I
checked the source code; thanks Dan for the pointer).

···

On Sep 20, 2:17 am, Robert Klemme <shortcut...@googlemail.com> wrote:

I fully agree: the difficult part is in deciding: what is a text file?
If that has been clarified enough the algorithm for checking should
become much more obvious.

How many files are you dealing with?

Hmm. Some source files (scripts, say) will be executable, so you can't
assume executables are binaries. But... You might want to experiment with
testing a few likely heuristics and maybe making a chart. Say, make a list
of:

TEST:     .jpg   x-bit   NUL   128-255

FILE:
foo.jpg    X       -      X       X
foo.sh     -       X      -       -
...

and then look to see whether you can make some simple rules, like
"everything with .jpg or .gif is definitely a binary." If you can
get a couple of simple rules that deal with 90% or so of the files,
then you can look at the remainder as a separate case and work from
there.

Don't feel compelled to make a single perfect test when three easy tests
that handle 70% of the cases might give you a remaining pool for which
it's much easier to write a good test.
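
One way such a chart could be generated, as a sketch only. The extension list and the 4 KiB sample size are hypothetical choices, the x-bit test looks at the permission bits directly, and the method names are made up for the example.

```ruby
# Extensions treated as "definitely binary"; a hypothetical starting list.
BINARY_EXTENSIONS = %w[.jpg .gif].freeze

# One row of the chart: which cheap tests fire for +path+.
def feature_row(path, sample_size = 4096)
  sample = File.binread(path, sample_size).to_s
  {
    ext:   BINARY_EXTENSIONS.include?(File.extname(path).downcase),
    x_bit: (File.stat(path).mode & 0o111) != 0,
    nul:   sample.each_byte.any?(&:zero?),
    high:  sample.each_byte.any? { |b| b >= 128 }
  }
end

# Render a row the way the chart does: "X" when a test fires, "-" otherwise.
def format_row(path, row)
  marks = row.values.map { |hit| hit ? 'X' : '-' }
  ([File.basename(path)] + marks).join(' ')
end
```

Printing `format_row(path, feature_row(path))` for every file in the tree gives a table that can be sorted and eyeballed for simple rules.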

-s

···

On 2009-09-19, James Masters <james.d.masters@gmail.com> wrote:

Fortunately, I'm working with a small team of individuals who will be
authoring the files, so I do have some control over the type of text
that I'm looking for. So I might try [:print:], \n, \t, and maybe \r
(just in case) and then fall back on the NUL idea as a Plan B.