Hi,
how can I test whether a file is binary or not?
There doesn't seem to be anything like File.binary?:
NoMethodError: undefined method `binary?' for File:Class
Any ideas or libraries available?
Regards, Gilbert
What do you need to achieve with this is_binary? method?
All files are just collections of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.
If I'd really need it I'd probably do a heuristic based on
distribution of byte values across an initial portion of the file.
Something like this:
class File
  # Heuristic: look at the distribution of byte values in the first 1 KB.
  def self.binary?(name)
    ascii = control = binary = 0
    # read returns nil for an empty file, hence the to_s
    File.open(name, "rb") {|io| io.read(1024)}.to_s.each_byte do |bt|
      case bt
      when 0...32
        control += 1
      when 32...128
        ascii += 1
      else
        binary += 1
      end
    end
    control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
  end
end
Kind regards
robert
gem install ptools
require 'ptools'
File.binary?(file)
Regards,
Dan
Hi,
On Tuesday, August 21, 2007 8:50 AM, dima wrote:
> What do you need to achieve with this is_binary? method?
> All files are just collections of bytes, so from one perspective they
> are all binary. We interpret them as suits our needs.
For example, this information is needed to decide whether
CVS should handle a file (or a file extension) as binary or ASCII.
Regards, Gilbert
Hi,
Robert Klemme wrote (Tuesday, August 21, 2007 9:05 AM):
/*
If I'd really need it I'd probably do a heuristic based on
distribution of byte values across an initial portion of the file.
[snip]
*/
Nice Thanks !!
Regards, Gilbert
* Robert Klemme <shortcutter@googlemail.com> (09:04) wrote:
> If I'd really need it I'd probably do a heuristic based on
> distribution of byte values across an initial portion of the file.
That only shows how many non-ASCII characters are used. It won't
recognise Russian script in UTF-8 as text, or uuencoded data as binary.
What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
is text if it's logically organized in short lines, and end-of-line
characters are used only for ending lines.
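class File
  def self.binary?(name)
    cr, len, mlen = false, 0, 0
    File.open(name, "rb") {|io| io.read(1024)}.to_s.each_byte do |bt|
      # a CR that is not followed by LF is not ending a line
      return true if cr and bt != 10
      cr = false
      case bt
      when 13
        cr = true
      when 10
        mlen = len if len > mlen
        len = 0
      else
        len += 1
      end
    end
    mlen = len if len > mlen  # also count an unterminated last line
    mlen > 1000
  end
end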
I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the maximum line length of the tool that processes
the file is what matters here. That tool is the right place to do the
check anyway.
Kind regards, simon .... l
Don't forget the possibility that a file is encoded in UTF-16 or
UTF-32. To recognize such textual data you need an extra recognition
step in front of the rest.
Wolfgang WoNáDo
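The extra recognition step mentioned above could look roughly like the following. This is only a sketch: bom_encoding is a made-up helper name, and it only recognizes files that start with a byte order mark, so UTF-16 or UTF-32 text without a BOM would still slip through.

```ruby
# Hypothetical helper (not from the thread): returns the encoding name
# suggested by a leading byte order mark, or nil if there is none.
def bom_encoding(name)
  # UTF-32LE must be tested before UTF-16LE, because both start with FF FE.
  boms = [
    ["\x00\x00\xFE\xFF".b, "UTF-32BE"],
    ["\xFF\xFE\x00\x00".b, "UTF-32LE"],
    ["\xEF\xBB\xBF".b,     "UTF-8"],
    ["\xFE\xFF".b,         "UTF-16BE"],
    ["\xFF\xFE".b,         "UTF-16LE"]
  ]
  # read returns nil for an empty file, hence the to_s
  head = File.open(name, "rb") { |io| io.read(4) }.to_s
  match = boms.find { |bom, _| head.start_with?(bom) }
  match && match[1]
end
```

A caller could run this first and only fall back to one of the byte-counting heuristics when it returns nil.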
One simple approach is this:
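class File
  def is_binary?
    ascii = 0  # despite the name, this counts NUL and high (>= 128) bytes
    total = 0
    # read returns nil at end of file, hence the to_s
    self.read(1024).to_s.each_byte{|c| total += 1; ascii +=1 if c >= 128 or c == 0}
    ascii.to_f / total.to_f > 0.33 ? true : false
  end
end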
You can tweak the 0.33 value if you like. Probably better (i.e. more robust) ways out there though.
Alex Gutteridge
Bioinformatics Center
Kyoto University
Hi,
>> What do you need to achieve with this is_binary? method?
> For example this information is needed to decide whether
> cvs should handle that file / that fileextension as binary or ascii
I'm impressed by the solutions of Alex and Robert. Anyway I
suppose in most cases a test on one single null character
will suffice. Something like this:
class File
  def binary?
    while (b=f.read(256)) do
      return true if b[ "\0"]
    end
  end
end
However, I recommend first checking whether you are going to read the
file later anyway. In that case you can abort reading when the file
fails a more sophisticated file type check.
Dividing files into "text" and "binary" is the archetypal
misdesign in the operating system you use. (Is there
anything designed well (besides Outlook, of course)?) The
distinction doesn't refer to the file's _contents_ but to how
the file is _treated_ when it is being read or written. In
"rb"/"wb" modes files are left as they are; in "r"/"w"
modes Windows programmers get line ends "\r\n" translated
into "\n", which disturbs file positions and string lengths.
I think the only purpose of this is to deter programmers
from doing anything in a non-Microsoft way.
Bertram
On Tuesday, 21 Aug 2007, 15:57:13 +0900, Rebhan, Gilbert wrote:
--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de
Simon Krahnke wrote:
> * Robert Klemme <shortcutter@googlemail.com> (09:04) wrote:
>> If I'd really need it I'd probably do a heuristic based on
>> distribution of byte values across an initial portion of the file.
> That only shows how many non-ASCII characters are used. It won't
> recognise Russian script in UTF-8 as text, or uuencoded data as binary.
> What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
> is text if it's logically organized in short lines, and end-of-line
> characters are used only for ending lines.
> [snip]
> I chose 1000 as the maximum line length, to fit whole paragraphs in one
> line. But of course the maximum line length of the tool that processes
> the file is what matters here. That tool is the right place to do the
> check anyway.
That's why ClearCase (on Windows) claimed my pure-ASCII XML file was
non-text (and refused to check it in). One line exceeded 8000 characters.
This is on my personal list of 'bad practices', but it may be
appropriate to others.
My 0.02EUR
Stefan
Sorry for the duplicate! Robert is too fast for me.
Alex Gutteridge
Bioinformatics Center
Kyoto University
Hi,
> class File
>   def binary?
>     while (b=f.read(256)) do
>       return true if b[ "\0"]
>     end
>   end
> end
This is a blunder, of course. Some better ones:
def File.binary? name
  open name, "rb" do |f|
    while (b=f.read(256)) do
      return true if b[ "\0"]
    end
  end
  false
end
def File.binary? name
  open name, "rb" do |f|
    f.each_byte { |x|
      x.nonzero? or return true
    }
  end
  false
end
Just to be correct.
Bertram
On Tuesday, 21 Aug 2007, 18:06:26 +0900, Bertram Scharpf wrote:
--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de
* Stefan Mahlitz <stefan@mahlitz-net.de> (22:40) wrote:
> That's why clearcase (on windows) claimed my pure-ascii xml-file was
> non-text (and did refuse to check it in). One line exceeded 8000 characters.
You can't seriously treat a file with lines longer than 8000 characters
as line oriented. It's far from being readable by a human. You declare
that file as application/xml.
One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.
> This is on my personal list of 'bad practices', but it may be
> appropriate to others.
I think it's bad practice to declare something with huge lines as text.
Kind regards, simon .... l
It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic. Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from "file").
Btw, you should get rid of the ternary operator - it's totally
superfluous because there is no point in converting a boolean value
into a boolean value.
Kind regards
robert
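The comparison suggested above could be set up along these lines. This is only a sketch: disagreements is a made-up name, and the two lambdas are simplified stand-ins for the approaches in this thread (a NUL-byte test and a ratio test with the 0.33 threshold), not the posted code itself.

```ruby
# Hypothetical harness (not from the thread): run several detectors over the
# same files and collect the files on which their verdicts differ.
def disagreements(paths, detectors)
  paths.each_with_object({}) do |path, result|
    verdicts = detectors.transform_values { |check| !!check.call(path) }
    result[path] = verdicts if verdicts.values.uniq.size > 1
  end
end

# Stand-in detector 1: any NUL byte in the first 1 KB means binary.
nul_check = ->(path) { File.binread(path, 1024).to_s.include?("\0") }

# Stand-in detector 2: more than a third NUL or high bytes means binary.
ratio_check = lambda do |path|
  sample = File.binread(path, 1024).to_s
  next false if sample.empty?
  odd = sample.each_byte.count { |b| b >= 128 || b == 0 }
  odd.to_f / sample.bytesize > 0.33
end

detectors = { nul: nul_check, ratio: ratio_check }
```

Calling disagreements(Dir.glob("**/*").select { |p| File.file?(p) }, detectors) would then list only the files worth a closer look; a third detector wrapping the file command could be added to the hash in the same way.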
2007/8/21, Alex Gutteridge <alexg@kuicr.kyoto-u.ac.jp>:
Sorry for the duplicate! Robert is too fast for me.
Simon Krahnke wrote:
> * Stefan Mahlitz <stefan@mahlitz-net.de> (22:40) wrote:
>> That's why clearcase (on windows) claimed my pure-ascii xml-file was
>> non-text (and did refuse to check it in). One line exceeded 8000 characters.
> You can't seriously treat a file with lines longer than 8000 characters
> as line oriented. It's far from being readable by a human. You declare
> that file as application/xml.
Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).
But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn't this be text?
Why do you think it is not readable?
> One small change in that line will produce a patch of more than 8000
> characters. And if that change is at the end of the line the diff tool
> will have to use 4 pages of memory for the compare.
Sorry, I fail to see your point. Are we really judging whether a file is
text by how many memory pages a diff will take or how many characters a
patch has?
I couldn't find a definition of text except that text means absence of
binary data. This is weak - so I would follow your definition - A text
file is a file which can be read by a human.
>> This is on my personal list of 'bad practices', but it may be
>> appropriate to others.
> I think it's bad practice to declare something with huge lines as text.
Well, I disagree.
But to get (slightly at least) on topic again: if I had to detect
whether a file is text, I would go with a combination of Robert Klemme's
and Bertram Scharpf's solutions.
Stefan
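One way such a combination might look, purely as a sketch: binary_file? is a made-up name, and treating tab, LF and CR as ordinary text bytes is an assumption that is not in either of the original solutions.

```ruby
# Hypothetical combination (not from the thread): a NUL-byte test first,
# then a byte-distribution test on the same sample.
def binary_file?(name, sample_size = 1024)
  # read returns nil for an empty file, hence the to_s
  sample = File.open(name, "rb") { |io| io.read(sample_size) }.to_s
  return false if sample.empty?          # assumption: empty files count as text
  return true  if sample.include?("\0")  # the single-NUL test

  ascii = control = high = 0
  sample.each_byte do |bt|
    case bt
    when 9, 10, 13 then ascii += 1    # assumption: tab, LF, CR are text
    when 0...32    then control += 1
    when 32...128  then ascii += 1
    else                high += 1
    end
  end
  return true if ascii.zero?
  # thresholds taken from the byte-distribution heuristic posted earlier
  control.to_f / ascii > 0.1 || high.to_f / ascii > 0.05
end
```

The NUL test catches most executables and images cheaply, while the ratio test still flags NUL-free data; text in UTF-8 with many non-ASCII characters would still be reported as binary by the second step.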
/*
It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic.
*/
You mean it should be something like this?
class File
  def self.is_binary?(name)
    ascii = total = 0
    # read returns nil for an empty file, hence the to_s
    File.open(name, "rb") { |io| io.read(1024) }.to_s.each_byte do |c|
      total += 1
      ascii += 1 if c >= 128 or c == 0
    end
    ascii.to_f / total.to_f > 0.33
  end
end
/*
Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from "file").
*/
Is there an existing standard for what is considered a binary file,
i.e. a rule like: check the first block of a file and
- if control characters (ASCII 0-32) and "high ASCII" (> 128) reach
  30 %, it's considered a binary file, otherwise a text file
- if no control characters (ASCII 0-32) or "high ASCII" (> 128) are
  found at all, it's always considered a text file
?
Regards, Gilbert
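The rule asked about above can be written down as follows. This is only a sketch of that rule, not an established standard: looks_binary? is a made-up name, the block size and the 0.30 threshold are assumptions, and tab, LF, FF and CR are deliberately not counted as control characters, since otherwise ordinary text with short lines would be flagged.

```ruby
# Hypothetical implementation of the "share of odd bytes in the first block"
# rule; names, block size and threshold are assumptions.
def looks_binary?(name, block_size = 512, threshold = 0.30)
  # read returns nil for an empty file, hence the to_s
  block = File.open(name, "rb") { |io| io.read(block_size) }.to_s
  return false if block.empty?
  odd = block.each_byte.count do |bt|
    # high bytes, plus control characters other than tab, LF, FF, CR
    bt >= 128 || (bt < 32 && ![9, 10, 12, 13].include?(bt))
  end
  odd.to_f / block.bytesize > threshold
end
```

With this rule a file without any odd bytes is always text, as in the second bullet; non-ASCII text in UTF-8 can exceed the threshold, which is the objection raised earlier in the thread.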
* Stefan Mahlitz <stefan@mahlitz-net.de> (09:25) wrote:
> Maybe this was a bad example. You are right, the xml-file would be best
> treated by clearcase as application/xml or text/xml. This did not work
> (and I was bitten by this recently - so this strange behaviour was fresh
> when I read your email).
Note that Subversion would just treat the file as binary and process it
with its binary diff.
> But I cannot see the problem with text-files containing long lines. If I
> write a single paragraph with more than 1000 or 8000 characters - why
> shouldn't this be text?
If that's really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it's 100 lines!)
> Why do you think it is not readable?
I think that an XML file that has huge lines is unreadable since a
human wouldn't recognize any structure when all the elements are on a
single line.
>> One small change in that line will produce a patch of more than 8000
>> characters. And if that change is at the end of the line the diff tool
>> will have to use 4 pages of memory for the compare.
> Sorry, I fail to see your point.
That's another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.
> Are we really judging whether a file is text by how many memory pages
> a diff will take or how many characters a patch has?
No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.
Text usually is as well suited as everything else that is line oriented
and where typical changes affect only one or a few neighboring lines.
Kind regards, simon .... l
What's the heuristic in Subversion?
-- fxn
# Is there an existing standard for what is considered a binary file,
If you're on a *nix (non-Windows) box, you can use the OS's file command and just wrap it in Ruby:
irb(main):022:0> def is_bin(f)
irb(main):023:1> %x(file #{f}) !~ /text/
irb(main):024:1> end
=> nil
irb(main):025:0> is_bin "test.rb"
=> false
irb(main):026:0> is_bin "test.txt"
=> false
irb(main):027:0> is_bin "/usr/local/bin/dnscache"
=> true
irb(main):028:0> is_bin "/bin/ps"
=> true
irb(main):029:0> def is_text(f)
irb(main):030:1> %x(file #{f}) =~ /text/
irb(main):031:1> end
=> nil
irb(main):032:0> is_text "test.rb"
=> 27
irb(main):033:0> is_text "test.txt"
=> 16
irb(main):034:0> is_text "/usr/local/bin/dnscache"
=> nil
irb(main):035:0> is_text "/bin/ps"
=> nil
kind regards -botp
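A slightly more defensive variant of the same idea, as a sketch: text_file? is a made-up name. It passes the path as a separate argument instead of interpolating it into a shell string, so names with spaces or shell metacharacters are not a problem, and it asks file for brief output so that a file name containing "text" cannot cause a false match. It assumes a file command that understands the --brief option.

```ruby
require "open3"

# Hypothetical wrapper around the Unix file command (not from the thread).
def text_file?(path)
  # No shell is involved: each argument is passed to file as-is.
  out, status = Open3.capture2("file", "--brief", "--", path)
  raise "file command failed for #{path}" unless status.success?
  out.include?("text")
end
```

text_file?("test.rb") should then behave like is_text above, but return true or false instead of a match position or nil.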