Read text and binary files

Eric_Peterson · 30 June 2014 17:14

Someone just wrote: "we're hungry for some actual Ruby discussion." So
here's my question worthy of 2¢ of all y'all's time (but not much more).

I was playing around and thinking I'd like to look at files and see what
sort of Unicode characters are in each and how many. I was using this to
validate some data files that were sent to us before I attempted to upload
them to the database.

Using Perl I was able to do this:

\\\\\\ perl incomplete snippet /////
use open IO => ':utf8'; # all I/O in utf8
no warnings 'utf8'; # but ignore utf-8 warnings
binmode( STDIN, ":utf8" );
binmode( STDOUT, ":utf8" );
binmode( STDERR, ":utf8" );
use Unicode::UCD 'charinfo';

open( my $fh, '<', $file ) or die "Unable to open $file - $!\n";
while ( $line = <$fh> ) {
  my @chars = split( //, $line );
  foreach my $char ( @chars )
...
    $info->{code}
    $info->{name}
...
///// perl incomplete snippet \\\\\\

\\\\\\ perl output /////
      Dec  Hex     Letter  Count  Desc

  1     9  0x0009  [HT]        2  C0 Control Character - Horizontal Tabulation (^I \t)
  2    10  0x000A  [LF]      332  C0 Control Character - Line Feed (^J \n)
  3    32  0x0020  [SP]    1,821  Space
  4    33  0x0021  [!]         7  EXCLAMATION MARK
  5    34  0x0022  ["]        42  QUOTATION MARK
///// perl output \\\\\\\

OK, so now I want to try the same in Ruby. Where the Perl script above can
read both text and binary files, the Ruby snippet below can only handle text
files. The reason I'd like it to read binary files is that vendors
occasionally send bad files containing characters that make the file
register as binary. I'd like to tell the vendor which extraneous characters
they have included in the file and which lines they are on.

I thought of using "rb:utf-8:-" in the File.open call, but that didn't work
either.

Any ideas?

\\\\\\ ruby /////
#! /usr/bin/env ruby
# -*- encoding: utf-8 -*-
require "unicode_utils"
File.open( fn, "r:utf-8:-" ) do |input|
  input.each_line do |line|
    line.each_char do |c|
      puts UnicodeUtils.char_name( c )
...
///// ruby \\\\\\\

--

Addis_Aden · 1 July 2014 09:06

Hi,

I am not so familiar with Unicode, but the difference between binary and
text files is that in binary mode every non-ASCII byte is shown as \x.., so
Unicode characters also appear as two or more \x.. escapes.

Maybe you can read the string as binary first and then use the
force_encoding method (see Class: String, Ruby 2.0.0) to set it to UTF-8.
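
A minimal sketch of that idea, using only core Ruby: read each line in binary mode, relabel it as UTF-8 with force_encoding, and collect the bytes that do not form valid UTF-8 together with their line numbers. The method name invalid_utf8_report and the report layout are assumptions made for illustration; they are not from the original posts.

```ruby
# Hypothetical helper (name and return shape are illustrative only).
# Returns an array of [line_number, [escaped_bad_bytes, ...]] pairs.
def invalid_utf8_report(path)
  report = []
  File.open(path, "rb") do |io|                 # binary mode: lines arrive as ASCII-8BIT
    io.each_line.with_index(1) do |line, lineno|
      text = line.force_encoding(Encoding::UTF_8)  # relabels the bytes, converts nothing
      next if text.valid_encoding?

      # each_char still walks a broken string; invalid pieces fail valid_encoding?
      bad = text.each_char.reject(&:valid_encoding?).map do |c|
        c.bytes.map { |b| format("\\x%02X", b) }.join
      end
      report << [lineno, bad]
    end
  end
  report
end

# Usage: ruby report.rb some_vendor_file
if ARGV[0]
  invalid_utf8_report(ARGV[0]).each do |lineno, bad|
    puts "line #{lineno}: #{bad.join(' ')}"
  end
end
```

Because force_encoding only changes the label on the string, the original bytes stay available for reporting back to the vendor.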

How many files do you have to examine?

best regards
adrian

--

Robert_K1 · 1 July 2014 09:43

You were pretty close:

$ ruby -e 'File.open("xx", "rb") {|io| p io.external_encoding}'
#<Encoding:ASCII-8BIT>

ASCII-8BIT is binary:

$ ruby -e 'p Encoding::BINARY'
#<Encoding:ASCII-8BIT>

For comparison, text mode uses the default external encoding (UTF-8 here):

$ ruby -e 'File.open("xx", "r") {|io| p io.external_encoding}'
#<Encoding:UTF-8>
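
Building on the "rb" mode above, here is a rough sketch of a per-character tally similar in spirit to the Perl output. It uses only core Ruby and skips character names, so it does not need unicode_utils. The method name char_tally and the key format are assumptions for illustration only.

```ruby
# Hypothetical helper (name and key format are illustrative only).
# Counts every character in the file; bytes that are not valid UTF-8
# are counted under an "invalid byte(s) ..." key instead of raising.
def char_tally(path)
  tally = Hash.new(0)
  File.open(path, "rb") do |io|                  # external encoding is ASCII-8BIT
    io.each_line do |line|
      line.force_encoding(Encoding::UTF_8).each_char do |c|
        key =
          if c.valid_encoding?
            format("U+%04X", c.ord)              # code point of a well-formed character
          else
            "invalid byte(s) " + c.bytes.map { |b| format("%02X", b) }.join(" ")
          end
        tally[key] += 1
      end
    end
  end
  tally
end

# Usage: ruby tally.rb some_vendor_file
if ARGV[0]
  char_tally(ARGV[0]).sort.each { |key, count| puts "#{key}\t#{count}" }
end
```

Opening with "rb" means Ruby never tries to interpret the bytes on the way in, so a file with stray bytes can still be read to the end.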

Does that help?

Kind regards

robert

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/
