Stefan Lang wrote:
Ruby must choose between treating all external data as
text unless told otherwise and treating it all as binary
unless told otherwise, because there is no general way
to know whether a file is binary or text.
Yes (and I wouldn't want it to try to guess)
Given that Ruby is mostly used to work with text, it's a
sensible decision to use text mode by default.
That's where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshal, DRb...
Point taken.
Furthermore, as the OP
demonstrated, there are plenty of use cases where the files presented
are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.
I don't want my programs to crash in these cases.
Let's compare. Situation: I'm reading binary files and
forget to specify the "b" flag when opening the file(s).
Result in Ruby 1.8:
* My stuff works fine on Linux/Unix. Somebody else runs
the script on Windows, and it silently corrupts data because
Windows does line-ending conversion.
Result in Ruby 1.9:
* On the first run on my Linux machine, I get an EncodingError.
I fix the problem by specifying the "b" flag on open. Done.
I definitely prefer Ruby 1.9 behavior.
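To make the comparison concrete, here's a rough sketch of my own
(assuming some binary file like a PNG; the exact error class you hit
depends on the string operation):

  # Ruby 1.9, "b" flag forgotten: data is tagged with the locale encoding
  data = File.open("image.png") { |f| f.read }
  data.encoding          # => e.g. #<Encoding:UTF-8>
  data.valid_encoding?   # => most likely false for binary data
  data =~ /\w+/          # raises ("invalid byte sequence ..." or similar)

  # Ruby 1.9, with the "b" flag: plain bytes, nothing to go wrong
  data = File.open("image.png", "rb") { |f| f.read }
  data.encoding          # => #<Encoding:ASCII-8BIT>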
It has to default to some encoding.
That's where I also disagree. It can default to a stream of bytes.
File.open("....", :encoding => "ENV") # Follow the environment
This is the default.
That's what I don't want. Given this default, I must either:
That's assuming the default is always wrong.
(1) Force all my source to have the correct encoding flag set
everywhere. If I don't test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they'd have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check that no exception is raised.
(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.
Encoding.default_external=
I guess I can use that at the top of everything in the bin/ directory. It
may be sufficient, but it's annoying to have to remember that too.
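Something like this at the top of each script, I mean (untested
sketch; pick whatever encoding the application actually wants):

  #!/usr/bin/env ruby
  # Pin the external encoding instead of trusting the user's locale.
  Encoding.default_external = Encoding::UTF_8
  # ... rest of the script; File.open etc. now default to UTF-8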
Here's why the default is good, IMO. The cases where I really
don't want to explicitly specify encodings are when I write one-liners
(-e) and short throwaway scripts. If the default encoding were binary,
string operations would deal incorrectly with German (my native language)
accents. Using the locale encoding does the right thing here.
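For example (my quick illustration, assuming a UTF-8 locale):

  $ echo -n "Grüße" | ruby -e 'p $stdin.read.length'
  5

That's 5 characters, because $stdin follows the locale encoding; with
a binary default the same one-liner would report 7, the byte count.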
If I write a longer program, explicitly setting the default external
encoding isn't an effort worth mentioning. Set it to ASCII_8BIT
and it behaves like 1.8.
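For example (sketch from memory):

  # near the top of the program
  Encoding.default_external = Encoding::ASCII_8BIT

or equivalently from the command line:

  $ ruby -E ASCII-8BIT script.rb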
Stefan
···
2009/2/16 Brian Candler <b.candler@pobox.com>: