Stefan Lang wrote:
Ruby must choose between treating all external data as
text unless told otherwise and treating it all as binary
unless told otherwise, because there is no general way
to know whether a file is binary or text.
Yes (and I wouldn't want it to try to guess)
Given that Ruby is mostly used to work with text, it's a
sensible decision to use text mode by default.
That's where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshal, DRb...
Point taken.
Furthermore, as the OP
demonstrated, there are plenty of use cases where the files presented
are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.
I don't want my programs to crash in these cases.
Let's compare. Situation: I'm reading binary files and
forget to specify the "b" flag when opening the file(s).
Result in Ruby 1.8:
* My stuff works fine on Linux/Unix. Somebody else runs
the script on Windows, and it silently corrupts data because
Windows does line-ending conversion.
Result in Ruby 1.9:
* On the first run on my Linux machine, I get an EncodingError.
I fix the problem by specifying the "b" flag on open. Done.
I definitely prefer Ruby 1.9 behavior.
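To make the comparison concrete, here's a rough sketch of my own
(assuming some binary file like a PNG; the exact error class you hit
depends on the string operation):

  # Ruby 1.9, "b" flag forgotten: data is tagged with the locale encoding
  data = File.open("image.png") { |f| f.read }
  data.encoding          # => e.g. #<Encoding:UTF-8>
  data.valid_encoding?   # => most likely false for binary data
  data =~ /\w+/          # raises ("invalid byte sequence ..." or similar)

  # Ruby 1.9, with the "b" flag: plain bytes, nothing to go wrong
  data = File.open("image.png", "rb") { |f| f.read }
  data.encoding          # => #<Encoding:ASCII-8BIT>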
It has to default to some encoding.
That's where I also disagree. It can default to a stream of bytes.
File.open("....", :encoding => "ENV") # Follow the environment
This is the default.
That's what I don't want. Given this default, I must either:
That's assuming the default is always wrong.
(1) Force all my source to have the correct encoding flag set
everywhere. If I don't test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they'd have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check that no exception is raised.
(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.
Encoding.default_external=
I guess I can use that at the top of everything in the bin/ directory. It
may be sufficient, but it's annoying to have to remember that too.
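Something like this at the top of each script, I mean (untested
sketch; pick whatever encoding the application actually wants):

  #!/usr/bin/env ruby
  # Pin the external encoding instead of trusting the user's locale.
  Encoding.default_external = Encoding::UTF_8
  # ... rest of the script; File.open etc. now default to UTF-8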
Here's why the default is good, IMO. The cases where I really
don't want to explicitly specify encodings are when I write one-liners
(-e) and short throwaway scripts. If the default encoding were binary,
string operations would deal incorrectly with German (my native language)
accents. Using the locale encoding does the right thing here.
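For example (my quick illustration, assuming a UTF-8 locale):

  $ echo -n "Grüße" | ruby -e 'p $stdin.read.length'
  5

That's 5 characters, because $stdin follows the locale encoding; with
a binary default the same one-liner would report 7, the byte count.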
If I write a longer program, explicitly setting the default external
encoding isn't an effort worth mentioning. Set it to ASCII_8BIT
and it behaves like 1.8.
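For example (sketch from memory):

  # near the top of the program
  Encoding.default_external = Encoding::ASCII_8BIT

or equivalently from the command line:

  $ ruby -E ASCII-8BIT script.rb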
Stefan
···
2009/2/16 Brian Candler <b.candler@pobox.com>: