Ruby 1.9.1: Encoding trouble: broken US-ASCII String

Hi,

Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I have the following files:

testEncoding.rb:
#!/usr/bin/env ruby
# encoding: ISO-8859-1

p __ENCODING__

text = File.read("text.txt")
text.each_line do |line|
    p line =~ /foo/
end

text.txt:
Foo äöü bar.

I use: ruby 1.9.1 (2008-12-01 revision 20438) [i386-cygwin]

If I run: ruby19 testEncoding.rb, I get:
#<Encoding:ISO-8859-1>
testEncoding.rb:8:in `block in <main>': broken US-ASCII string
(ArgumentError)

Ruby detects the encoding line but suspects the text file to be 7bit
ascii nevertheless. The source file encoding is only respected if I
add the command line option -E ISO-8859-1. I could also set the
encoding explicitly for each string but ...
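
The distinction can be reproduced without any file: the magic comment only tags literals in the source, while data read from outside gets the default external encoding. A minimal sketch, assuming the three umlauts in text.txt are stored as the Latin-1 bytes E4 F6 FC (the thread never shows a hexdump, so that is an assumption):

```ruby
# Raw bytes of the sample line, assuming Latin-1 storage (a-, o-, u-umlaut).
raw = "Foo \xE4\xF6\xFC bar.\n".dup.force_encoding("BINARY")

# Same bytes, two different labels.
as_ascii  = raw.dup.force_encoding("US-ASCII")
as_latin1 = raw.dup.force_encoding("ISO-8859-1")

p as_ascii.valid_encoding?    # false: bytes above 0x7F are not US-ASCII
p as_latin1.valid_encoding?   # true
p(as_latin1 =~ /bar/)         # 8
```

A string that fails valid_encoding? is what the "broken US-ASCII string" error is complaining about.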

I found some hints that the default charset for external sources is
deduced from the locale. So I set LANG to de_AT, de_AT.ISO-8859-1 and
some more variants to no avail.

How exactly is this supposed to work? What other options do I have to
make ASCII8BIT or Latin-1 the default encoding without having to
supply an extra command-line option and without having to rely on an
environment variable? Why isn't ASCII8BIT the default in the first
place? Why isn't __ENCODING__ a global variable I can assign a value
to?

Thanks,
Thomas.

Tom Link wrote:

Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I asked the same over at ruby-core recently. There were some useful
replies:

http://www.ruby-forum.com/topic/173179#759661

But the upshot is that this is all pretty much undocumented so far.
(Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
that yet)

text = File.read("text.txt")

This should work:

text = File.read("text.txt", :encoding=>"ISO-8859-1")

I still don't know how the default is worked out though.
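
For reference, a self-contained sketch of that call. The temporary file is a hypothetical stand-in for text.txt, and its bytes assume Latin-1 content:

```ruby
require "tempfile"

# Hypothetical stand-in for text.txt: one Latin-1 line written as raw bytes.
sample = Tempfile.new("text")
sample.binmode
sample.write("Foo \xE4\xF6\xFC bar.\n".dup.force_encoding("BINARY"))
sample.close

# The :encoding option labels the data as it is read.
text = File.read(sample.path, :encoding => "ISO-8859-1")
p text.encoding.name                         # "ISO-8859-1"
p text.valid_encoding?                       # true
text.each_line { |line| p(line =~ /bar/) }   # prints 8

sample.unlink
```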

Regards,

Brian.

--
Posted via http://www.ruby-forum.com/.

text = File.read("text.txt", :encoding=>"ISO-8859-1")

Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.

Many thanks for the pointer to the other thread over at ruby core.

Regards,
Thomas.

The Pickaxe does cover a lot of the new encoding behavior.

James Edward Gray II

On Dec 15, 2008, at 6:10 AM, Brian Candler wrote:

But the upshot is that this is all pretty much undocumented so far.
(Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
that yet)

Tom Link wrote:

text = File.read("text.txt", :encoding=>"ISO-8859-1")

Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.

If all else fails, read the source.

I see that the encoding falls back to rb_default_external_encoding(),
which returns default_external, setting it if necessary from
rb_enc_from_index(default_external_index)

This in turn is set from rb_enc_set_default_external

This in turn is set from cmdline_options.ext.enc.name

And this in turn is set from the -E flag (or certain legacy settings on
-K). So:

$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

Yay. However, if it is possible to set the default external encoding
programmatically (i.e. not via the command line options) I couldn't see
how.

Brian Candler wrote:

$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

D'oh. I see from original post that you knew this already.

It seems that Ruby keeps state for:
- default external encoding (e.g. for files being read in)
- default internal encoding (not sure what this is, you can set using -E
too but it defaults to nil)

and these are independent from the encodings of source files, which use
the magic comments to declare their encoding.

You can read these using Encoding.default_external and
Encoding.default_internal, but there don't appear to be setters for
them.
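
Later Ruby releases do expose setters, Encoding.default_external= and Encoding.default_internal=. Whether the build discussed in this thread already has them is not something the thread confirms. A minimal sketch on a current Ruby:

```ruby
saved = Encoding.default_external

# Process-wide change: affects every IO opened without an explicit encoding.
Encoding.default_external = "ISO-8859-1"
during = Encoding.default_external
p during                        # the Latin-1 encoding object
p Encoding.default_internal     # usually nil unless -E or -U set it

Encoding.default_external = saved   # restore the previous setting
```

Because the setting is global, it is best done once at program start rather than inside library code.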

Ah, there is a preview here:

http://books.google.co.uk/books?id=jcUbTcr5XWwC&pg=PA359&lpg=PA359&dq=ruby+internal+encoding&source=web&ots=fHCpudaxhB&sig=iJ8JSJsNQV_t1KhZhHQqgjBfTuU&hl=en&sa=X&oi=book_result&resnum=4&ct=result#PPA358,M1

Something like this may do the trick:

text = File.open("..") do |f|
  f.set_encoding("ISO-8859-1") rescue nil
  f.read
end

But then you may as well just do:

text.force_encoding("ISO-8859-1") rescue nil
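
A variant of the same idea that avoids swallowing unrelated errors with "rescue nil": test for the method instead. The helper name tag_latin1 is made up for this sketch:

```ruby
# Hypothetical helper: label +str+ as Latin-1 on 1.9+, leave it alone on 1.8,
# where String#force_encoding does not exist.
def tag_latin1(str)
  str.force_encoding("ISO-8859-1") if str.respond_to?(:force_encoding)
  str
end

text = tag_latin1("Foo \xE4\xF6\xFC bar.\n".dup)
p text.valid_encoding? if text.respond_to?(:valid_encoding?)
p(text =~ /bar/)
```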

I'm not sure in which way the regexp is incompatible with the data read.
I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

I can't really replicate without a hexdump of your text.txt. But it
would be interesting to see the result of:

text.each_line do |line|
    p line.encoding
    p /foo/.encoding
    p line =~ /foo/
end

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :(

Default internal is the encoding IO objects will transcode incoming data into, by default. So you could set this for UTF-8 and then read from various different encodings (specifying each type in the open() call), but only work with Unicode in your script.
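
The same transcoding can be requested per file instead of through the default. A small sketch, assuming a Latin-1 input file (created here as a temporary file for illustration):

```ruby
require "tempfile"

sample = Tempfile.new("latin1")
sample.binmode
sample.write("Foo \xE4\xF6\xFC bar.\n".dup.force_encoding("BINARY"))
sample.close

# "r:EXT:INT" reads the bytes as EXT and hands the script strings in INT.
text = File.open(sample.path, "r:ISO-8859-1:UTF-8") { |f| f.read }
p text.encoding.name                        # "UTF-8"
p text == "Foo \u00E4\u00F6\u00FC bar.\n"   # true

sample.unlink
```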

James Edward Gray II

On Dec 15, 2008, at 7:41 AM, Brian Candler wrote:

- default internal encoding (not sure what this is, you can set using -E
too but it defaults to nil)

I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

It does:

$ ruby_dev -e 'p "résumé".encode("ISO-8859-1") =~ /foo/'
nil
$ ruby_dev -e 'p "résumé foo".encode("ISO-8859-1") =~ /foo/'
7

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

That's what BINARY means.

Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :(

Perhaps it's a bit early to make this judgement since you've just started learning about the new system?

There's a lot going on here, so it's a lot to take in. In places, the behavior is a little complex. However, the core team has put a lot of effort into making the system easier to use. It's getting there.

Also, even in its current draft form, the Pickaxe answers every question you've thrown at both mailing lists. Thus it should be a big help when you decide the time is right to pick it up.

James Edward Gray II

On Dec 15, 2008, at 7:55 AM, Brian Candler wrote:

James Gray wrote:

I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

It does:

$ ruby_dev -e 'p "résumé".encode("ISO-8859-1") =~ /foo/'
nil
$ ruby_dev -e 'p "résumé foo".encode("ISO-8859-1") =~ /foo/'
7

I found that too, but was confused by the "broken US-ASCII string"
exception which the OP saw.

I suppose the external_encoding is defaulting to US-ASCII on that
system.

This means his program will break on every file passed into it which has
a character with the top bit set. You can argue that's "failsafe", in
the sense of bombing out rather than continuing processing with the
wrong encoding, and it therefore forces you to change your program or
the command-line args to specify the actual encoding in use.

However, that's pretty unforgiving. I can use Unix grep on a file with
unknown character set or broken UTF-8 characters and it works quite
happily.

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

irb(main):011:0> s = "foo\xff\xff\xffbar".force_encoding("BINARY")
=> "foo\xFF\xFF\xFFbar"
irb(main):012:0> s =~ /foo/
=> 0

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

That's what BINARY means.

On the String side, yes.

I was thinking of an option on the Regexp: /foo/b or somesuch.
(In contrast to /foo/u in 1.8 meaning 'this Regexp matches unicode')

Or can you set BINARY encoding on the Regexp too? I couldn't see
how.
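
For what it's worth, the /n regexp flag appears to be the closest thing to that. A sketch on a current Ruby; whether it behaves the same on the build discussed here is an assumption:

```ruby
s = "foo\xFF\xFF\xFFbar".dup.force_encoding("BINARY")

high_bytes = /\xFF+/n      # the n flag makes the literal byte-oriented
p high_bytes.encoding
p(s =~ high_bytes)         # 3
p(s =~ /bar/n)             # 6
```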

There's a lot going on here, so it's a lot to take in. In places, the
behavior is a little complex. However, the core team has put a lot of
effort into making the system easier to use. It's getting there.

It would have been nice though if the defaults had been chosen so that
they don't break 1.8 scripts -- or use some 8bit clean encoding if the
data contains 8bit wide characters instead of throwing an error.

In article <AC6610E0-BA7A-498A-96E3-853617CAE2CF@grayproductions.net>,
James Gray <james@grayproductions.net> wrote:

Perhaps it's a bit early to make this judgement since you've just
started learning about the new system?

From what I've seen and experimented with 1.9 for a few months, my main gripe
is that the whole encoding support is overly complex. I know m17n is not
solved by the magic unicode wand but I'd love to have a more simple way.

--
Ollivier ROBERT -=- EEC/RIF/SEU -=-
Systems Engineering Unit

I think it's probably more important to get this encoding interface right than to worry about 1.8 compatibility. We knew 1.9 was going to break some things, so the time was right.

Also, if you've been using the -KU switch in Ruby 1.8 and working with UTF-8 data, 1.9 may work pretty well for you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/19552

That's a pretty common "best practice" in the Ruby community, from what I've seen. Even Rails pushes this approach now.

If you have gone this way though, you may want to migrate to the even better -U switch in 1.9.

James Edward Gray II

On Dec 15, 2008, at 9:07 AM, Tom Link wrote:

There's a lot going on here, so it's a lot to take in. In places, the
behavior is a little complex. However, the core team has put a lot of
effort into making the system easier to use. It's getting there.

It would have been nice though if the defaults had been chosen so that
they don't break 1.8 scripts -- or use some 8bit clean encoding if the
data contains 8bit wide characters instead of throwing an error.

The default encoding is pulled from your environment: LANG or LC_CTYPE, I believe. This is very important and it makes simple scripting fit in well with the environment.
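
A quick way to see what Ruby actually picked up from the environment. All four calls are standard, but the values printed depend entirely on the machine:

```ruby
env_settings = ENV.values_at("LC_ALL", "LC_CTYPE", "LANG")
locale_enc   = Encoding.find("locale")

p env_settings                 # what the shell exported (nil where unset)
p Encoding.locale_charmap      # charmap name Ruby derived from the locale
p locale_enc                   # the encoding that name resolved to
p Encoding.default_external    # what IO uses when nothing else is specified
```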

James Edward Gray II

On Dec 15, 2008, at 8:50 AM, Brian Candler wrote:

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

Hi,

From what I've seen and experimented with 1.9 for a few months, my main gripe
is that the whole encoding support is overly complex. I know m17n is not
solved by the magic unicode wand but I'd love to have a more simple way.

The whole picture must be complex, since encoding support itself is
VERY complex indeed. History sucks. But for daily use, just remember
specifying encoding if you are not sure what is the default_encoding,
e.g.

  f = open(path, "r:iso-8859-1")

or

  f = open(path, "r", encoding: "iso-8859-1")

Simple? If you want to convert your data into Unicode every time you
read, just put -U at your shebang (#!) line, in addition.
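
A small check that the two spellings above end up in the same place. This sketch uses File.open on an empty temporary file, which is an illustrative stand-in for path:

```ruby
require "tempfile"

sample = Tempfile.new("latin1")
sample.close

# Encoding given in the mode string versus as a keyword option.
by_mode   = File.open(sample.path, "r:iso-8859-1") { |f| f.external_encoding }
by_option = File.open(sample.path, "r", :encoding => "iso-8859-1") { |f| f.external_encoding }

p by_mode
p by_mode == by_option   # true

sample.unlink
```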

              matz.

In message "Re: ruby 1.9.1: Encoding trouble: broken US-ASCII String" on Tue, 16 Dec 2008 00:47:37 +0900, Ollivier Robert <roberto@REMOVETHIS.eu.org> writes:

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

The default encoding is pulled from your environment: LANG or
LC_CTYPE, I believe. This is very important and it makes simple
scripting fit in well with the environment.

The code seems to say:
- if an encoding is chosen in the environment but is unknown to Ruby,
  use ASCII-8BIT (aka BINARY)
- if Ruby was built on a system where it doesn't know how to ask the
  environment for a language, then use US-ASCII

So I would read from this that the OP has either fallen foul of the
US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
else his environment has explicitly picked US-ASCII.

There must have been a good reason why US-ASCII was chosen, rather than
ASCII-8BIT, for systems without langinfo.h.

Regards,

Brian.

rb_encoding *
rb_locale_encoding(void)
{
    VALUE charmap = rb_locale_charmap(rb_cEncoding);
    int idx;

    if (NIL_P(charmap))
        idx = rb_usascii_encindex();
    else if ((idx = rb_enc_find_index(StringValueCStr(charmap))) < 0)
        idx = rb_ascii8bit_encindex();

    if (rb_enc_registered("locale") < 0) enc_alias("locale", idx);

    return rb_enc_from_index(idx);
}

...

VALUE
rb_locale_charmap(VALUE klass)
{
#if defined NO_LOCALE_CHARMAP
    return rb_usascii_str_new2("ASCII-8BIT");
#elif defined HAVE_LANGINFO_H
    char *codeset;
    codeset = nl_langinfo(CODESET);
    return rb_usascii_str_new2(codeset);
#elif defined _WIN32
    return rb_sprintf("CP%d", GetACP());
#else
    return Qnil;
#endif
}
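
The fallback order in the C code above can be paraphrased in Ruby. This is a reading aid only: locale_encoding_for is a made-up name, and its argument stands in for whatever rb_locale_charmap returned (nil when no source of locale information was compiled in):

```ruby
# Hypothetical Ruby paraphrase of rb_locale_encoding.
def locale_encoding_for(charmap)
  return Encoding::US_ASCII if charmap.nil?   # nothing to ask: US-ASCII
  Encoding.find(charmap)                      # name known to Ruby: use it
rescue ArgumentError
  Encoding::ASCII_8BIT                        # name unknown to Ruby: BINARY
end

p locale_encoding_for(nil)
p locale_encoding_for("ISO-8859-1")
p locale_encoding_for("no-such-charset")
```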

Yukihiro Matsumoto wrote:

The whole picture must be complex, since encoding support itself is
VERY complex indeed. History sucks. But for daily use, just remember
specifying encoding if you are not sure what is the default_encoding,
e.g.

  f = open(path, "r:iso-8859-1")

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

However, I also don't like the unstated assumption that all Strings
contain text.

In RFC2045 (MIME), there is a distinction made between 7bit text, 8bit
text, and binary data.

But if you label a string as "binary", Ruby changes this to
"ASCII-8BIT". I think that is a misrepresentation of that data, if it is
not actually ASCII-based text. I would much rather it made no assertion
about the content than a wrong assertion.

Also, if you've been using the -KU switch in Ruby 1.8 and working with
UTF-8 data, 1.9 may work pretty well for you

Well, I'm still stuck with latin-1. It's interesting though that
according to B Candler the fallback for unknown encodings should be
8-bit clean and that US-ASCII should only be used as a last resort.
Maybe it's just a cygwin thing?

Could we/I please get more information on how exactly the charset is
chosen depending on which environment variable and if this applies for
cygwin too? It appears to me that neither LANG nor LC_CTYPE have any
effect on charset selection. But maybe I'm doing it wrong.

Regards,
Thomas.

You used to have to do that. In recent HEADs, "rb" sets binary encoding automatically (unless overridden).

Dave

On Dec 15, 2008, at 10:16 AM, Brian Candler wrote:

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

Hi,

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

However, I also don't like the unstated assumption that all Strings
contain text.

open(path, "rb") is your friend. It sets encoding to binary.
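
A short sketch of that on a current Ruby. The temporary file and its three bytes are made up for illustration:

```ruby
require "tempfile"

sample = Tempfile.new("blob")
sample.binmode
sample.write("\x00\xFF\xFE".dup.force_encoding("BINARY"))
sample.close

# "rb" alone, with no encoding named, yields binary-tagged strings.
data = File.open(sample.path, "rb") { |f| f.read }
p data.encoding == Encoding::BINARY           # true
p Encoding::BINARY == Encoding::ASCII_8BIT    # true: two names, one encoding
p data.bytes                                  # [0, 255, 254]

sample.unlink
```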

In message "Re: ruby 1.9.1: Encoding trouble: broken US-ASCII String" on Tue, 16 Dec 2008 01:16:55 +0900, Brian Candler <b.candler@pobox.com> writes:

In RFC2045 (MIME), there is a distinction made between 7bit text, 8bit
text, and binary data.

But if you label a string as "binary", Ruby changes this to
"ASCII-8BIT". I think that is a misrepresentation of that data, if it is
not actually ASCII-based text. I would much rather it made no assertion
about the content than a wrong assertion.