A question about Ruby 1.9's "external encoding"

I have the following program:

  p Encoding.default_external
  File.open('testing', 'w') do |f|
    p f.external_encoding
  end

and when I run it I get the following output:

  #<Encoding:UTF-8>
  nil

In other words, the file's "external encoding" is nil. What does this
mean? Shouldn't this be "UTF-8", the default external encoding?

BTW, "ruby1.9.1 -v" gives me:

  ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

I'm using Ubuntu 10.04.1, and that's the latest version of Ruby
1.9.1 available for it.

···

--
Posted via http://www.ruby-forum.com/.

--------------------------------------------------- IO#external_encoding
      io.external_encoding => encoding

      From Ruby 1.9.1

···

On 03/20/2011 01:38 AM, Albert Schlef wrote:

I have the following program:

   p Encoding.default_external
   File.open('testing', 'w') do |f|
     p f.external_encoding
   end

and when I run it I get the following output:

   #<Encoding:UTF-8>
   nil

In other words, the file's "external encoding" is nil. What does this
mean? Shouldn't this be "UTF-8", the default external encoding?

------------------------------------------------------------------------
      Returns the Encoding object that represents the encoding of the
      file. If io is write mode and no encoding is specified, returns
      +nil+.

I'd say it means that the default encoding is used.

BTW, "ruby1.9.1 -v" gives me:

   ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

I'm using Ubuntu 10.04.1, and that's the latest version of Ruby
1.9.1 available for it.

irb(main):001:0> Encoding.default_external
=> #<Encoding:UTF-8>
irb(main):002:0> Encoding.default_internal
=> nil
irb(main):003:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):004:0> File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):005:0>

Apparently the file *is* encoded in UTF-8 because I can read it without errors and get what I expect.

Kind regards

  robert

Albert Schlef wrote in post #988363:

I have the following program:

  p Encoding.default_external
  File.open('testing', 'w') do |f|
    p f.external_encoding
  end

and when I run it I get the following output:

  #<Encoding:UTF-8>
  nil

In other words, the file's "external encoding" is nil. What does this
mean? Shouldn't this be "UTF-8", the default external encoding?

Depends what you mean by "shouldn't be". The rules for encodings in ruby
1.9 are (IMO) arbitrary and inconsistent.

In the case of external encodings: yes, they default to nil for files
opened in write mode. This means that no transcoding is done on output.
For example, if you have a String which happens to contain binary, or
ISO-8859-1, it will be written out unchanged (i.e. the sequence of bytes
in the String is the same sequence of bytes which will end up in the
file).

If you want to transcode on output, you have to set the external
encoding explicitly.
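A minimal sketch of that (the filename "demo_latin1" is just a throwaway example): putting an explicit external encoding in the mode string makes the IO layer transcode on the way out.

```ruby
# Sketch: an explicit external encoding in the mode string makes Ruby
# transcode on output. "demo_latin1" is a made-up example filename.
s = "aä"                                   # UTF-8 source string ("ä" = bytes C3 A4)
File.open("demo_latin1", "w:ISO-8859-1") do |f|
  f.write(s)                               # transcoded UTF-8 -> ISO-8859-1 on write
end
p File.binread("demo_latin1").bytes.to_a   # => [97, 228] -- "ä" became the single byte 0xE4
```

Without the `:ISO-8859-1` part, the UTF-8 bytes [97, 195, 164] would have been written unchanged.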

Since none of this is documented anywhere officially, I attempted to
reverse engineer it. I've documented about 200 behaviours here:

For my own code, I still use ruby 1.8 exclusively.

···


Robert K. wrote in post #988404:

--------------------------------------------------- IO#external_encoding
      io.external_encoding => encoding

      From Ruby 1.9.1
------------------------------------------------------------------------
      Returns the Encoding object that represents the encoding of the
      file. If io is write mode and no encoding is specified, returns
      +nil+.

I'd say it means that the default encoding is used.

No, it doesn't.

Apparently the file *is* encoded in UTF-8 because I can read it without
errors

ruby 1.9 does not give errors if you read a file which is not UTF-8
encoded while the external encoding is UTF-8. You will just get strings
with valid_encoding? false.

It will give errors if you attempt UTF-8 regexp matches on the data
though.

The rules for which methods give errors and which don't are pretty odd.
For example, string[n] doesn't give an exception, even if the string is
invalid.
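A quick sketch of that behaviour, writing one raw ISO-8859-1 byte and reading it back as UTF-8 ("demo_invalid" is a throwaway filename):

```ruby
# Sketch: invalid bytes are read in without complaint; only operations
# that actually need to decode the string raise.
File.open("demo_invalid", "wb") { |f| f.write("a\xE4\n") }  # 0xE4 is not valid UTF-8
t = File.open("demo_invalid", "r:UTF-8") { |f| f.read }     # no error here
p t.valid_encoding?                                         # => false
p t[0]                                                      # => "a" -- indexing still works
begin
  t =~ /a/                                                  # a regexp match forces a decode
rescue ArgumentError => e
  p e.message                                               # invalid byte sequence in UTF-8
end
```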

···


Robert K. wrote in post #988404:

--------------------------------------------------- IO#external_encoding
       io.external_encoding => encoding

       From Ruby 1.9.1
------------------------------------------------------------------------
       Returns the Encoding object that represents the encoding of the
       file. If io is write mode and no encoding is specified, returns
       +nil+.

I'd say it means that the default encoding is used.

No, it doesn't.

So, which encoding is used then? An encoding *has* to be used because you cannot write to a file without a particular encoding. There needs to be a defined mapping between character data and bytes in the file.

Apparently the file *is* encoded in UTF-8 because I can read it without
errors

ruby 1.9 does not give errors if you read a file which is not UTF-8
encoded while the external encoding is UTF-8. You will just get strings
with valid_encoding? false.

I could see in the console that the file was read properly. Also:

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true
irb(main):004:0>

It will give errors if you attempt UTF-8 regexp matches on the data
though.

The rules for which methods give errors and which don't are pretty odd.
For example, string[n] doesn't give an exception, even if the string is
invalid.

I would concede that encodings in Ruby are pretty complex. It's easier in Java, where a String never has a particular encoding and only reading and writing use encodings. However, Java's Strings were not capable of handling all Asian symbols, as I have learned on this list. Since 1.5 they have managed to increase the range of Unicode code points which can be covered - at the cost of making String handling a mess:

http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePointAt(int)

Now suddenly String.length() no longer returns the length in real characters (code points) but rather the length in chars. I figure Ruby's solution might not be so bad after all.

Kind regards

  robert

···

On 20.03.2011 14:19, Brian Candler wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert K. wrote in post #988429:

I'd say it means that the default encoding is used.

No, it doesn't.

So, which encoding is used then?

None.

An encoding *has* to be used because
you cannot write to a file without a particular encoding.

Untrue. In Unix, read() and write() just work on sequences of bytes, and
have no concept of encoding.

Perhaps you are thinking of a language like Python 3, where there is a
distinction between "characters" and "bytes representing those
characters" (maybe Java has that distinction too, I don't know enough
about Java to say)

In ruby 1.9, every String is a bunch of bytes plus an encoding tag. When
you write this out to a file, and the external encoding is nil, then
just the bytes are written, and the encoding is ignored.
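The two halves of that model can be poked at directly: `force_encoding` only swaps the tag, while `encode` rewrites the bytes (a minimal sketch):

```ruby
# Sketch: a String is bytes + encoding tag. force_encoding changes only
# the tag; encode actually transcodes the bytes.
s = "ä"                                        # UTF-8, bytes C3 A4
p s.bytes.to_a                                 # => [195, 164]

relabeled = s.dup.force_encoding("ISO-8859-1")
p relabeled.bytes.to_a                         # => [195, 164] -- same bytes, new tag
p relabeled.length                             # => 2 -- now read as two Latin-1 chars

transcoded = s.encode("ISO-8859-1")
p transcoded.bytes.to_a                        # => [228] -- bytes rewritten
p transcoded.length                            # => 1
```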

I could see in the console that the file was read properly.

What you see in the console in irb does not necessarily mean much in
ruby 1.9, because STDOUT.external_encoding is nil by default too.

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true

Now, that's more complex, and *does* show that the data is valid UTF-8.
(I wasn't arguing that it wasn't; I was arguing that your logic was
flawed, because even if the data were not valid UTF-8, your program
would have run without raising an error. Therefore the fact that it runs
without error is insufficient to show that the data is valid UTF-8)

[In Java]

Now suddenly String.length() no longer returns the length in real
characters (code points) but rather the length in chars. I figure,
Ruby's solution might not be so bad after all.

Of course, even in Unicode, the number of code points is not necessarily
the same as the number of glyphs or "printable characters".

···

On 20.03.2011 14:19, Brian Candler wrote:


Robert K. wrote in post #988429:

I'd say it means that the default encoding is used.

No, it doesn't.

So, which encoding is used then?

None.

Even if no encoding is applied explicitly, some encoding must still be
in effect (see below).

An encoding *has* to be used because
you cannot write to a file without a particular encoding.

Untrue. In Unix, read() and write() just work on sequences of bytes, and
have no concept of encoding.

Perhaps you are thinking of a language like Python 3, where there is a
distinction between "characters" and "bytes representing those
characters" (maybe Java has that distinction too, I don't know enough
about Java to say)

In ruby 1.9, every String is a bunch of bytes plus an encoding tag. When
you write this out to a file, and the external encoding is nil, then
just the bytes are written, and the encoding is ignored.

Which basically means that the string's own encoding is used. If you
have a number of bytes and want to interpret them as characters you
must use an encoding, even if it is plain 7-bit ASCII and no
conversion is going on. There is no such thing as a text file without
an encoding, whether applied explicitly or not. On one side there are
bytes and on the other side there are character codes (or Unicode code
points).

I could see in the console that the file was read properly.

What you see in the console in irb does not necessarily mean much in
ruby 1.9, because STDOUT.external_encoding is nil by default too.

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true

Now, that's more complex, and *does* show that the data is valid UTF-8.
(I wasn't arguing that it wasn't; I was arguing that your logic was
flawed, because even if the data were not valid UTF-8, your program
would have run without raising an error. Therefore the fact that it runs
without error is insufficient to show that the data is valid UTF-8)

So what we learn here is that since my original string had encoding
UTF-8 the encoding of the file happened to be UTF-8 as well. That
basically means that by accident we can get a file with mixed encoding
content. Shudder.

Here's the test:

s = "aä"
=> "aä"
s.encoding
=> #<Encoding:UTF-8>
s = s.encode 'ISO-8859-1'
=> "a\xE4"
s.encoding
=> #<Encoding:ISO-8859-1>
Encoding.default_external
=> #<Encoding:UTF-8>
$stdout.external_encoding
=> nil
File.open("x","w"){|io| p io.external_encoding; io.puts(s)}
nil
=> nil
t = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "a\xE4\n"
t.encoding
=> #<Encoding:UTF-8>
t.valid_encoding?
=> false
t.length
=> 3

Now let's fix it:

t.force_encoding 'ISO-8859-1'
=> "a\xE4\n"
t.encoding
=> #<Encoding:ISO-8859-1>
t.valid_encoding?
=> true

Output:

$stdout.external_encoding
=> nil
$stdout.puts t
a▒
=> nil
$stdout.set_encoding($stdin.external_encoding)
=> #<IO:<STDOUT>>
$stdout.external_encoding
=> #<Encoding:UTF-8>
$stdout.puts t

=> nil

For me this boils down to these rules:

1. Strings are sequences of bytes

2. Strings have an associated encoding which does not need to match
the actual encoding of the binary content

3. In the absence of a target encoding (external or internal,
depending on direction), IO operations use a String's binary data as
is; otherwise they try to convert between encodings and raise an error
if that is not possible.
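Rule 3's "otherwise" branch can be sketched too: give the file an explicit external encoding and the IO layer transcodes, raising when a character has no mapping ("demo_rule3" is a throwaway filename):

```ruby
# Sketch of rule 3: with an explicit external encoding, output is
# transcoded; characters without a mapping in the target raise.
File.open("demo_rule3", "w:ISO-8859-1") { |f| f.write("aä") }   # fine: ä exists in Latin-1
p File.binread("demo_rule3").bytes.to_a                         # => [97, 228] -- transcoded

begin
  File.open("demo_rule3", "w:ISO-8859-1") { |f| f.write("€") }  # no € in Latin-1
rescue Encoding::UndefinedConversionError => e
  p e.class                                                     # => Encoding::UndefinedConversionError
end
```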

Cheers

robert

···

On Sun, Mar 20, 2011 at 6:39 PM, Brian Candler <b.candler@pobox.com> wrote:

On 20.03.2011 14:19, Brian Candler wrote:


Is it just me or does this, especially point 2, sound highly confusing
if not dangerous?

- Markus

···

On 21.03.2011 16:00, Robert Klemme wrote:

For me this boils down to these rules:

1. Strings are sequences of bytes

2. Strings have an associated encoding which does not need to match
the actual encoding of the binary content

3. In the absence of a target encoding (external or internal,
depending on direction), IO operations use a String's binary data as
is; otherwise they try to convert between encodings and raise an error
if that is not possible.

The rule as such is pretty clear IMHO. It does not meet "naive"
expectations and as such probably violates POLS (although Matz's
expectations are almost certainly different from ours - especially
since his native language has a much richer set of symbols than
Western languages).

What *I* find slightly puzzling is this:

irb(main):001:0> s1 = "a"
=> "a"
irb(main):002:0> s1.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s2 = s1.encode 'ISO-8859-1'
=> "a"
irb(main):004:0> s2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):005:0> s1 == s2
=> true
irb(main):006:0> s1.eql? s2
=> true
irb(main):007:0> [s1.hash, s2.hash]
=> [1003075638, 1003075638]
irb(main):008:0> [s1.hash, s2.hash].uniq
=> [1003075638]
irb(main):009:0> s1.encoding == s2.encoding
=> false

Apparently only the byte representation is used for equivalence checks
and the encoding is ignored. I guess this is a pragmatic optimization
for speed since

1. string comparisons are _very_ frequent

2. often strings with different encodings also have different binary
representations (the fact that UTF-8 and ISO-8859-1 share the common
7-bit ASCII subset might be viewed as a special case).

irb(main):010:0> s1 = "ä"
=> "ä"
irb(main):011:0> s2 = s1.encode 'ISO-8859-1'
=> "\xE4"
irb(main):012:0> s1 == s2
=> false
irb(main):013:0> s1.eql? s2
=> false
irb(main):014:0> [s1.hash, s2.hash].uniq
=> [-276501091, 359342273]

If you included the encoding in the equivalence check, "s1 == s2"
would yield false in the first case (IRB line 005) although both
strings actually represent the same character sequence. The proper
solution, of course, would be to compare the two strings at the
character level; but since that would require decoding the byte
sequences, performance would suffer, and we would collide with item 1
above.

I think you can write proper locale-aware programs in Ruby (mostly by
specifying internal and external encodings). But, as in all
languages, you must be aware of the fact that you need to explicitly
deal with encodings. The fact remains that i18n is a complex topic
because human cultures and languages are so vastly different. And the
complexity does not go away because it is inherent in the matter - no
matter what technical solutions you invent. Given that, the possible
discrepancy between the byte data and the encoding (which manifests
itself in the existence of String#valid_encoding?) does look a lot
smaller already. :-)

For even more information and detail I recommend James's excellent article at

And there's more to be found here

Oh, and while we're at it, maybe we should add a method like this to String:

class String
  def ensure_encoding
    raise Encoding::InvalidByteSequenceError, "Wrong encoding for %p" % self unless valid_encoding?
    self
  end
end

Then we can do something like

puts s.ensure_encoding.length

or other String operations and be sure that the encoding is proper.
Does anybody have a better (shorter) name for such a method?
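Exercised standalone (the patch repeated here so the snippet is self-contained):

```ruby
# The String#ensure_encoding sketch from above, plus a quick exercise.
class String
  def ensure_encoding
    raise Encoding::InvalidByteSequenceError, "Wrong encoding for %p" % self unless valid_encoding?
    self
  end
end

p "aä".ensure_encoding.length                  # => 2 -- a valid string passes through
begin
  "a\xE4".force_encoding("UTF-8").ensure_encoding
rescue Encoding::InvalidByteSequenceError => e
  p e.message                                  # mentions the offending string
end
```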

Kind regards

robert

···

On Wed, Mar 23, 2011 at 12:24 AM, Markus Fischer <markus@fischer.name> wrote:

On 21.03.2011 16:00, Robert Klemme wrote:

For me this boils down to these rules:

1. Strings are sequences of bytes

2. Strings have an associated encoding which does not need to match
the actual encoding of the binary content

3. In absence of a target (external or internal, depending on
direction) encoding IO operations use a String's binary data as is,
otherwise they try to convert between encodings and raise an error if
that is not possible.

Is it just me or does this, especially point 2, sound highly confusing
if not dangerous?


Robert K. wrote in post #988839:

What *I* find slightly puzzling is this:

irb(main):001:0> s1 = "a"
=> "a"
irb(main):002:0> s1.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s2 = s1.encode 'ISO-8859-1'
=> "a"
irb(main):004:0> s2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):005:0> s1 == s2
=> true
irb(main):006:0> s1.eql? s2
=> true
irb(main):007:0> [s1.hash, s2.hash]
=> [1003075638, 1003075638]
irb(main):008:0> [s1.hash, s2.hash].uniq
=> [1003075638]
irb(main):009:0> s1.encoding == s2.encoding
=> false

Apparently only the byte representation is used for equivalence checks
and the encoding is ignored.

I don't think this is true:

irb(main):043:0> utf = "\u05D0" # Alef
=> "א"
irb(main):044:0> latin = utf.dup; latin.force_encoding 'ISO-8859-1'
=> "×\x90"
irb(main):045:0> [utf.bytes.to_a, latin.bytes.to_a] # They have the same bytes
=> [[215, 144], [215, 144]]
irb(main):048:0> [utf.valid_encoding?, latin.valid_encoding?] # And are ok
=> [true, true]
irb(main):046:0> utf == latin # But they aren't equal
=> false

In your case it's good the strings are considered equal: we want to know
if the letters are all the same. "a" is "a"... no matter what encoding.

···


Robert K. wrote in post #988839:

What *I* find slightly puzzling is this:

irb(main):001:0> s1 = "a"
=> "a"
irb(main):002:0> s1.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s2 = s1.encode 'ISO-8859-1'
=> "a"
irb(main):004:0> s2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):005:0> s1 == s2
=> true
irb(main):006:0> s1.eql? s2
=> true
irb(main):007:0> [s1.hash, s2.hash]
=> [1003075638, 1003075638]
irb(main):008:0> [s1.hash, s2.hash].uniq
=> [1003075638]
irb(main):009:0> s1.encoding == s2.encoding
=> false

Apparently only the byte representation is used for equivalence checks
and the encoding is ignored.

I don't think this is true:

irb(main):043:0> utf = "\u05D0" # Alef
=> "א"
irb(main):044:0> latin = utf.dup; latin.force_encoding 'ISO-8859-1'
=> "×\x90"
irb(main):045:0> [utf.bytes.to_a, latin.bytes.to_a] # They have the same bytes
=> [[215, 144], [215, 144]]
irb(main):048:0> [utf.valid_encoding?, latin.valid_encoding?] # And are ok
=> [true, true]
irb(main):046:0> utf == latin # But they aren't equal
=> false

Thanks for the interesting example! I noticed:

irb(main):008:0> utf.length
=> 1
irb(main):009:0> latin.length
=> 2

In your case it's good the strings are considered equal: we want to know
if the letters are all the same. "a" is "a"... no matter what encoding.

Turns out the encoding is considered in comparison (read bottom up):

int
rb_str_comparable(VALUE str1, VALUE str2)
{
    int idx1, idx2;
    int rc1, rc2;

    if (RSTRING_LEN(str1) == 0) return TRUE;
    if (RSTRING_LEN(str2) == 0) return TRUE;
    idx1 = ENCODING_GET(str1);
    idx2 = ENCODING_GET(str2);
    if (idx1 == idx2) return TRUE;
    rc1 = rb_enc_str_coderange(str1);
    rc2 = rb_enc_str_coderange(str2);
    if (rc1 == ENC_CODERANGE_7BIT) {
        if (rc2 == ENC_CODERANGE_7BIT) return TRUE;
        if (rb_enc_asciicompat(rb_enc_from_index(idx2)))
            return TRUE;
    }
    if (rc2 == ENC_CODERANGE_7BIT) {
        if (rb_enc_asciicompat(rb_enc_from_index(idx1)))
            return TRUE;
    }
    return FALSE;
}

/* expect tail call optimization */
static VALUE
str_eql(const VALUE str1, const VALUE str2)
{
    const long len = RSTRING_LEN(str1);

    if (len != RSTRING_LEN(str2)) return Qfalse;
    if (!rb_str_comparable(str1, str2)) return Qfalse;
    if (memcmp(RSTRING_PTR(str1), RSTRING_PTR(str2), len) == 0)
        return Qtrue;
    return Qfalse;
}

VALUE
rb_str_equal(VALUE str1, VALUE str2)
{
    if (str1 == str2) return Qtrue;
    if (TYPE(str2) != T_STRING) {
        if (!rb_respond_to(str2, rb_intern("to_str"))) {
            return Qfalse;
        }
        return rb_equal(str2, str1);
    }
    return str_eql(str1, str2);
}
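For readers who don't speak C, here is a hypothetical Ruby-level paraphrase of rb_str_comparable above. It maps ENC_CODERANGE_7BIT to String#ascii_only? and rb_enc_asciicompat to Encoding#ascii_compatible?; the method name comparable? is mine, not part of Ruby's API.

```ruby
# Hypothetical paraphrase of rb_str_comparable; not Ruby's real API.
def comparable?(s1, s2)
  return true if s1.empty?                  # RSTRING_LEN(str1) == 0
  return true if s2.empty?                  # RSTRING_LEN(str2) == 0
  return true if s1.encoding == s2.encoding # idx1 == idx2
  if s1.ascii_only?                         # rc1 == ENC_CODERANGE_7BIT
    return true if s2.ascii_only?
    return true if s2.encoding.ascii_compatible?
  end
  return true if s2.ascii_only? && s1.encoding.ascii_compatible?
  false
end

p comparable?("a", "a".encode("ISO-8859-1"))  # => true  (7-bit content)
p comparable?("ä", "ä".encode("ISO-8859-1"))  # => false (different encodings, non-ASCII)
```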

Now, everything is clear. ;-)

Cheers

robert

···

On Wed, Mar 23, 2011 at 12:59 PM, Albert Schlef <albertschlef@gmail.com> wrote:
