Austin Ziegler wrote in post #1061436:
> This is *not* a Ruby problem, this is a *data* problem.
Leaving aside the point that not all data is text, you still need a
clear conceptual model to be able to reason about your program.
In Python 3, there is a clear distinction between "characters" and "a
sequence of bytes which encode those characters". They are two
completely different classes and cannot be combined (e.g. a+b will
always fail if a is str and b is bytes). It's also symmetrical: you
convert from bytes to characters as text enters your program, and from
characters to bytes as text leaves it.
(Aside: I know that Python only supports Unicode characters, but this is
just an implementation limitation. There could be a third class
"gb2312str" if desired, and additional classes for other character sets
which are not subsets of Unicode)
Ruby muddles these concepts by having all strings be a sequence of bytes
plus the encoding, which in turn muddles the concepts of "character set"
and "a method of encoding that character set".
Now, you could argue that Ruby is actually implementing the Python 3
approach but in a "lazy" way: by not explicitly converting bytes to
characters until required, it avoids potentially unnecessary work. But
if so, it's half-baked. For example, you cannot combine a UTF-16LE
string with a UTF-16BE string, even though they are the same character
set (Unicode). What's worse is that a UTF-16LE string will sort
differently from a UTF-16BE string (because ruby 1.9 sorts by byte
ordering, which happens to work for UTF-8 but not for all other
encodings of Unicode). So it kind of behaves like a string of
characters, except that it doesn't.
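Here's what that looks like in irb (my session, not Austin's):

  le = "AB".encode("UTF-16LE")
  be = "AB".encode("UTF-16BE")

  le + be
  # Encoding::CompatibilityError: incompatible character encodings:
  #   UTF-16LE and UTF-16BE

  le.bytes.to_a                  #=> [65, 0, 66, 0]
  be.bytes.to_a                  #=> [0, 65, 0, 66]

Same characters, same character set, yet byte-wise comparison (and
hence sorting) sees two unrelated strings.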
Furthermore, ruby sometimes lets you combine objects representing
"characters" and "bytes", or "characters with encoding A" and
"characters with encoding B". Whether it is allowed or not depends on
the run-time contents of those objects.
If a = b + c *always* crashed when b and c had different encodings, I
would really not have a problem with any of this. Your test case would
immediately catch it, you fix it, problem solved.
However ruby 1.9's insidious behaviour means that b + c may *or may not*
crash, depending not only on the encodings but on the actual content of
the strings at that instant. One perfectly reasonable set of tests may
pass; actual application data may fail.
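Concretely, something like this (my own example; the accented literal
assumes a UTF-8 source file):

  utf8   = "résumé"
  ascii  = "from a socket".force_encoding("ASCII-8BIT")
  binary = "\xFF\xFE plus more bytes".force_encoding("ASCII-8BIT")

  utf8 + ascii                   # fine: this binary string happens to be
                                 # 7-bit clean, so it is "compatible"
  utf8 + binary
  # Encoding::CompatibilityError: incompatible character encodings:
  #   UTF-8 and ASCII-8BIT

Same code, same pair of encodings; only the data differs.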
Finally, ruby is asymmetrical. On input, encodings are tagged; on
output, they are ignored (by default). From files, the environment
encoding is used; from sockets, the ASCII-8BIT encoding is used. With
regexps, invalid strings cause an exception; with String# they do not.
It is an utter dog's breakfast of arbitrary rules which you have no
choice but to learn.
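To pick on just one of those rules, here's the kind of thing I mean (my
own example):

  bad = "abc\xFFdef".force_encoding("UTF-8")
  bad.valid_encoding?            #=> false

  bad.length                     #=> 7   the bad byte quietly counts as a character
  bad.upcase                     # no exception; the invalid byte just survives

  bad =~ /def/
  # ArgumentError: invalid byte sequence in UTF-8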
Some people see ruby 1.9's highly complex encoding implementation as a
triumph of engineering; I see it as a design smell.
> Matz and others have worked very hard to make sure that Ruby 1.9 works
> well if you follow certain rules regarding your inputs and outputs.
... which one has to absorb by osmosis. Certainly the core API docs
don't give these rules; in fact they give precious little about the
encoding semantics of String. And you can't get much more of a core part
of the language than String.
Want to find out what String# does when given a string which contains
invalid characters in its declared encoding? The docs won't help you.
Try it and see. Or go to the C source code.
Of course, because every String is now two-dimensional (x = sequence of
bytes, y = Encoding) there is a much higher requirement to document
every method which acts on a string or returns a string, because there
is a much larger variety of inputs and outputs to consider.
Take strings with invalid characters, for example, or the fact that
every returned string also has an encoding and you need to document how
it is chosen. (For example Net::HTTP: does it return strings with
encoding from the Content-Type header? You tell me)
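You can at least go and poke at it yourself; something like this (my
own quick probe, with a placeholder URL) tells you what you actually
got:

  require "net/http"
  require "uri"

  res = Net::HTTP.get_response(URI("http://example.com/"))
  puts res["Content-Type"]       # whatever charset the server claims
  puts res.body.encoding         # what the returned String is tagged with

But that's "try it and see" again, not documentation.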
Incidentally, strings with invalid characters are not an edge case or
only for erroneous input. Ruby encourages you to do things like:
  txt = sock.read(4096)    # txt likely to contain a split character at the end
This could be dealt with if you explicitly converted bytes to characters
at some point (you'd buffer the extra bit; there's a sketch of that
below). By not having this explicit conversion, you are quite likely to
have byte patterns which don't represent *any* character. Yes, you can
do the buffering yourself; I'm just saying that all methods need to
*document* whether they accept strings with invalid bytes, and how they
handle them.
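For what it's worth, the explicit conversion I have in mind looks
roughly like this -- a minimal sketch of my own, assuming UTF-8 on the
wire and input that is merely split at read boundaries rather than
genuinely corrupt (sock is any IO-ish object):

  # Read raw bytes, hand back only whole UTF-8 characters, and keep any
  # trailing partial character in `pending` for the next call.
  def read_text(sock, pending)
    pending << sock.readpartial(4096)        # pending is an ASCII-8BIT buffer
    text = pending.dup.force_encoding("UTF-8")
    3.times do                               # a UTF-8 char is at most 4 bytes,
      break if text.valid_encoding?          # so at most 3 trailing bytes can
      text = text.byteslice(0, text.bytesize - 1)  # belong to a split character
    end
    pending.replace(pending.byteslice(text.bytesize..-1))
    text
  end

  buf = "".force_encoding("ASCII-8BIT")
  chunk = read_text(sock, buf)               # buf carries over the split tail

A dozen lines, but every caller of read/readpartial that wants
characters needs something like it -- or at least needs the library to
tell them whether it has already been done.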
> If you don't respect your encodings, they will bite you. They may not
> bite you up front (as they do with Ruby, because it exposes these
> things which are painful), but they *will* bite you.
Certainly you need to know about character sets and how they are
encoded. This does not imply that ruby does it in a sane way. And as I
said before, if Ruby were to bite you consistently, it would be much
better.
> Ruby got it right, because it acknowledges that (a) this is hard and
> (b) gives you the tools you need in order to make this less painful.
> It also doesn't (c) incorrectly assume that everything is or can be
> expressed safely in Unicode. (Shift-JIS will not roundtrip to Unicode
> and back for some characters.)
That's kind of irrelevant, since ruby 1.9 doesn't really handle
Shift-JIS either, except to transcode it.