[ANN] 1.9 String and M17N documentation

Brian_Candler · 6 August 2009 11:47

I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at

GitHub - candlerb/string19: Runnable documentation of ruby 1.9's M17N properties

There is test code interspersed within the comments, so you can run it
to verify the behaviours described.

P.S.: I've spent enough time working on this that I felt entitled to add
another file, soapbox.rb, with my own opinion on all this. Feel free to
ignore it.

--
Posted via http://www.ruby-forum.com/.

Greg_Brown1 · 6 August 2009 13:25

Clever approach and looks to be a great resource. Thanks for writing this up.

-greg

On Thu, Aug 6, 2009 at 7:47 AM, Brian Candler<b.candler@pobox.com> wrote:

I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at

GitHub - candlerb/string19: Runnable documentation of ruby 1.9's M17N properties

There is test code interspersed within the comments, so you can run it
to verify the behaviours described.

James_Edward_Gray_II · 6 August 2009 15:57

I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at

GitHub - candlerb/string19: Runnable documentation of ruby 1.9's M17N properties

There is test code interspersed within the comments, so you can run it
to verify the behaviours described.

I just wanted to say that I enjoyed reading through what you have created. I think you've shown a neat way to document behaviors, with your comment and code mix. Even your simple alias of assert_equal() to is() really adds to the overall presentation.

I've added a link to this repository in a comment to the first article of my m17n series to help people find it.

It does run for me on Mac OS X, though I do get a warning:

$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
.
Finished in 0.589675 seconds.

1 tests, 202 assertions, 0 failures, 0 errors, 0 skips

I have a few specific comments on the test suite.

* Just FYI, you ask the following about Regexp::FIXEDENCODING:

# FIXME: What is the purpose of this flag?

I do try to explain that under Regexp Encodings in this article, if you are interested:

* I'm not sure this is correct:

# 5. If one object is a String which contains only 7-bit ASCII characters
# (ascii_only?), then the objects are compatible and the result has the
# encoding of the other object.

I believe that's if one Object is a String that's ascii_only?() and the other object has an ASCII compatible Encoding. Here's the case where what you said doesn't seem to work:

$ ruby_dev -e 'p Encoding.compatible?("ascii", "abc".encode("UTF-16BE"))'
nil

* I don't believe this is accurate:

# Normally, writing a string to a file ignores the encoding property.
# However if the internal encoding is set, then the characters are
# transcoded from the internal encoding to the external encoding.

For example:

$ ruby_dev -e 'open("utf8.txt", "w:UTF-8") { |f| f.puts "abc".encode("UTF-16BE") }'
$ ruby_dev -e 'p ARGF.read' utf8.txt
"abc\n"

My understanding is that internal_encoding() is for reading only. When writing, the String#encoding() is the effective internal_encoding().

* I feel sections 22 and 23 are not impartial and need to be moved to soapbox.rb.

P.S.: I've spent enough time working on this that I felt entitled to add
another file, soapbox.rb, with my own opinion on all this. Feel free to
ignore it.

You know I just had to read this.

Seriously, I think you raise interesting points that are worth discussing. It still feels a little quick to pass ultimate judgement without that discussion though. Given that, here are my comments for discussion.

* You always say that, because the encoding system is locale dependent, your code can break when moved to a different environment. That's all true. However, we never say the opposite, which is also true. They made the system locale dependent so it would be possible that some script written to work on local data could be moved to a different environment and work on a different type of data there without being changed. (matz has stated that this choice was mainly to ease scripting.) Obviously, nothing is guaranteed to work, but it is possible for the system to do good as well as evil.

* There are many environment differences in Ruby and other languages that have nothing to do with the encoding engine. I use fork() all the time and it doesn't even exist on Windows. You also mention the "rb" flag used on Windows to stop newline translation in your tests. It's worth noting that the newline translation feature is in Ruby and Perl to help them work with data differences between the different environments. These things have been that way for a long time and I don't hear a lot of complaints about them, though I would love to have fork() on Windows just like Perl does. Also, this isn't limited to Windows. I posted on this list a few months back about some user switching code that worked for me everywhere except on Mac OS X. I'm not saying that any of this is good, but it does exist and we seem to accept it on some level.

* You say that m17n's complexity can be avoided if we just used UTF-8 everywhere and transcoded incoming and outgoing data. I agree. If we do that in Ruby 1.9 though, transcode all data as it comes in and just work with UTF-8 internally, doesn't all the complexity of m17n go away? Compatible encodings, the comparison order of differing encodings, and the like will all be non-issues. Thus it seems to me that m17n allows us to take this favored approach or take harder roads, if we so choose.

James Edward Gray II

On Aug 6, 2009, at 6:47 AM, Brian Candler wrote:

Brian_Candler · 6 August 2009 16:44

James Gray wrote:

I just wanted to say that I enjoyed reading through what you have
created. I think you've shown a neat way to document behaviors, with
your comment and code mix. Even your simple alias of assert_equal()
to is() really adds to the overall presentation.

Thanks James.

It does run for me on Mac OS X, though I do get a warning:

$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
.
Finished in 0.589675 seconds.

Hmm. Could you try replacing 'LANG' with 'LC_ALL' globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.

* I'm not sure this is correct:

# 5. If one object is a String which contains only 7-bit ASCII characters
# (ascii_only?), then the objects are compatible and the result has the
# encoding of the other object.

Thank you, fixed.
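For anyone following along, the corrected rule can be checked directly. This is only a minimal sketch, not taken from string19.rb: the variable names are made up for illustration and it assumes a UTF-8 source file.

```ruby
# encoding: utf-8
# Illustrative names only; this is not code from string19.rb.
ascii  = "abc"                                        # ASCII-only string
latin1 = "gro\xDF".dup.force_encoding("ISO-8859-1")   # 8-bit data, ASCII-compatible encoding
wide   = "abc".encode("UTF-16BE")                     # encoding is not ASCII-compatible

# ASCII-only string vs. a string in an ASCII-compatible encoding:
# compatible, and the result is the other object's encoding.
with_latin1 = Encoding.compatible?(ascii, latin1)

# ASCII-only string vs. a string in a non-ASCII-compatible encoding:
# not compatible, so nil comes back.
with_wide = Encoding.compatible?(ascii, wide)

p with_latin1
p with_wide
```

The second call is the case James pointed out: being ascii_only? is not enough if the other side is something like UTF-16BE.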

* I don't believe this is accurate:

# Normally, writing a string to a file ignores the encoding property.

I think we crossed over on that one. I spotted the error after
re-reading your articles and posted a correction - I think it's right
now.
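As a rough check of the write-side behaviour James described, the sketch below writes a UTF-16BE string to a file opened with a UTF-8 external encoding and reads the raw bytes back. The file name and variable names are only illustrative.

```ruby
require "tmpdir"

utf8_bytes = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "utf8.txt")
  # The String is tagged UTF-16BE; the file's external encoding is UTF-8.
  File.open(path, "w:UTF-8") { |f| f.write "\u00FCber".encode("UTF-16BE") }
  # Read the raw bytes back in binary mode, with no decoding applied.
  utf8_bytes = File.open(path, "rb") { |f| f.read }.bytes.to_a
end

p utf8_bytes
```

If the string's own encoding is used as the source of the conversion, the bytes on disk should be the UTF-8 form of the text rather than the UTF-16BE bytes that were in memory.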

* You say that m17n's complexity can be avoided if we just used UTF-8
everywhere and transcoded incoming and outgoing data. I agree. If we
do that in Ruby 1.9 though, transcode all data as it comes in and just
work with UTF-8 internally, doesn't all the complexity of m17n go
away? Compatible encodings, the comparison order of differing
encodings, and the like will all be non-issues.

Yes, for scripts that process text. And in practice, this is what most
people processing text will find: their source is in their preferred
encoding, their external files are in their preferred encoding, and
everything "just works" - pretty much in the way that ruby 1.8 did with
$KCODE.

I have two key problems.

1. Working with binary. I can force the encoding on my own source files,
and I can force the encoding on any files that I open, but I still have
to interact with other libraries which return strings. If I build a
string by concatenating strings taken from elsewhere, I have to force
the encodings. If I forget, it may work sometimes (if those strings are
7-bit), but will fail if they are 8-bit.

Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.

2. Working with other people's libraries.

Take REXML as an example. Suppose I decide I want to do this:

doc = REXML::Document.new(src)

Under 1.8, I could do this without worrying. But under 1.9, a whole host
of questions tumble out.

- will REXML require me to have set the src to the correct encoding?
- in order to parse it, will it reset the encoding of my 'src' object?
What will it do if 'src' is frozen? Will it dup the string?

XML documents carry their encoding within them. There's the xml charset
declaration, and the BOM, and failing that the document is UTF-8 by
definition, because if it were in a different encoding, then it *must*
declare it:

http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

So I reckon REXML should ignore the encoding of src. Even if it were
tagged as (say) ISO-8859-1 because that's the locale encoding, or
ASCII-8BIT because it came from a socket, it should be treated as UTF-8
unless declared otherwise. And then if I access the node using #text,
would I get something tagged as UTF-8, or something else?

The only way to be sure is to try it and see (and a quick test suggests
that it does work in the way I described).

But this process has to be repeated for every library you might use.
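To make the XML point concrete, here is a deliberately simplified sketch of deciding an XML document's encoding from the bytes alone (BOM first, then the declaration, then the UTF-8 default) while ignoring whatever encoding the String happens to be tagged with. sniff_xml_encoding is a hypothetical helper, not REXML's logic, and it skips cases the XML spec covers, such as UTF-16 without a BOM.

```ruby
# encoding: utf-8
# Hypothetical helper: guesses an XML document's encoding from its bytes,
# ignoring the encoding tag on the String itself.
def sniff_xml_encoding(src)
  head = src.bytes.first(3)
  return "UTF-8"    if head == [0xEF, 0xBB, 0xBF]
  return "UTF-16BE" if head.first(2) == [0xFE, 0xFF]
  return "UTF-16LE" if head.first(2) == [0xFF, 0xFE]

  # No BOM: treat the data as raw bytes and look for an encoding declaration.
  prolog = src.dup.force_encoding("BINARY")[0, 200]
  m = prolog.match(/\A<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']/)
  return m[1].upcase if m

  "UTF-8" # the default when nothing is declared
end

puts sniff_xml_encoding("<?xml encoding='iso-8859-1'?><root>\xfcber</root>")
puts sniff_xml_encoding("<root/>")
```

Whether a given library does something like this, or trusts the String's tag instead, is exactly the thing that has to be tested per library.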

Eric_Hodel1 · 7 August 2009 00:39

I'm too lazy to dig this out of the archives, but there are some encodings that don't have a 1:1 mapping to Unicode thus the round-trip through UTF-8 (etc.) will destroy them.

In short, Ruby doesn't transcode everything to preserve the integrity of your data.

On Aug 6, 2009, at 08:57, James Gray wrote:

* You say that m17n's complexity can be avoided if we just used UTF-8 everywhere and transcoded incoming and outgoing data. I agree. If we do that in Ruby 1.9 though, transcode all data as it comes in and just work with UTF-8 internally, doesn't all the complexity of m17n go away? Compatible encodings, the comparison order of differing encodings, and the like will all be non-issues. Thus it seems to me that m17n allows us to take this favored approach or take harder roads, if we so choose.

Brian_Candler · 12 August 2009 10:13

James Gray wrote:

* Just FYI, you ask the following about Regexp::FIXEDENCODING:

# FIXME: What is the purpose of this flag?

I do try to explain that under Regexp Encodings in this article, if
you are interested:

Gray Soft / Not Found

"A fixed_encoding?() Regexp is one that will raise an
Encoding::CompatibilityError if matched against any String that contains
a different Encoding from the Regexp itself."

I think that's not exactly correct:

$ irb19 --simple-prompt
>> re = /gro/u
=> /gro/
>> re.encoding
=> #<Encoding:UTF-8>
>> re.fixed_encoding?
=> true
>> str = "gro".force_encoding("ISO-8859-1")
=> "gro"
>> re =~ str
=> 0

AFAICS, it will only raise an error if the matched string is of a
different encoding *and* is not ascii_only?

>> str = "gro\xdf".force_encoding("ISO-8859-1")
=> "gro\xDF"
>> re =~ str
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string)
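The same session as a standalone script, for anyone who wants to run it outside irb. This is a sketch with illustrative variable names, assuming a UTF-8 source file.

```ruby
# encoding: utf-8
re = /gro/u                                              # fixed-encoding UTF-8 regexp
ascii_only = "gro".dup.force_encoding("ISO-8859-1")      # different encoding, 7-bit content
eight_bit  = "gro\xDF".dup.force_encoding("ISO-8859-1")  # different encoding, 8-bit content

# A differently tagged but ascii_only? string still matches.
ascii_result = (re =~ ascii_only)

# A differently tagged string with non-ASCII content raises.
error = begin
  re =~ eight_bit
  nil
rescue Encoding::CompatibilityError => e
  e
end

p re.fixed_encoding?
p ascii_result
p error.class
```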

James_Edward_Gray_II · 6 August 2009 18:33

James Gray wrote:

It does run for me on Mac OS X, though I do get a warning:

$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
.
Finished in 0.589675 seconds.

Hmm. Could you try replacing 'LANG' with 'LC_ALL' globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.

I bet the issue is this line in my .bashrc:

export LC_CTYPE=en_US.UTF-8
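A small sketch of why that line would win over LANG=C. effective_ctype_locale is a hypothetical helper written for this post; it only mimics the usual POSIX lookup order for the character-type category (LC_ALL, then LC_CTYPE, then LANG) and is not how Ruby itself reads the locale.

```ruby
# Hypothetical helper: returns the variable that decides the character-type
# locale and its value, using the POSIX order LC_ALL > LC_CTYPE > LANG.
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return [name, value] if value && !value.empty?
  end
  ["(default)", "C"]
end

# LANG=C is overridden by a higher-priority LC_CTYPE from .bashrc.
p effective_ctype_locale("LANG" => "C", "LC_CTYPE" => "en_US.UTF-8")

# Setting LC_ALL overrides both of the others.
p effective_ctype_locale("LANG" => "C", "LC_CTYPE" => "en_US.UTF-8", "LC_ALL" => "C")
```

This is consistent with the suggestion to set LC_ALL rather than LANG in the test suite.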

I have two key problems.

1. Working with binary. I can force the encoding on my own source files,
and I can force the encoding on any files that I open, but I still have
to interact with other libraries which return strings. If I build a
string by concatenating strings taken from elsewhere, I have to force
the encodings. If I forget, it may work sometimes (if those strings are
7-bit), but will fail if they are 8-bit.

Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.

I think I understand what you are saying here. You have a good point that it would be annoying to have the Encoding of the JPEG you are building up change from ASCII-8BIT to UTF-8.

2. Working with other people's libraries.

Take REXML as an example. Suppose I decide I want to do this:

doc = REXML::Document.new(src)

Under 1.8, I could do this without worrying.

Really?

What did it do under Ruby 1.8 when fed an XML document that was UTF-16 encoded? Will it read it? When I do searches for content, will it hand me UTF-16 or UTF-8? These are just some questions that jump to my mind.

As you've said, about the best I can think of is to test it and find out, only this is Ruby 1.8 I'm talking about here.

Let's see how it works:

$ ruby -r rexml/document -e 'REXML::Document.new(ARGF.read)' utf16_with_bom.xml
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse': #<Iconv::InvalidCharacter: "\340\250\274\347\215\257\346\265\245\347\221\241\346\234\276\345\215\257\346\265\245\342\201\203\346\275\256\347\221\245\346\271\264\343\260\257\347\215\257\346\265\245\347\221\241\346\234\276", ["\n"]> (REXML::ParseException)
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `conv'
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `decode'
/usr/local/lib/ruby/1.8/rexml/source.rb:57:in `encoding='
/usr/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:213:in `pull'
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:22:in `parse'
/usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
/usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
-e:1:in `new'
-e:1
...
"\n"
Line:
Position:
Last 80 unconsumed characters:
  <sometag>Some Content</sometag>
  from /usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
  from /usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
  from -e:1:in `new'
  from -e:1

Ah, it just tells me my data is invalid. It's not though:

$ iconv -f UTF-16BE -t UTF-8 < utf16_with_bom.xml
<?xml version="1.0" encoding="UTF-16BE"?>
<sometag>Some Content</sometag>

Ruby 1.9 can read it:

$ ruby_dev -r rexml/document -e 'puts REXML::Document.new(ARGF.read.force_encoding("BINARY")).to_s' utf16_with_bom.xml
<?xml version='1.0' encoding='UTF-16BE'?>
<sometag>Some Content</sometag>

It looks like it's supposed to work in Ruby 1.8 too and I've just hit a bug. At least, if I'm reading the source right. I had to check.

Anyway, the point of all this is that it really isn't any easier, for me, to reason about Ruby 1.8 encoding behavior. Ruby 1.9 didn't invent character encodings, it just started paying attention to them as we all should have been doing all along. That's all my opinion, of course.

James Edward Gray II

On Aug 6, 2009, at 11:44 AM, Brian Candler wrote:

Brian_Candler · 7 August 2009 07:52

Eric Hodel wrote:

I'm too lazy to dig this out of the archives, but there are some
encodings that don't have a 1:1 mapping to Unicode thus the round-trip
through UTF-8 (etc.) will destroy them.

Indeed, although we're both having a hard time thinking of an actual
example. It seems that dealing with such things is not an everyday
requirement for most people. So you write a library for that, and then
the rest of us aren't saddled with the complexity.

James_Edward_Gray_II · 12 August 2009 14:35

Thanks for the correction. I've updated the article you quoted with a correction.

James Edward Gray II

On Aug 12, 2009, at 5:13 AM, Brian Candler wrote:


lim · 6 August 2009 18:53

Ruby 1.9 didn't invent character encodings

Just out of curiosity. Are there other languages that handle encodings
the way ruby 1.9 does?

Brian_Candler · 7 August 2009 08:07

James Gray wrote:

2. Working with other people's libraries.

Take REXML as an example. Suppose I decide I want to do this:

doc = REXML::Document.new(src)

Under 1.8, I could do this without worrying.

Really?

What did it do under Ruby 1.8 when fed an XML document that was UTF-16
encoded? Will it read it? When I do searches for content, will it
hand me UTF-16 or UTF-8? These are just some questions that jump to
my mind.

OK, I didn't write my statement clearly enough.

In ruby 1.8, the question is, "will it parse this document?"

In ruby 1.9, the question is, "will it parse this document, *and* does
the correct parsing depend on which encoding I set the 'src' string to,
and if so, what do I need to set it to?"

Then take a method which returns a string, say REXML::Element.text().
This is a bit simpler.

In ruby 1.8, the question is, "does this return the content of my
element, and has it been transcoded?"

In ruby 1.9, the question is the same, *plus* "what encoding does it set
on that value?"

require 'rexml/document'

=> true

d = REXML::Document.new("<?xml encoding='iso-8859-1'?><root>\xfcber</root>")

=> <UNDEFINED> ... </>

d.elements.first

=> <root> ... </>

d.elements.first.text

=> "über"

d.elements.first.text.encoding

=> #<Encoding:UTF-8>

OK, so it looks like REXML has transcoded to UTF-8, and tagged the
result as such. I'm not really helping my case because you have to do
the same test with 1.8:

require 'rexml/document'

=> true

d = REXML::Document.new("<?xml encoding='iso-8859-1'?><root>\xfcber</root>")

=> <UNDEFINED> ... </>

d.elements[1]

=> <root> ... </>

d.elements[1].text

=> "\303\274ber"

So it's been transcoded here too. But I don't have to worry about what
encoding 'tag' it has been given.

Maybe all this would be much simpler if Ruby didn't crash when given
incompatible encodings, but transcoded the right-hand-side
automatically. For example:

a << b
# a keeps its original encoding, b is transcoded to a's encoding

a.tr("ü","Ü")
# the ü and Ü are transcoded to a's encoding first

- with transcoding to BINARY being a null operation.

Eric_Hodel1 · 11 August 2009 22:45

Eric Hodel wrote:

I'm too lazy to dig this out of the archives, but there are some
encodings that don't have a 1:1 mapping to Unicode thus the round-trip
through UTF-8 (etc.) will destroy them.

Indeed, although we're both having a hard time thinking of an actual
example. It seems that dealing with such things is not an everyday
requirement for most people.

This seems to be similar to the reasoning behind two-digit years.

So you write a library for that, and then the rest of us aren't saddled with the complexity.

Unfortunately, software ends up getting used in places the author didn't expect. Why not write robust software the first time instead of being lazy?

On Aug 7, 2009, at 00:52, Brian Candler wrote:

Brian_Candler · 12 August 2009 09:02

Eric Hodel wrote:

Eric Hodel wrote:

I'm too lazy to dig this out of the archives, but there are some
encodings that don't have a 1:1 mapping to Unicode thus the round-trip
through UTF-8 (etc.) will destroy them.

Indeed, although we're both having a hard time thinking of an actual
example. It seems that dealing with such things is not an everyday
requirement for most people.

This seems to be similar to the reasoning behind two-digit years.

I don't understand what you're getting at. Obviously the round trip
4-digit-years -> 2-digit-years -> 4-digit-years is not lossless, but
that would be a silly thing to do (i.e. if you've captured
4-digit-years, then you store them and work with them as 4-digit-years).

You're saying you want to avoid external->Unicode->external encoding
transcodings. But these are rarely problematic (I've still not seen an
example), and in those rare cases you could just handle the external
encoding as binary data. Remember also that for stateful encodings,
you're forced to transcode anyway - even ruby 1.9 won't handle snippets
of ISO_2022_JP in isolation, for example.
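A short sketch of the stateful-encoding point, assuming the standard ISO-2022-JP transcoder is available in the Ruby build. Ruby flags such encodings as "dummy" (see Encoding#dummy?), so in practice the data is transcoded to something else before it is worked on. Variable names are illustrative.

```ruby
# encoding: utf-8
jis = Encoding.find("ISO-2022-JP")

# Stateful encodings are flagged as dummy: character handling is not
# implemented for strings tagged with them.
dummy = jis.dummy?

hiragana_a = "\u3042"                  # one character in UTF-8
encoded    = hiragana_a.encode(jis)    # wrapped in escape sequences
round_trip = encoded.encode("UTF-8")   # transcode back before processing

p dummy
p encoded.bytes.to_a
p round_trip == hiragana_a
```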

So you write a library for that, and then the rest of us aren't
saddled with the complexity.

Unfortunately, software ends up getting used in places the author
didn't expect. Why not write robust software the first time instead
of being lazy?

In My Opinion (which may not be shared by anyone else), ruby 1.9's
String implementation is anything but robust. It's over-complicated,
under-specified, buggy as hell, and badly gets in the way when you want
to work with binary data or write programs which don't crash when given
unexpected input.

If it were optional, it would be fine. Since it's a mandatory part of
the language, it destroys it for me. Ruby 1.8 is a fine general purpose
language; ruby 1.9 is a text-processing language (and may still trip you
up even in that case)

Regards,

Brian.

On Aug 7, 2009, at 00:52, Brian Candler wrote:

Brian_Candler · 12 August 2009 09:18

BTW, I find James's writeup of what he had to do to the CSV library (*)
enlightening. Even ruby 1.9 won't match an ASCII regexp like /,/ against
a wide encoding, so he had to generate new regexps dynamically at
runtime.
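A sketch of the failure being described, with illustrative names. It does not reproduce James's CSV code (which builds regexps in the data's encoding); it shows only the error and the simpler transcode-first workaround.

```ruby
wide = "a,b".encode("UTF-16LE")

# An ASCII regexp cannot be matched against a string in a wide encoding.
error = begin
  /,/ =~ wide
  nil
rescue Encoding::CompatibilityError => e
  e
end

# One workaround: transcode to UTF-8 first, then use ordinary patterns.
fields = wide.encode("UTF-8").split(",")

p error.class
p fields
```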

Now, I think that's a good thing, optimising the regexps to match the
incoming data stream efficiently. But I also observe that this would
have worked just fine if the encoding were a property of the regexp only
- which is the approach 1.8 takes to regexps. What I mean is, once you
have decided to build a "UTF-16LE" regexp, say, you can just match it
against a stream of bytes.

Making every single String also have an encoding property only gives
more opportunities for Ruby to raise exceptions. Some may argue this is
Ruby "protecting" you from doing something silly, but if I'm working
with string literals or binary data returned from a library, whose
encoding may or may not have been set to ASCII-8BIT, then I don't want
this "protection". Rather, I need protecting against ruby 1.9.

There is only one case I can see where having the encoding be a property
of the String itself is useful: selecting individual characters by
index. e.g.

   if str.size > 50
     str = str[0,47] + "..."
   end

There's a huge amount of language pain introduced just for that.
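For comparison, here is roughly what that truncation looks like with and without the per-String encoding. This is a sketch with made-up variable names, assuming UTF-8 text.

```ruby
# encoding: utf-8
str = "\u00FC" * 60   # 60 characters, 120 bytes in UTF-8

# Character-aware truncation, as in the snippet above.
short = str.size > 50 ? str[0, 47] + "..." : str

# The same slice on a binary-tagged copy counts bytes, so it can cut a
# two-byte character in half.
raw_cut     = str.dup.force_encoding("BINARY")[0, 47]
still_valid = raw_cut.dup.force_encoding("UTF-8").valid_encoding?

p short.size
p raw_cut.bytesize
p still_valid
```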

Regards,

Brian.

(*) http://blog.grayproductions.net/articles/what_ruby_19_gives_us

Greg_Brown1 · 12 August 2009 13:15

I'm not sure what binary data you've been having such great problems
with. Prawn deals with a lot of binary data, and yes, we needed to
make sure that it was being loaded as such and not accidentally
treated as encoded bytes, but I really didn't find this to be a major
undertaking. I guess this is because we didn't need to port over
existing 1.8 code and wrote our implementation with 1.9 in mind, but
maybe I'm missing some big problem that we didn't hit in our use case.

On a personal note, I wish you'd cut out the vitriol, because you're
acting like a jerk. You have learned a lot about the M17n system and
produced valuable resources in the process, and have helped uncover
dark corners and bugs, and for that, the community can be appreciative
of your efforts. But if you manage to make everyone feel miserable
in the process with your abrasive attitude, I don't think that's going
to do anything for anyone.

You've made your feelings about the design decisions very clear. Now
can you maybe stick to the technical details so that these discussions
don't become nasty unnecessarily?

-greg

On Wed, Aug 12, 2009 at 5:02 AM, Brian Candler<b.candler@pobox.com> wrote:

In My Opinion (which may not be shared by anyone else), ruby 1.9's
String implementation is anything but robust. It's over-complicated,
under-specified, buggy as hell, and badly gets in the way when you want
to work with binary data or write programs which don't crash when given
unexpected input.

Eric_Hodel1 · 12 August 2009 17:32

Eric Hodel wrote:

Eric Hodel wrote:

I'm too lazy to dig this out of the archives, but there are some
encodings that don't have a 1:1 mapping to Unicode thus the round-trip
through UTF-8 (etc.) will destroy them.

Indeed, although we're both having a hard time thinking of an actual
example. It seems that dealing with such things is not an everyday
requirement for most people.

This seems to be similar to the reasoning behind two-digit years.

I don't understand what you're getting at.

"dealing with [non 1:1 conversion round trips] is not an everyday requirement for most people" is roughly equivalent to "four-digit years is not an everyday requirement for most people" (or was, back when people were using two-digit years)

You're saying you want to avoid external->Unicode->external encoding
transcodings.

I was stating that this is a design goal of ruby's encoding features. (And likely causes much of the pain you feel in this area.)

But these are rarely problematic (I've still not seen an
example), and in those rare cases you could just handle the external
encoding as binary data.

Agreed. Furthermore, most of the time software is likely to only work within a single encoding.

Remember also that for stateful encodings, you're forced to transcode anyway - even ruby 1.9 won't handle snippets of ISO_2022_JP in isolation, for example.

Software written without this in mind will probably be used this way regardless of the original authors' intent (and will break), much like two-digit-year software did when four-digit years became necessary.

PS: I think you can provide valuable input on how to make ruby's API for encodings more robust and easier to use, but you seem to hate it so much that you can't be bothered to raise issues in a way that will get them fixed.

On Aug 12, 2009, at 02:02, Brian Candler wrote:

On Aug 7, 2009, at 00:52, Brian Candler wrote:

Brian_Candler · 12 August 2009 19:42

Eric Hodel wrote:

PS: I think you can provide valuable input on how to make ruby's API
for encodings more robust and easier to use, but you seem to hate it
so much that you can't be bothered to raise issues in a way that will
get them fixed.

It's not so much "can't be bothered", as "don't believe that a U-turn is
going to happen".

Maybe some bandaids would be accepted (e.g. ASCII-8BIT is compatible
with everything and forces the result to ASCII-8BIT), but I'm hesitant
to propose enlarging the ruleset further.

Greg_Brown1 · 12 August 2009 19:53

I think this is a good change that would at least cause mistakes to
fail faster.

I also suggested a simple binary string syntax on ruby-core, allowing:

%b{GIF} to be shorthand for "GIF".force_encoding("BINARY")

(Though that's admittedly more cosmetic than functionally significant)
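To illustrate what the shorthand would save, here is a sketch of building binary data with explicit force_encoding calls. bin is a hypothetical helper standing in for the proposed %b{} syntax, and the byte values are arbitrary.

```ruby
# encoding: utf-8
# Hypothetical helper standing in for the proposed %b{...} literal.
def bin(str)
  str.dup.force_encoding("BINARY")
end

header  = bin("GIF89a")
payload = bin("\xFF\xD8\xFF")
label   = "\u00FCber"           # UTF-8 text with a non-ASCII character

image = header + payload        # binary + binary: fine

with_ascii = payload + "abc"    # 7-bit text happens to be compatible

# 8-bit binary + non-ASCII UTF-8: this is the case that blows up.
error = begin
  payload + label
  nil
rescue Encoding::CompatibilityError => e
  e
end

fixed = payload + bin(label)    # tagging the text as binary avoids the error

p image.encoding
p error.class
p fixed.bytesize
```

The "works with 7-bit, fails with 8-bit" split is the part that makes a forgotten force_encoding hard to spot in testing.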

A U-Turn is very unlikely to happen, but I imagine Matz will be
receptive to polishing things around the edges.

On Wed, Aug 12, 2009 at 3:42 PM, Brian Candler<b.candler@pobox.com> wrote:

Maybe some bandaids would be accepted (e.g. ASCII-8BIT is compatible
with everything and forces the result to ASCII-8BIT), but I'm hesitant
to propose enlarging the ruleset further.

Brian_Candler · 13 August 2009 09:14

Gregory Brown wrote:

A U-Turn is very unlikely to happen, but I imagine Matz will be
receptive to polishing things around the edges.

I have put a few ideas in a document 'alternatives.markdown' at the same
location.

The other possibility which may make sense is to transcode
automatically. For example, in

s3 = s1 + s2

then s2 is transcoded to s1's encoding, and the result s3 always has
s1's encoding.

That could actually be useful in helping to combine strings from
different sources. All the compatibility rules would vanish, and rather
than raising exceptions, ruby would just "do the right thing".
Transcoding to BINARY/ASCII-8BIT would be a null operation, so building
a binary string would be safe too.
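The proposed semantics can be approximated in user code today. The sketch below is only an illustration of the idea, not a proposal for the actual implementation; transcoding_concat is a hypothetical name. Note that String#encode raises Encoding::UndefinedConversionError when a character has no mapping in the target encoding, which is the lossy case that would need a policy.

```ruby
# encoding: utf-8
# Hypothetical helper: "a + b" where b is transcoded to a's encoding,
# and a binary left-hand side simply takes b's bytes as they are.
def transcoding_concat(a, b)
  if a.encoding == Encoding::BINARY
    a + b.dup.force_encoding(Encoding::BINARY)   # null operation on the bytes
  else
    a + b.encode(a.encoding)                     # may raise if b cannot be represented
  end
end

utf8   = "\u00FCber"
latin1 = "gro\xDF".dup.force_encoding("ISO-8859-1")
binary = "\xFF\xD8".dup.force_encoding("BINARY")

s3 = transcoding_concat(utf8, latin1)    # result keeps utf8's encoding
s4 = transcoding_concat(latin1, utf8)    # result keeps latin1's encoding
s5 = transcoding_concat(binary, utf8)    # result stays binary

p s3.encoding
p s4.bytes.to_a
p s5.bytesize
```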

This isn't a total U-turn, but it would be quite a major shift and I
suspect too big for 1.9.x.

Greg_Brown1 · 13 August 2009 11:40

Yeah, this is also a reasonable behavior, IMO. However, I think Matz
has some reservation about (potentially lossy) transcoding, which is
the reason for the M17N system in the first place. Special casing
for ASCII-8BIT might be more conservative.

-greg

On Thu, Aug 13, 2009 at 5:14 AM, Brian Candler<b.candler@pobox.com> wrote:

That could actually be useful in helping to combine strings from
different sources. All the compatibility rules would vanish, and rather
than raising exceptions, ruby would just "do the right thing".
Transcoding to BINARY/ASCII-8BIT would be a null operation, so building
a binary string would be safe too.

This isn't a total U-turn, but it would be quite a major shift and I
suspect too big for 1.9.x.
