I have put together a document which tries to outline the M17N
properties of ruby 1.9 in a logical sequence and demonstrate the
important behaviours. The file is called string19.rb and you can find it
at
GitHub - candlerb/string19: Runnable documentation of ruby 1.9's M17N properties
There is test code interspersed within the comments, so you can run it
to verify the behaviours described.
I just wanted to say that I enjoyed reading through what you have created. I think you've shown a neat way to document behaviors, with your comment and code mix. Even your simple alias of assert_equal() to is() really adds to the overall presentation.
I've added a link to this repository in a comment to the first article of my m17n series to help people find it.
It does run for me on Mac OS X, though I do get a warning:
$ ruby_dev string19.rb
Loaded suite string19
Started
WARNING: got "UTF-8" as locale_charmap for LANG=C
.
Finished in 0.589675 seconds.
1 tests, 202 assertions, 0 failures, 0 errors, 0 skips
I have a few specific comments on the test suite.
* Just FYI, you ask the following about Regexp::FIXEDENCODING:
# FIXME: What is the purpose of this flag?
I do try to explain that under Regexp Encodings in this article, if you are interested:
* I'm not sure this is correct:
# 5. If one object is a String which contains only 7-bit ASCII characters
# (ascii_only?), then the objects are compatible and the result has the
# encoding of the other object.
I believe that rule only holds when one object is a String that's ascii_only?() and the other object has an ASCII-compatible Encoding. Here's a case where what you said doesn't seem to work:
$ ruby_dev -e 'p Encoding.compatible?("ascii", "abc".encode("UTF-16BE"))'
nil
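A minimal sketch of the same check, contrasting an ASCII-compatible partner with an incompatible one. The variable names are only illustrative, and the strings are built with explicit encodings so the result should not depend on the source encoding:

```ruby
# Build each string with an explicit encoding.
ascii = "ascii".encode("US-ASCII")     # 7-bit only, so ascii_only? is true
utf8  = "caf\u00e9"                    # \u escape yields a UTF-8 String with one non-ASCII character
utf16 = "abc".encode("UTF-16BE")       # UTF-16BE is not an ASCII-compatible encoding

with_utf8  = Encoding.compatible?(ascii, utf8)
with_utf16 = Encoding.compatible?(ascii, utf16)

p ascii.ascii_only?   # true
p with_utf8           # the other object's encoding, UTF-8
p with_utf16          # nil, as in the one-liner above
```

The ascii_only?() String only adopts the other object's encoding when that encoding is ASCII compatible; otherwise Encoding.compatible?() returns nil.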
* I don't believe this is accurate:
# Normally, writing a string to a file ignores the encoding property.
# However if the internal encoding is set, then the characters are
# transcoded from the internal encoding to the external encoding.
For example:
$ ruby_dev -e 'open("utf8.txt", "w:UTF-8") { |f| f.puts "abc".encode("UTF-16BE") }'
$ ruby_dev -e 'p ARGF.read' utf8.txt
"abc\n"
My understanding is that internal_encoding() is for reading only. When writing, the String's own encoding() acts as the effective internal_encoding().
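The same experiment can be written as a single script. This is a sketch only: the temporary directory and file names are made up for illustration, and the byte values assume a Unix-style platform with no newline translation on write.

```ruby
require "tmpdir"

utf16 = "abc".encode("UTF-16BE")
transcoded_bytes = nil
raw_bytes = nil

Dir.mktmpdir do |dir|
  # External encoding given: the String is transcoded from its own
  # encoding (UTF-16BE) to UTF-8 on the way out.
  utf8_path = File.join(dir, "utf8.txt")
  File.open(utf8_path, "w:UTF-8") { |f| f.write utf16 }
  transcoded_bytes = File.open(utf8_path, "rb") { |f| f.read }.bytes.to_a

  # Binary mode: the bytes are written untouched, whatever the
  # String's encoding says.
  raw_path = File.join(dir, "raw.bin")
  File.open(raw_path, "wb") { |f| f.write utf16 }
  raw_bytes = File.open(raw_path, "rb") { |f| f.read }.bytes.to_a
end

p transcoded_bytes   # [97, 98, 99]
p raw_bytes          # [0, 97, 0, 98, 0, 99]
```

No internal encoding is named on the first stream, yet the data is still transcoded, which is consistent with the String's encoding standing in for it on writes.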
* I feel sections 22 and 23 are not impartial and need to be moved to soapbox.rb.
P.S.: I've spent enough time working on this that I felt entitled to add
another file, soapbox.rb, with my own opinion on all this. Feel free to
ignore it.
You know I just had to read this. 
Seriously, I think you raise interesting points that are worth discussing. It still feels a little quick to pass ultimate judgement without that discussion though. Given that, here are my comments for discussion.
* You always say that, because the encoding system is locale dependent, your code can break when moved to a different environment. That's all true. However, we never say the opposite, which is also true. They made the system locale dependent so that a script written to work on local data could be moved to a different environment and work on a different type of data there without being changed. (matz has stated that this choice was mainly to ease scripting.) Obviously, nothing is guaranteed to work, but it is possible for the system to do good as well as evil.
* There are many environment differences in Ruby and other languages that have nothing to do with the encoding engine. I use fork() all the time and it doesn't even exist on Windows. You also mention the "rb" flag used on Windows to stop newline translation in your tests. It's worth noting that the newline translation feature is in Ruby and Perl to help them work with data differences between the different environments. These things have been that way for a long time and I don't hear a lot of complaints about them, though I would love to have fork() on Windows just like Perl does. Also, this isn't limited to Windows. I posted on this list a few months back about some user switching code that worked for me everywhere except on Mac OS X. I'm not saying that any of this is good, but it does exist and we seem to accept it on some level.
* You say that m17n's complexity can be avoided if we just use UTF-8 everywhere and transcode incoming and outgoing data. I agree. If we do that in Ruby 1.9 though, transcoding all data as it comes in and working only with UTF-8 internally, doesn't all the complexity of m17n go away? Compatible encodings, the comparison order of differing encodings, and the like all become non-issues. Thus it seems to me that m17n allows us to take this favored approach or take harder roads, if we so choose.
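A rough sketch of that "UTF-8 inside, transcode at the edges" approach, assuming the incoming data is ISO-8859-1. The file names and sample data are invented for illustration:

```ruby
require "tmpdir"

text = nil
out_bytes = nil

Dir.mktmpdir do |dir|
  # Simulate incoming Latin-1 data: "café\n" with é as the single byte 0xE9.
  in_path = File.join(dir, "latin1.txt")
  File.open(in_path, "wb") { |f| f.write [99, 97, 102, 0xE9, 10].pack("C*") }

  # Transcode on the way in: external ISO-8859-1, internal UTF-8.
  text = File.open(in_path, "r:ISO-8859-1:UTF-8") { |f| f.read }

  # Inside the program every String is UTF-8, so they combine freely.
  result = text + "ol\u00e9\n"

  # Transcode on the way out by naming the external encoding.
  out_path = File.join(dir, "out.txt")
  File.open(out_path, "w:ISO-8859-1") { |f| f.write result }
  out_bytes = File.open(out_path, "rb") { |f| f.read }.bytes.to_a
end

p text.encoding   # UTF-8
p out_bytes       # each é is the single Latin-1 byte 233 again
```

With every String in one encoding between the two File.open calls, questions of compatibility or comparison across encodings never come up.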
James Edward Gray II
On Aug 6, 2009, at 6:47 AM, Brian Candler wrote: