James Gray wrote:
> I've got the majority of the new functionality covered in my m17n
> series now:
> http://blog.grayproductions.net/articles/understanding_m17n
> I expect to have the minor side topics I'm still missing covered in
> the next few weeks.
This is a good start, but I think it just scratches the surface.
Questions which immediately spring to mind:
* What is the nature of the "compatible" relationship? Does A compatible
with B imply B compatible with A? The return value, at least, is not
commutative; it depends on argument order:
irb(main):002:0> a = "abc".force_encoding("UTF-8")
=> "abc"
irb(main):003:0> b = "def".force_encoding("ISO-8859-1")
=> "def"
irb(main):004:0> Encoding.compatible?(a,b)
=> #<Encoding:UTF-8>
irb(main):005:0> Encoding.compatible?(b,a)
=> #<Encoding:ISO-8859-1>
Also, it's not encodings that are compatible, but actual strings. Two
strings may or may not be compatible, depending not just on their
encodings but on their actual content at that instant.
irb(main):006:0> a = "abc\xff".force_encoding("UTF-8")
=> "abc\xFF"
irb(main):007:0> b = "def\xff".force_encoding("ISO-8859-1")
=> "def\xFF"
irb(main):008:0> Encoding.compatible?(a,b)
=> nil
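The irb sessions above can be folded into one small script. This is only a sketch for probing the behaviour: `tagged` and `compat` are hypothetical helper names, not part of Ruby, and the results may differ between Ruby versions.

```ruby
# Hypothetical helper: copy a string and re-tag the copy with an encoding.
def tagged(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

# Thin wrapper so each probe below reads the same way.
def compat(x, y)
  Encoding.compatible?(x, y)
end

ascii_utf8   = tagged("abc", "UTF-8")
ascii_latin1 = tagged("def", "ISO-8859-1")
high_utf8    = tagged("abc\xff", "UTF-8")       # \xff is not valid UTF-8
high_latin1  = tagged("def\xff", "ISO-8859-1")

p compat(ascii_utf8, ascii_latin1)  # both ASCII-only, UTF-8 first
p compat(ascii_latin1, ascii_utf8)  # same strings, order swapped
p compat(ascii_utf8, high_latin1)   # only one side is ASCII-only
p compat(high_utf8, high_latin1)    # high bytes on both sides
```

The first two calls repeat the argument-order experiment, and the last two show that the answer changes with the content of the strings even though the encoding tags stay the same.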
* What about string literals which include escape sequences like \u?
This seems to override the source encoding rule.
$ ruby19
# encoding: ISO-8859-1
puts "abc".encoding
puts "abc\u1234".encoding
^D
ISO-8859-1
UTF-8
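The same check can be run from inside another script by evaluating the two literals as a string. This is a sketch, not a statement of how the parser decides: `probe_source` and `literal_encodings` are hypothetical names, and it assumes eval takes its source encoding from the magic comment or from the encoding of the string it is given.

```ruby
# The script from the post, held as text so it can be parsed under a
# source encoding other than this file's own. The heredoc is single-quoted,
# so the \u escape reaches the parser untouched.
def probe_source
  <<~'RUBY'
    # encoding: ISO-8859-1
    ["abc".encoding, "abc\u1234".encoding]
  RUBY
end

# Also tag the source text itself as ISO-8859-1 (it is pure ASCII, so the
# conversion is lossless). That way the sketch does not rely solely on
# eval honouring the magic comment.
def literal_encodings
  eval(probe_source.encode("ISO-8859-1"))
end

plain, escaped = literal_encodings
puts plain    # the post reports ISO-8859-1 here
puts escaped  # the post reports UTF-8 here
```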
* What encoding is chosen for regexp literals? (The rules seem to differ
from those for string literals.) What about string literals which include
#{interpolation}? What about regexp literals which include
#{interpolation}?
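One way to explore these questions is to print the encoding of each kind of literal side by side. This is a probe rather than a description of the rules: `literal_report` is a hypothetical name, the interpolated value is an arbitrary non-ASCII UTF-8 character, and the output depends on the source encoding of the file it runs in.

```ruby
# Collect the encoding tag of each kind of literal in one place.
def literal_report
  snowman = "\u2603"  # a non-ASCII UTF-8 string to interpolate
  {
    string_plain:   "abc".encoding,
    regexp_plain:   /abc/.encoding,
    regexp_escaped: /abc\u1234/.encoding,
    string_interp:  "abc#{snowman}".encoding,
    regexp_interp:  /abc#{snowman}/.encoding,
  }
end

literal_report.each { |kind, enc| puts "#{kind}: #{enc}" }
```

Comparing the `string_plain` and `regexp_plain` lines shows whether the two kinds of literal follow the same rule for ASCII-only content.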
* What source encoding and external encoding is used in irb?
* I think it will be worth explaining what you need to do to handle
binary data (using "rb" and "wb", the ASCII-8BIT encoding, how to set
external encoding for STDIN, the fact that read() and gets() return
different encodings for the same data...)
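A small sketch of the read()/gets() difference, assuming a scratch file and an explicit "r:UTF-8" mode so the result does not depend on the locale. `read_encodings` is a hypothetical helper name, and setting the encoding of STDIN is not covered here.

```ruby
require "tempfile"

# Write some raw bytes to a scratch file, read them back three ways,
# and report the encoding tag on each result.
def read_encodings(bytes)
  Tempfile.create("m17n-probe") do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    {
      binary_read: File.open(tmp.path, "rb")      { |f| f.read.encoding },
      text_gets:   File.open(tmp.path, "r:UTF-8") { |f| f.gets.encoding },
      text_read_n: File.open(tmp.path, "r:UTF-8") { |f| f.read(3).encoding },
    }
  end
end

# "caf" followed by a lone Latin-1 e-acute byte and a newline.
read_encodings("caf\xE9\n".b).each { |how, enc| puts "#{how}: #{enc}" }
```

The `text_gets` and `text_read_n` entries read the same file in the same mode, so any difference between them comes from the method used and not from the data.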
* What actually happens if you use string operations on two strings with
different encodings? e.g. str1 == str2, str1 + str2, str1 << str2? What
about indexing a hash with two strings which are identical byte
sequences but different encodings?
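These cases can be tried directly. The sketch below uses two pairs of strings, one ASCII-only and one with identical non-ASCII bytes, each tagged once as UTF-8 and once as ISO-8859-1. `tag` and `concat_outcome` are hypothetical helper names, and only `+` is exercised for concatenation.

```ruby
# Hypothetical helper: take a binary copy of the bytes, then tag it.
def tag(bytes, encoding)
  bytes.b.force_encoding(encoding)
end

# Report the encoding of x + y, or the exception class if Ruby refuses.
def concat_outcome(x, y)
  (x + y).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

ascii_u = tag("abc", "UTF-8")
ascii_l = tag("abc", "ISO-8859-1")
high_u  = tag("\xC3\xA9", "UTF-8")       # one character in UTF-8
high_l  = tag("\xC3\xA9", "ISO-8859-1")  # same bytes, two Latin-1 characters

p ascii_u == ascii_l                # equality, ASCII-only content
p high_u == high_l                  # equality, identical non-ASCII bytes
p concat_outcome(ascii_u, high_l)   # concatenation, one side ASCII-only
p concat_outcome(high_u, high_l)    # concatenation, non-ASCII on both sides
p({ ascii_u => 1 }[ascii_l])        # hash lookup, ASCII-only key
p({ high_u => 1 }[high_l])          # hash lookup, identical non-ASCII bytes
```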
* What do C extension writers need to know about strings? It seems at
the moment there is some magic hidden state (ENC_CODERANGE_7BIT) which
you must remember to update whenever you create or modify a string, and
if you don't, things break badly.
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/329267
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/23155
Regards,
Brian.
--
Posted via http://www.ruby-forum.com/.