Marnen Laibow-Koser wrote:
Huh? Normalization transformations should be pretty easy to implement.
But the point is, you can't do anything useful with this until you
*transcode* it anyway, which you can do using Iconv (in either 1.8 or
1.9).
ruby 1.9's flagship feature of being able to store a string in its
original form, tagged with its encoding, doesn't help the OP much, even
if the string had been tagged correctly.
I mean, to a degree ruby 1.9 already supports this UTF-8-MAC as an
'encoding'. For example:
decomp = [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103].map { |x| x.chr("UTF-8-MAC") }.join
=> "español.lng"
decomp.codepoints.to_a
=> [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
decomp.encoding
=> #<Encoding:UTF8-MAC>
Notice that the n-accent is displayed as a single character by the
terminal, even though it is two codepoints (110, 771).
So you could argue that Dir on the Mac is at fault here, for tagging
the string as UTF-8 when it should be UTF-8-MAC.
But you still need to transcode to UTF-8 before doing anything useful
with this string. Consider a string containing decomposed characters
tagged as UTF-8-MAC:
(1) To be useful, the regexp /./ would have to match a series of
decomposed codepoints as a single 'character', str[n] would have to fetch
the nth 'character', and so on. I don't think this would be easy to
implement, since character boundaries no longer coincide with codepoint
boundaries.
What you actually get is this:
decomp.split(//)
=> ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Aside: note that "̃ is actually a single character, a double quote with
the accent applied!
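For comparison, here is a minimal sketch of splitting on 'characters' rather than codepoints. It assumes a Ruby newer than the 1.9.2 build used in this post, one whose regexp engine supports \X (a base character plus any combining marks that follow it). The string is built with pack("U*"), so it is tagged plain UTF-8, not UTF8-MAC.

```ruby
# Assumption: a Ruby newer than 1.9 whose regexp engine supports \X.
decomp = [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103].pack("U*")

# \X keeps "n" and the combining tilde (U+0303) together as one element.
clusters = decomp.scan(/\X/)

# The fifth element still holds two codepoints.
n_tilde_points = clusters[4].codepoints.to_a
```

This only changes how the string is split; str[n] and /./ still work on codepoints.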
(2) The OP wanted to match the regexp containing a single codepoint /ñ/
against the decomposed representation, which isn't going to work anyway.
That is, ruby 1.9 does not automatically transcode strings so they are
compatible; it just raises an exception if they are not.
/ñ/ =~ decomp
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with UTF8-MAC string)
from (irb):5
from /usr/local/bin/irb19:12:in `<main>'
(3) Since ruby 1.9 has a UTF-8-MAC encoding, it *should* be able to
transcode it to UTF-8 without using Iconv. However, this is simply
broken, at least in the version I'm trying here.
/ñ/ =~ decomp.encode("UTF-8")
=> nil
decomp.encode("UTF-8")
=> "espa\xB1\x00ol.lng"
decomp.encode("UTF-8").codepoints.to_a
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `codepoints'
from (irb):10:in `each'
from (irb):10:in `to_a'
from (irb):10
from /usr/local/bin/irb19:12:in `<main>'
RUBY_VERSION
=> "1.9.2"
RUBY_PATCHLEVEL
=> -1
RUBY_REVISION
=> 24186
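For reference, this is roughly what a working conversion to composed form should produce. It is only a sketch: it assumes Ruby 2.2 or later, where String#unicode_normalize exists, and uses that method as a stand-in for the broken encode call. It is not available in the 1.9.2 build shown above.

```ruby
# Assumption: Ruby 2.2+ (String#unicode_normalize); not the 1.9.2 build above.
decomp = [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103].pack("U*")

# NFC composes "n" (110) + combining tilde (771) into the single codepoint 241.
composed = decomp.unicode_normalize(:nfc)
composed_points = composed.codepoints.to_a

# A regexp holding the single-codepoint form (U+00F1) matches only after composing.
raw_match = /\u00F1/ =~ decomp
nfc_match = /\u00F1/ =~ composed
```

The point stands either way: the match only succeeds once both sides are in the same composed form.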
(4) If general support for decomposed form were added as further
'Encodings', there would be an explosion of encodings: UTF-8-D,
UTF-16LE-D, UTF-16BE-D etc, and that's ignoring the "compatibility"
versus "canonical" composed and decomposed forms.
(5) It is going to be very hard (if not impossible) to make a source
code string or regexp literal containing decomposed "n" and "̃" visibly
distinct from a literal containing a composed "ñ". Try it and see.
(In the above paragraph, the decomposed accent is applied to the
double-quote; that is, "̃ is actually a single 'character'). Most
editors are going to display both the composed and decomposed forms
identically.
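One way to at least see the difference is to avoid typing the characters at all and use \u escapes in the literal. A small sketch, assuming escapes are acceptable in your source:

```ruby
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE
composed   = "\u00F1"    # precomposed n-with-tilde

# The two literals render alike but are different strings.
same = (decomposed == composed)
decomposed_points = decomposed.codepoints.to_a
composed_points   = composed.codepoints.to_a
```

That makes the intent explicit in the source, but it does nothing for literals that were typed or pasted directly.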
I think this just shows that ruby 1.9's complexity is not helping in the
slightest. If you have to transcode to UTF-8 composed form, then ruby
1.8 does this just as well (and then you only need to tag the regexp as
UTF-8 using //u).
--
Posted via http://www.ruby-forum.com/.