Quoting Horacio Sanson <hsanson@moegi.waseda.jp>:
I did some testing, and so far I have had no luck getting encoded strings to
convert to Numeric values.
s = "１７" => "\357\274\221\357\274\227"
puts s => １７
s.to_i => 0
I also tried converting the string with Iconv with no results (Illegal
Sequence errors).
That's normal. Those characters are just Unicode characters, without any more
meaning (as far as to_i is concerned) than any other Japanese kanji or whatever
sign you might find in Unicode.
Playing around a little more, I came up with this little method to convert the
UTF-8 encoded string to a Fixnum:
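class String
  # Relies on Ruby 1.8 semantics: size counts bytes and self[i] returns a byte.
  def w_to_i
    digits = self.size / 3   # each fullwidth digit is 3 bytes in UTF-8
    res = ""
    0.upto(digits - 1) { |d|
      # the third byte runs from 0x90 (144) for "０" to 0x99 for "９"
      res = res + (self[(d * 3) + 2] - 144).to_s
    }
    res.to_i
  end
end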
# Example usage
s = "０" => "\357\274\220"
s.w_to_i => 0
s = "１" => "\357\274\221"
s.w_to_i => 1
s = "５１" => "\357\274\225\357\274\221"
s.w_to_i => 51
s.w_to_i.class => Fixnum
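The byte arithmetic above depends on Ruby 1.8 string indexing. As a rough sketch only, the same idea can be written with String#bytes so that it also runs on Ruby 1.9 and later. The method name fullwidth_bytes_to_i is hypothetical and not part of the original post, and the validation step is an addition that w_to_i does not perform.

```ruby
# Hypothetical helper (not from the original post): same byte arithmetic as
# w_to_i, but using String#bytes so it does not depend on Ruby 1.8 indexing.
def fullwidth_bytes_to_i(str)
  digits = str.bytes.each_slice(3).map do |b1, b2, b3|
    # Fullwidth digits U+FF10..U+FF19 are encoded as EF BC 90 .. EF BC 99.
    ok = b1 == 0xEF && b2 == 0xBC && b3 && b3.between?(0x90, 0x99)
    raise ArgumentError, "not a fullwidth digit string" unless ok
    b3 - 0x90 # 0x90 is 144, the constant subtracted in w_to_i
  end
  digits.join.to_i
end

puts fullwidth_bytes_to_i("\uFF15\uFF11") # prints 51
```

Unlike w_to_i, this sketch raises on anything that is not a fullwidth digit instead of silently producing a wrong number.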
This little hack works so far, but only for my specific application. Any tips
on making this better are appreciated. Also, if there is an easier way
(and I believe there must be), I would appreciate any pointers.
I don't believe there is. This problem probably could not be solved even if we
had a perfectly Unicode-aware language. The big problem is that besides the
ASCII digits, Unicode also has digits for plenty of other languages, which may
not even use the positional system our digits use. At what point should to_i be
aware of those digits? If we decide that to_i should be aware of both ASCII
digits and fullwidth ASCII digits, shouldn't it also be aware of Arabic-Indic
digits (used for instance in Arabic, in the same positional system as ours)? What
about Devanagari digits (for Hindi), Tibetan digits, Mongolian digits, Thai
digits? While we're there, what about the Japanese kanji used as digits? What
about Roman numerals? Where should to_i stop being aware of the numeric nature
of the characters it receives? What happens when Unicode gets updated? And more
importantly: what do we do with alternative encodings? I'm not only talking about
other Unicode encodings besides UTF-8, but also the non-Unicode encodings,
especially those used for Asian languages.
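To make the "where should to_i stop" question concrete, here is a small sketch, assuming Ruby 1.9 or later with UTF-8 strings, that asks the regexp engine for the Unicode general category of a few characters that all mean "seven". The helper name number_category is hypothetical. Decimal digits from several scripts fall into category Nd, the Roman numeral is a "letter number" (Nl), and the kanji is not in a number category at all, so even Unicode's own classification does not give a single obvious cut-off.

```ruby
# Hypothetical helper: report which Unicode number category a character
# belongs to, using the \p{...} properties of Ruby's regexp engine.
def number_category(char)
  case char
  when /\A\p{Nd}\z/ then :Nd   # decimal digit
  when /\A\p{Nl}\z/ then :Nl   # letter number
  when /\A\p{No}\z/ then :No   # other number
  else :none
  end
end

samples = {
  "7"      => "ASCII digit seven",
  "\uFF17" => "fullwidth digit seven",
  "\u0667" => "Arabic-Indic digit seven",
  "\u096D" => "Devanagari digit seven",
  "\u0E57" => "Thai digit seven",
  "\u4E03" => "kanji for seven",
  "\u2166" => "Roman numeral seven",
}

samples.each do |char, name|
  puts "#{name}: category #{number_category(char)}, to_i gives #{char.to_i}"
end
```

Only the ASCII digit converts with to_i; every other entry yields 0, whatever its category.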
This problem doesn't have a general solution I'm afraid. One just can't account
for all the different cases...
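For an application that only ever sees a known, fixed set of digit scripts, one pragmatic option is to list those scripts explicitly and map them to ASCII before calling to_i. The following is a minimal sketch of that idea, assuming Ruby 1.9 or later and UTF-8 strings. The method name known_digits_to_i and the choice of three digit sets are illustrative assumptions, not a general answer to the problem described above.

```ruby
# Hypothetical helper: translate only the digit ranges listed here to ASCII,
# then let to_i do its usual work. Anything else is left untouched.
def known_digits_to_i(str)
  digit_sets = [
    "\uFF10-\uFF19", # fullwidth digits
    "\u0660-\u0669", # Arabic-Indic digits
    "\u0966-\u096F", # Devanagari digits
  ]
  ascii = digit_sets.inject(str) { |s, range| s.tr(range, "0-9") }
  ascii.to_i
end

puts known_digits_to_i("\uFF11\uFF17") # prints 17
```

Keeping the list explicit means the application decides which digits count, which is exactly the decision to_i cannot make on everyone's behalf. On Ruby 2.2 or later, String#unicode_normalize(:nfkc) also folds fullwidth digits to ASCII, but it rewrites other compatibility characters as well, so it is a broader transformation than this mapping.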
--
Christophe Grandsire.
http://rainbow.conlang.free.fr
It takes a straight mind to create a twisted conlang.