Double byte string numbers to_int?

I have some date strings in Japanese (UTF-8 encoding) like:

平成１７年１０月１７日

Using a regular expression I can extract the year, month and day into separate
variables, but then I cannot convert them to Integers so that I can create a
Date object from the values. Here is my script:

date_string = "平成１７年１０月１７日"
date_string =~ /平成(\s*\S+\s*)年(\s*\S+\s*)月(\s*\S+\s*)日/
y, m, d = $1, $2, $3
puts y + ", " + m + ", " + d   # prints １７, １０, １７

So I get the numbers in the y, m and d variables, but converting them to
Integers gives me 0 as the result:

y.to_i   # => 0
m.to_i   # => 0
d.to_i   # => 0

Is there any easy way to convert the date string to a Date object?

regards,

Horacio

Horacio Sanson <hsanson@moegi.waseda.jp> wrote:

> I have some date strings in Japanese (UTF-8 encoding) like:
>
> 平成１７年１０月１７日
>
> Using a regular expression I can extract the year, month and day into separate
> variables, but then I cannot convert them to Integers so that I can create a
> Date object from the values. Here is my script:
>
> date_string = "平成１７年１０月１７日"
> date_string =~ /平成(\s*\S+\s*)年(\s*\S+\s*)月(\s*\S+\s*)日/
> y, m, d = $1, $2, $3
> puts y + ", " + m + ", " + d   # prints １７, １０, １７
>
> So I get the numbers in the y, m and d variables, but converting them to
> Integers gives me 0 as the result:
>
> y.to_i   # => 0
> m.to_i   # => 0
> d.to_i   # => 0
I may be wrong, but seeing your post without UTF-8 on, I see that the y, m and d
strings you get indeed contain the numbers, but as UTF-8 characters which are
*not* at the same position as the normal ASCII characters for numbers. I guess
that's what confuses to_i. Your strings contain characters that *look* like
numbers to us, but are not treated as such by to_i.

I'm afraid even the parsing capabilities of Date won't solve the problem. It
will probably expect digits that are in their ASCII positions rather than
Unicode characters that happen to look like numbers to us :).

--
Christophe Grandsire.

http://rainbow.conlang.free.fr

It takes a straight mind to create a twisted conlang.

Christophe Grandsire <christophe.grandsire@free.fr> wrote:

> I'm afraid even the parsing capabilities of Date won't solve the problem. It
> will probably expect digits that are in their ASCII positions rather than
> Unicode characters that happen to look like numbers to us :).

I checked around, and my guess is that your original string contains "fullwidth
ASCII variants". For digits, that's U+FF10 to U+FF19. In UTF-8, each of those is
encoded as a three-byte sequence. So your so-called "double byte strings" of two
digits are really six bytes long, and to_i does not treat them as containing
numbers.
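
This is easy to check. A small sketch, assuming Ruby 1.9 or later (the variable names are only for illustration), prints the code points and bytes of the full-width string next to the plain ASCII one:

```ruby
# Assumption: Ruby 1.9+; \u escapes keep the example independent of the
# source file's encoding.
wide  = "\uFF11\uFF17"   # full-width "１７" (U+FF11, U+FF17)
ascii = "17"

wide_codepoints = wide.codepoints.map { |c| format("U+%04X", c) }
wide_bytes      = wide.bytes.map { |b| format("%03o", b) }

puts wide_codepoints.join(" ")   # code points of the two full-width digits
puts wide_bytes.join(" ")        # octal bytes, as irb displayed them
puts wide.bytesize               # three bytes per full-width digit
puts ascii.bytesize              # one byte per ASCII digit
puts wide.to_i                   # to_i finds no leading ASCII digit
```

The last line is the whole problem in miniature: to_i only looks for ASCII digits, so a string that starts with anything else converts to 0.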

--
Christophe Grandsire.

I did some testing and so far have had no luck getting the encoded strings to
convert to numeric values.

s = "１７"   # => "\357\274\221\357\274\227"
puts s      # prints １７
s.to_i      # => 0

I also tried converting the string with Iconv, without success (Illegal
Sequence errors).

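Playing a little more, I came up with this little method to convert the
UTF-8-encoded string to a Fixnum:

class String
  # Converts a string made only of full-width digits (three bytes each in
  # UTF-8) to an integer. The third byte of each digit is 144 + its value.
  def w_to_i
    bytes = unpack("C*")   # byte values, whatever the Ruby version
    digits = bytes.size / 3
    res = ""
    0.upto(digits - 1) { |d|
      res = res + (bytes[(d * 3) + 2] - 144).to_s
    }
    res.to_i
  end
end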

# Example usage
s = "０"          # => "\357\274\220"
s.w_to_i          # => 0

s = "１"          # => "\357\274\221"
s.w_to_i          # => 1

s = "５１"        # => "\357\274\225\357\274\221"
s.w_to_i          # => 51
s.w_to_i.class    # => Fixnum (Integer on Ruby 2.4 and later)

This little hack works so far, but only for my specific application. Any tips
on making it better are appreciated. Also, if an easier way exists (and I
believe there must be one), I would appreciate any pointers.
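
For comparison, on a current Ruby the same conversion can be done without byte arithmetic. This is only a sketch, assuming Ruby 2.2 or later and UTF-8 strings; fullwidth_to_i is a made-up helper name, not something Ruby provides:

```ruby
# Assumptions: Ruby 2.2+ (for unicode_normalize) and UTF-8 strings.
# fullwidth_to_i is a hypothetical helper name, not part of Ruby.
def fullwidth_to_i(str)
  # tr maps each full-width digit (U+FF10..U+FF19) to the ASCII digit at the
  # same offset and leaves every other character alone; to_i does the rest.
  str.tr("\uFF10-\uFF19", "0-9").to_i
end

wide = "\uFF15\uFF11"                      # full-width "５１"
puts fullwidth_to_i(wide)                  # integer value of the digits
puts wide.unicode_normalize(:nfkc)         # NFKC also folds full-width forms to ASCII
puts wide.unicode_normalize(:nfkc).to_i
```

The tr version touches only the ten full-width digits. NFKC normalization is broader and rewrites other compatibility characters too, which may or may not be wanted.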

regards,
Horacio

On Friday 04 November 2005 19:54, Christophe Grandsire wrote:

> Christophe Grandsire <christophe.grandsire@free.fr> wrote:
> > I'm afraid even the parsing capabilities of Date won't solve the problem.
> > It will probably expect digits that are in their ASCII positions rather
> > than Unicode characters that happen to look like numbers to us :).
>
> I checked around, and my guess is that your original string contains
> "fullwidth ASCII variants". For digits, that's U+FF10 to U+FF19. In UTF-8,
> those are translated into three-byte characters. So your so-called "double
> byte strings" are really six bytes long, and are not treated as containing
> numbers by to_i.
> --
> Christophe Grandsire.

Horacio Sanson <hsanson@moegi.waseda.jp> wrote:

> I did some testing and so far have had no luck getting the encoded strings to
> convert to numeric values.
>
> s = "１７"   # => "\357\274\221\357\274\227"
> puts s      # prints １７
> s.to_i      # => 0
>
> I also tried converting the string with Iconv, without success (Illegal
> Sequence errors).

That's normal. Those characters are just Unicode characters, without any more
meaning (as far as to_i is concerned) than any other Japanese kanji or whatever
sign you might find in Unicode.

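> Playing a little more, I came up with this little method to convert the
> UTF-8-encoded string to a Fixnum:
>
> class String
>   # Converts a string made only of full-width digits (three bytes each in
>   # UTF-8) to an integer. The third byte of each digit is 144 + its value.
>   def w_to_i
>     bytes = unpack("C*")   # byte values, whatever the Ruby version
>     digits = bytes.size / 3
>     res = ""
>     0.upto(digits - 1) { |d|
>       res = res + (bytes[(d * 3) + 2] - 144).to_s
>     }
>     res.to_i
>   end
> end
>
> # Example usage
> s = "０"          # => "\357\274\220"
> s.w_to_i          # => 0
>
> s = "１"          # => "\357\274\221"
> s.w_to_i          # => 1
>
> s = "５１"        # => "\357\274\225\357\274\221"
> s.w_to_i          # => 51
> s.w_to_i.class    # => Fixnum (Integer on Ruby 2.4 and later)
>
> This little hack works so far, but only for my specific application. Any tips
> on making it better are appreciated. Also, if an easier way exists (and I
> believe there must be one), I would appreciate any pointers.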

I don't believe there is. The problem probably could not be solved even if we
had a perfectly Unicode-aware language. The big problem is that besides the
ASCII digits, Unicode also has digits for plenty of other languages, which may
not even use the positional system our digits use. At what point should to_i be
aware of those digits? If we decide that to_i should be aware of both ASCII
digits and fullwidth ASCII digits, shouldn't it also be aware of Arabic-Indic
digits (used for instance in Arabic, in the same positional system as ours)?
What about Devanagari digits (for Hindi), Tibetan digits, Mongolian digits, Thai
digits? While we're there, what about the Japanese kanji used as digits? What
about Roman numerals? Where should to_i stop being aware of the numeric nature
of the characters it receives? What happens when Unicode gets updated? And more
importantly: what do we do with alternative encodings? I'm not only talking
about other Unicode encodings besides UTF-8, but also the non-Unicode encodings,
especially those used for Asian languages.
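
To make the range of cases concrete, here is a small sketch, assuming Ruby 2.4 or later (the sample strings are chosen only for illustration). Each sample writes "17" in a different script's decimal digits. All of them match the Unicode decimal-digit category \p{Nd}, but to_i only understands the ASCII one:

```ruby
# Assumption: Ruby 2.4+ (for String#match?). \u escapes are used so the
# example does not depend on the source file's encoding.
samples = {
  "ASCII"        => "17",
  "fullwidth"    => "\uFF11\uFF17",
  "Arabic-Indic" => "\u0661\u0667",
  "Devanagari"   => "\u0967\u096D",
  "Thai"         => "\u0E51\u0E57",
}

samples.each do |name, s|
  is_digits = s.match?(/\A\p{Nd}+\z/)   # Unicode "decimal digit" category
  puts "#{name}: decimal digits? #{is_digits}, to_i = #{s.to_i}"
end
```

Kanji numerals and Roman numerals are not in \p{Nd} at all, so even a to_i that honoured that category would still leave those cases open.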

This problem doesn't have a general solution I'm afraid. One just can't account
for all the different cases...
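
For the one case in the original question, though, a narrow solution is short. The following is a sketch only, assuming Ruby 1.9 or later, UTF-8 strings, and dates in the Heisei era (Heisei 1 = 1989); heisei_to_date is an invented name, not a library method:

```ruby
require "date"

# Assumptions: Ruby 1.9+ with a UTF-8 source file, Heisei-era dates only.
# heisei_to_date is a hypothetical helper written for this example.
def heisei_to_date(str)
  # Fold full-width digits (U+FF10..U+FF19) to ASCII so \d and to_i work.
  ascii = str.tr("\uFF10-\uFF19", "0-9")
  m = ascii.match(/平成\s*(\d+)\s*年\s*(\d+)\s*月\s*(\d+)\s*日/)
  return nil unless m   # not a Heisei date in the expected layout

  # Heisei year N corresponds to Gregorian year 1988 + N.
  Date.new(1988 + m[1].to_i, m[2].to_i, m[3].to_i)
end

date = heisei_to_date("平成１７年１０月１７日")
puts date   # ISO 8601 form of the resulting Date
```

Other eras, kanji numerals, or other encodings would each need their own handling, which is the point made above.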

--
Christophe Grandsire.
