Which encoding causes fewest problems in Ruby 1.8.2?

Conductor · 11 June 2006 01:12

I posted a similar question in the rails group but this is more specific
to ruby 1.8.2.

I read that ruby has problems with multibyte charsets. And I read that
there might be some problems with ISO-8859-15 related to REXML. And I
read that regex might have problems with ISO-8859-1.

Given the above problems (or rumors), which encoding is recommended for
use with ruby 1.8.2?

UTF-8
ISO-8859-1
ISO-8859-15

I'm certain both UTF-8 and ISO-8859-15 will support all the characters
I'll ever use. And ISO-8859-1 only lacks a couple characters I might
use on very rare occasions so I'm just looking for a charset that will
cause fewest problems with Ruby.

Thanks in advance for any suggestions.

--
Posted via http://www.ruby-forum.com/.

Michal_hramrach_Such · 11 June 2006 10:56

I posted a similar question in the rails group but this is more specific
to ruby 1.8.2.

I read that ruby has problems with multibyte charsets. And I read that
there might be some problems with ISO-8859-15 related to REXML. And I
read that regex might have problems with ISO-8859-1.

Given the above problems (or rumors), which encoding is recommended for
use with ruby 1.8.2?

None. They all cause problems. With utf-8 most string functions won't
work correctly (probably including regexps). There are special
extensions to work around this to some extent.

ISO-8859-1 and ISO-8859-15 should be pretty much the same. They are
simple 8-bit encodings, so the string functions that expect 1-byte
characters work. They won't allow you to use slightly more exotic
characters (like Greek letters for maths, ...).


Yukihiro_Matsumoto2 · 11 June 2006 12:14

Hi,

In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?" on Sun, 11 Jun 2006 10:12:45 +0900, Jim Smith <nospam@nospam.lan> writes:

Given the above problems (or rumors), which encoding is recommended for
use with ruby 1.8.2?

UTF-8
ISO-8859-1
ISO-8859-15

String and Regexp handle all of them in most cases. But
upper/lower case handling for non-ASCII alphabets is not supported.
Use -Ku for UTF-8 and -Kn for ISO-8859-*.

matz.

Michal_hramrach_Such · 12 June 2006 03:14

Length and indexing do not work very well with utf-8.

~ $ irb -Ku
irb(main):001:0> $KCODE
=> "UTF8"
irb(main):002:0> a='α-ω'
=> "α-ω"
irb(main):003:0> r=/[β-ω]/
=> /[β-ω]/
irb(main):004:0> a.length
=> 5
irb(main):005:0> a[0..0]
=> "\316"
irb(main):006:0> a[0..1]
=> "α"

Fortunately, the regexps work.

irb(main):007:0> a =~ r
=> 3

So you could use a.scan /./ to calculate length or index characters in a string.

irb(main):008:0> a.scan /./
=> ["α", "-", "ω"]
irb(main):009:0> (a.scan /./).length
=> 3
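
The scan-based approach above can be wrapped in a few small helpers. This is only a minimal sketch, not something from the original posts: the names utf8_chars, utf8_length and utf8_slice are hypothetical, and the /u flag is put on the regexp itself on the assumption that it makes the pattern match whole UTF-8 characters even when -Ku is not given.

```ruby
# Split a UTF-8 string into an array of characters.
# /m lets "." match newlines too; /u marks the regexp as UTF-8.
def utf8_chars(str)
  str.scan(/./mu)
end

# Number of characters (not bytes) in a UTF-8 string.
def utf8_length(str)
  utf8_chars(str).length
end

# Substring of +len+ characters starting at character index +start+.
# Returns an empty string when +start+ is past the end.
def utf8_slice(str, start, len)
  (utf8_chars(str)[start, len] || []).join
end

a = 'α-ω'
p utf8_length(a)       # => 3
p utf8_slice(a, 0, 1)  # => "α"
```

Each call rescans the whole string, so for repeated indexing it is probably better to call utf8_chars once and work on the resulting array.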

Michal


Yukihiro_Matsumoto2 · 12 June 2006 05:37

Hi,

In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?" on Mon, 12 Jun 2006 12:14:48 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:

Length and indexing do not work very well with utf-8.

I know. Operations on characters should be based on Regexp.
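
As an illustration of what regexp-based character operations might look like, here is a hedged sketch. The helper names utf8_first and utf8_reverse are made up for this example and do not come from the thread; the /u flag is assumed to select UTF-8 matching for the pattern.

```ruby
# First +n+ characters of a UTF-8 string, taken with an anchored regexp
# instead of byte-based indexing.
def utf8_first(str, n)
  str[/\A.{0,#{n}}/mu]
end

# Reverse a UTF-8 string character by character.
# A byte-wise reverse would scramble the multibyte sequences.
def utf8_reverse(str)
  str.scan(/./mu).reverse.join
end

a = 'α-ω'
p utf8_first(a, 2)   # => "α-"
p utf8_reverse(a)    # => "ω-α"
```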

matz.

Michal_hramrach_Such · 12 June 2006 12:33

I am sure you do know.

But it is not what I call 'String handles them all most of the cases'.

Thanks

Michal


Yukihiro_Matsumoto2 · 12 June 2006 14:28

Hi,

In message "Re: Which encoding causes fewest problems in Ruby 1.8.2?" on Mon, 12 Jun 2006 21:33:10 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:

I know. Operations on characters should be based on Regexp.

But it is not what I call 'String handles them all most of the cases'.

OK, then I'd say 'String handles them all in most cases if your
operations are based on Regexp'.

matz.
