Utf-8 & Range under eruby (possibly Rails) problems

Johan_Sorensen · 17 December 2004 15:42

Hi,

I'm having some issues with a range that truncates text. Below is
a (very) simplified version of the truncate method that's used in Rails
(which is where I discovered this):

# this is in a UTF-8 encoded erb template (a Rails "view" in my case)
<% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa" -%>
<%= text[0..47] %>
<br />
<%= text[0..48] %>
<br />
# notice the 'o' in ingenjor instead of 'ö'
<% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna och hackar cocoa" -%>
<%= othertext[0..47] %>

# produces this (the last character on the first line displays as
# a "funny character" in browsers)

Eftersom jag jobbar som kontruktör/ingenjör p?
Eftersom jag jobbar som kontruktör/ingenjör på
Eftersom jag jobbar som kontruktör/ingenjor på

Is this a possible bug in Ruby (1.8.1), or could it be something in
Rails that gets in the way? I can reproduce this across two servers
and in WEBrick.
I was unable to test this properly in irb, since my terminal (or irb)
would act funny on the öäå's.

--johan

--
Johan Sørensen
Professional Futurist
www.johansorensen.com

Carlos · 17 December 2004 18:20

[Johan Sörensen <johans@gmail.com>, 2004-12-17 16.42 CET]

> Is this a possible bug in Ruby (1.8.1) or could it be something with
> Rails that gets in the way, I can reproduce this across two servers
> and in webrick.

It is a Ruby feature :). Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).
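To make the byte-versus-character point concrete, here is a small sketch that is not from the original posts. It assumes a modern Ruby (1.9.3 or later), where `String#[]` already indexes by character, so `byteslice` is used to imitate what `text[0..47]` did on Ruby 1.8. The sample string is the one from the question, joined onto one line.

```ruby
# Sample text from the original post (spelling kept as posted, because
# changing it would change the byte offsets).
text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"

# ö and å take two bytes each in UTF-8, so the two counts differ.
char_count = text.length    # characters on Ruby 1.9+, bytes on Ruby 1.8
byte_count = text.bytesize  # always bytes

# byteslice(start, length) cuts by bytes, like text[0..47] did on Ruby 1.8.
broken = text.byteslice(0, 48)  # ends on the first byte of a character
intact = text.byteslice(0, 49)  # ends on a character boundary

puts "chars=#{char_count} bytes=#{byte_count}"
puts "48 bytes valid? #{broken.valid_encoding?}"
puts "49 bytes valid? #{intact.valid_encoding?}"
```

The 48-byte slice is reported as invalid UTF-8, which is what the browser shows as a "funny character".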

Here is something you can start from:

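module UTF8Str
        # Index by UTF-8 characters (code points) instead of bytes.
        def [](*params)
                if params.all? { |p| Integer===p } ||
                   params.size==1 && Range===params[0]
                        # Decode to an array of code points and index that.
                        res = self.unpack("U*")[*params]
                        return nil if res.nil?  # index out of range
                        res = [res] unless Array===res
                        return res.pack("U*")
                end
                # Anything else (regex, string argument) goes to String#[].
                super
        end
end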

a="áéióúü"
a.extend UTF8Str

puts a[0], a[1], a[2], a[3], a[4], a[1,2], a[1..2], a[-1]
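For the truncation case from the question, the same unpack/pack idea can be wrapped in a standalone helper. This is only a sketch: `utf8_truncate` is a hypothetical name, not the Rails truncate method, and it counts code points, so a base letter followed by a combining accent would count as two.

```ruby
# Hypothetical helper (not part of Rails or Ruby): keep at most `limit`
# characters of a UTF-8 string by slicing code points instead of bytes.
def utf8_truncate(text, limit)
  codepoints = text.unpack("U*")       # one Integer per code point
  return text if codepoints.size <= limit
  codepoints[0, limit].pack("U*")      # re-encode the kept code points
end

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"
puts utf8_truncate(text, 46)
```

Because the cut is made on the array of code points, it can never land between the two bytes of a character.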

Good luck.

Johan_Sorensen · 17 December 2004 18:34

I see.

The thing that has me confused, though, is that it's not consistent,
since it only happens on the first line in the example I gave.
If I expand the range a little, it passes through untouched. If I change
either of the preceding ö's, it passes through untouched.

Is this expected behaviour?

-- johan

On Sat, 18 Dec 2004 03:20:41 +0900, Carlos <angus@quovadis.com.ar> wrote:
> It is a Ruby feature :). Indices in strings are bytes, not chars. For the
> moment, you must develop your own indexing routines for UTF-8 strings
> (notice that String#[/regex/] works, because regexes are UTF-8 aware).

Carlos · 17 December 2004 18:54

[Johan Sörensen <johans@gmail.com>, 2004-12-17 19.34 CET]

On Sat, 18 Dec 2004 03:20:41 +0900, Carlos <angus@quovadis.com.ar> wrote:
> It is a Ruby feature :). Indices in strings are bytes, not chars. For the
> moment, you must develop your own indexing routines for UTF-8 strings
> (notice that String#[/regex/] works, because regexes are UTF-8 aware).

> I see.
>
> The thing that has me confused, though, is that it's not consistent,
> since it only happens on the first line in the example I gave.
> If I expand the range a little, it passes through untouched. If I change
> either of the preceding ö's, it passes through untouched.

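Well, because "ö".length == 2 (UTF-8 is a multibyte encoding). Each "ö"
before the cut shifts the following byte offsets by one, so your range's
end was falling between the two bytes of the "å" in "på". Change an "ö" to
"o", or extend the range by one, and the end lands on a character boundary.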
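The offsets can be checked with a short sketch (an illustration added here, assuming Ruby 1.9 or later for `each_char` and `bytesize`). It records the byte range of every character in the sample string and looks for the one that straddles the end of the range 0..47.

```ruby
text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"

# Walk the characters and record the byte range each one occupies.
offset = 0
spans = text.each_char.map do |ch|
  span = [ch, offset, offset + ch.bytesize - 1]
  offset += ch.bytesize
  span
end

# Find the character that contains byte 47 but whose last byte lies
# beyond it, i.e. the character that text[0..47] cut in half on Ruby 1.8.
cut = 47
split_char = spans.find { |_, first, last| first <= cut && cut < last }
p split_char
```

For this string the character found is the "å", occupying bytes 47 and 48. With `cut = 48` nothing is found, which matches the second line of the original output.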

Michael_DeHaan · 17 December 2004 20:06

Someone on PerlMonks taught me a neat trick. Splitting on an empty
regex returns an array of one-character strings. It's true for Ruby
as well, so these indexing routines are really simple.

some_string.split(//).each { |c|
...
}

# or ... some_string.split(//)[5]
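A slightly fuller sketch of the split trick, using the test string from earlier in the thread. One assumption to flag: on Ruby 1.9 and later the split follows the string's encoding, while on Ruby 1.8 it appears to depend on `$KCODE` being set to UTF-8 (Rails does this) or on writing the regex as `//u`. Otherwise the split is per byte.

```ruby
some_string = "áéióúü"

# One array element per character (see the $KCODE caveat for Ruby 1.8).
chars = some_string.split(//)

puts chars[1]          # a single character
puts chars[1, 2].join  # two characters starting at index 1
puts chars[-1]         # the last character
```

Building the array costs time and memory proportional to the string length, so for repeated indexing it is worth splitting once and reusing `chars`.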
