I'm having some issues with a range that truncates text. Below is
a (very) simplified version of the truncate method that's used in Rails
(which is where I discovered this):
# this is in a UTF-8 encoded ERB template (a Rails "view" in my case)
<% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
hackar cocoa" -%>
<%= text[0..47] %>
<br />
<%= text[0..48] %>
<br />
# notice the 'o' in ingenjor instead of 'ö'
<% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
och hackar cocoa" -%>
<%= othertext[0..47] %>
#produces this (the last character on the first line will display as
a "funny character" in browsers)
Eftersom jag jobbar som kontruktör/ingenjör p?
Eftersom jag jobbar som kontruktör/ingenjör på
Eftersom jag jobbar som kontruktör/ingenjor på
Is this a possible bug in Ruby (1.8.1), or could it be something in
Rails that gets in the way? I can reproduce this across two servers
and in WEBrick.
I was unable to reproduce this properly in irb, since my terminal (or irb)
would act funny on the öäå's.
It is a Ruby feature :). Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).
Here is something you can start from:
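module UTF8Str
  # Character-based String#[] for UTF-8 strings. Only integer and
  # range arguments are handled here; anything else falls through
  # to the normal String#[].
  def [](*params)
    if params.all? { |p| Integer === p } ||
       (params.size == 1 && Range === params[0])
      # Index into the array of code points instead of the raw bytes.
      res = self.unpack("U*")[*params]
      res = [res] unless Array === res
      return res.pack("U*")
    end
    super
  end
end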
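For the truncation case that started this thread, the two ideas above (the regex route and the unpack("U*") route) could be sketched as small standalone helpers. The names truncate_chars_regex and truncate_chars_unpack are made up for this illustration and are not part of Ruby or Rails. The /u flag is there on the assumption that the regex should be UTF-8 aware regardless of any global setting.

```ruby
# Hypothetical helpers for illustration; neither is part of Ruby or Rails.

# Keep at most max_chars characters by letting a UTF-8 aware regex
# (/u flag) do the counting; /m lets "." match newlines as well.
def truncate_chars_regex(text, max_chars)
  text[/\A.{0,#{Integer(max_chars)}}/mu]
end

# Same idea via code points: unpack to integers, slice, pack back.
def truncate_chars_unpack(text, max_chars)
  text.unpack("U*")[0, max_chars].pack("U*")
end

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna"
puts truncate_chars_regex(text, 46)
puts truncate_chars_unpack(text, 46)
```

Both calls count characters rather than bytes, so the cut can never land inside the "å".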
The thing that has me confused, though, is that it's not consistent,
since it only happens on the first line in the example I gave.
If I expand the range a little, it passes through untouched. If I change
either of the preceding ö's, it passes through untouched.
Is this expected behaviour?
-- johan
On Sat, 18 Dec 2004 03:20:41 +0900, Carlos <angus@quovadis.com.ar> wrote:
> It is a Ruby feature :). Indices in strings are bytes, not chars. For the
> moment, you must develop your own indexing routines for UTF-8 strings
> (notice that String#[/regex/] works, because regexes are UTF-8 aware).
I see.
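Well, because "ö".length == 2 and "å".length == 2 (UTF-8 is a multibyte
encoding). In the first line your range's end was falling between the two
bytes of the "å"; with one "ö" fewer, or a slightly longer range, it ends on
a character boundary.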
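The byte arithmetic can be checked with a short sketch. It assumes Ruby 1.9 or later, where String#[] counts characters, so byteslice is used to mimic the byte-based indexing of the 1.8.1 discussed here. byte_offset_of is a hypothetical helper written for this illustration only.

```ruby
# Assumes Ruby 1.9+; byteslice and valid_encoding? are not available
# on the 1.8.1 discussed in this thread.
text      = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna"
othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna"

# Hypothetical helper: byte offset at which the first `char` starts.
def byte_offset_of(str, char)
  str[0, str.index(char)].bytesize
end

puts byte_offset_of(text, "å")       # 47: "å" occupies bytes 47 and 48
puts byte_offset_of(othertext, "å")  # 46: "å" occupies bytes 46 and 47

# Byte range 0..47 ends on the first byte of "å" in text, but on the
# second (last) byte of "å" in othertext.
puts text.byteslice(0..47).valid_encoding?       # false
puts text.byteslice(0..48).valid_encoding?       # true
puts othertext.byteslice(0..47).valid_encoding?  # true
```

Each "ö" ahead of the cut shifts everything after it by one byte, which is why changing either of them moves the range end back onto a character boundary.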
Someone on PerlMonks taught me a neat trick: splitting on an empty regex
returns an array of one-character strings. It's true for Ruby
as well, so these indexing routines are really simple.
some_string.split(//).each { |c|
  ...
}
# or ... some_string.split(//)[5]
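Applied to the text from the original question, the split trick might look like the sketch below. On Ruby 1.8 this is assumed to depend on $KCODE being set to "u" (or on using the /u regex flag); newer Rubies split a UTF-8 string into characters by default.

```ruby
# Assumption: the string is UTF-8 and the regex engine is UTF-8 aware
# (on Ruby 1.8, presumably via $KCODE = "u" or split(//u)).
text  = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna"
chars = text.split(//)   # array of one-character strings

puts chars[45]           # å (character index, not byte index)
puts chars[0..45].join   # Eftersom jag jobbar som kontruktör/ingenjör på
```

The array costs extra memory, but a range taken on it can only ever cut between whole characters.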