I'm using a method that I found on the acts_as_ferret site:
http://projects.jkraemer.net/acts_as_ferret/#UTF-8support
which is intended to strip accents out of strings, turning, for example,
"La Bohème" into "La Boheme". Here's the method:
  def strip_diacritics(s)
    # latin1 subset only
    s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
         "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
      gsub(/Æ/, "AE").
      gsub(/Ð/, "Eth").
      gsub(/Þ/, "THORN").
      gsub(/ß/, "ss").
      gsub(/æ/, "ae").
      gsub(/ð/, "eth").
      gsub(/þ/, "thorn")
  end
However, it's breaking for me: è is turned into "yy". I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:
"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110
"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55
Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.
thanks, max
···
--
Posted via http://www.ruby-forum.com/.
With Ruby 1.9 your code works fine without modification. With Ruby 1.8 and
its support for Unicode (or lack thereof), it might be quite a problem to
get it working.
> Assuming this is the problem, can anyone tell me how to get around it?
> I know next to nothing about character encoding: I tried converting both
> translation strings to UTF-8 with String#toutf8, but that didn't make any
> difference.
UTF-8 is a variable-length encoding: ASCII characters (which include a-zA-Z)
are encoded as a single byte each, and everything else takes 2-4 bytes. Both
strings are therefore valid UTF-8, but Ruby 1.8's tr works on bytes, not
characters, so it pairs the 110 bytes of the first string with the 55 bytes
of the second.
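To make the byte/character distinction concrete, here is a small sketch. It assumes a UTF-8 source file and Ruby 1.9 or later, where String#length counts characters and String#bytesize counts bytes; the constant names ACCENTED and PLAIN are just for illustration.

```ruby
# encoding: utf-8

# The two translation strings from the original method.
ACCENTED = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
PLAIN    = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

# On Ruby 1.9+ both strings have the same number of characters...
puts ACCENTED.length    # 55
puts PLAIN.length       # 55

# ...but the accented one needs two bytes per character in UTF-8.
puts ACCENTED.bytesize  # 110
puts PLAIN.bytesize     # 55

# A single accented character is a two-byte sequence.
p "è".unpack("C*")      # [195, 168]

# A character-aware tr pairs the strings up position by position.
puts "La Bohème".tr(ACCENTED, PLAIN)  # La Boheme
```

On Ruby 1.8, String#size reports the byte count, which is where the 110 in the original question comes from.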
Jan
···
On Tuesday 22 April 2008 11:46:23 Max Williams wrote:
Max Williams wrote:
> However, it's breaking for me: è is turned into "yy".
It works if you require 'jcode' first.
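
A minimal sketch of what that could look like, under a few assumptions: the source file is UTF-8, and on Ruby 1.8 $KCODE is set to "u" so that jcode treats strings as UTF-8 (Rails applications of that era typically set this already). jcode only ships with Ruby 1.8, hence the version guard. The helper name strip_e_accents is hypothetical and covers just four characters to keep the example short.

```ruby
# encoding: utf-8

if RUBY_VERSION < "1.9"
  $KCODE = "u"     # assumption: strings in this program are UTF-8
  require "jcode"  # makes String#tr and friends multibyte-aware on 1.8
end

# Hypothetical, cut-down version of strip_diacritics for illustration.
def strip_e_accents(s)
  s.tr("èéêë", "eeee")
end

puts strip_e_accents("La Bohème")
```

On Ruby 1.9 and later the guard is skipped and the built-in tr is already character-aware, so the same file should behave the same way there.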
HTH,
Sebastian
···
--
NP: Depeche Mode - The Things You Said
Jabber: sepp2k@jabber.org
ICQ: 205544826
Jan Dvorak wrote:
> With ruby 1.9 your code works fine without modifications, with ruby 1.8 and
> its support for unicode (or lack thereof) it might be quite a problem to
> get it working.
Ah... I'm a bit scared to change our project over to Ruby 1.9 (I didn't
know there was a 1.9) to solve this problem. I ended up just picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty but it works.
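
For anyone wanting to do the same, a rough sketch of that workaround follows. The table COMMON_ACCENTS and the helper strip_common_accents are hypothetical names, and the choice of characters is only an example, not the list actually used in the project. Plain-string gsub matches whole byte sequences, so it should not depend on tr being character-aware.

```ruby
# encoding: utf-8

# Hypothetical table of the accents considered "common"; extend as needed.
COMMON_ACCENTS = [
  ["à", "a"], ["â", "a"], ["ç", "c"],
  ["è", "e"], ["é", "e"], ["ê", "e"],
  ["î", "i"], ["ô", "o"], ["ù", "u"], ["ü", "u"]
]

# Apply one gsub per pair, feeding each result into the next substitution.
def strip_common_accents(s)
  COMMON_ACCENTS.inject(s) do |result, (accented, plain)|
    result.gsub(accented, plain)
  end
end

puts strip_common_accents("La Bohème")
```

Any character missing from the table passes through unchanged, which is the trade-off of this approach.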
Thanks a lot for the info!
max
Sebastian Hungerecker wrote:
> Max Williams wrote:
>> However, it's breaking for me: è is turned into "yy".
> It works if you require 'jcode' first.
> HTH,
> Sebastian
Perfect, thanks! That's much more palatable than upgrading Ruby.
cheers
max