Character substitution using tr()

I'm using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

which is intended to strip accents out of strings, turning for example
"La Bohème" into "La Boheme". Here's the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

However, it's breaking for me: è is turned into "yy". I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:

"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
  => 110

"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
  => 55

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn't make any
difference.

thanks, max

--
Posted via http://www.ruby-forum.com/.

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8 and
its support for Unicode (or lack thereof), it might be quite a problem to
get it working.

UTF-8 is a variable-length encoding: ASCII characters (which include
a-zA-Z) are stored unchanged as a single byte, and everything else is encoded
as 2-4 bytes per character. Both of the strings are therefore valid UTF-8, but
Ruby 1.8's tr can't operate at the character level, only at the byte level.

Jan
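
To make the byte-versus-character point concrete, here is a small sketch. It assumes Ruby 1.9 or later, where String#length counts characters and String#bytesize counts bytes; the variable names are only for illustration.

```ruby
# Assumes Ruby 1.9+, where strings carry an encoding.
# "\u00E8" is è, written as an escape so the result does not
# depend on how this file itself is encoded.
e_grave = "\u00E8"

char_count  = e_grave.length      # number of characters
byte_count  = e_grave.bytesize    # number of bytes in UTF-8
byte_values = e_grave.bytes.to_a  # the raw byte values

puts char_count           # 1
puts byte_count           # 2
puts byte_values.inspect  # [195, 168]
puts "e".bytesize         # 1 (plain ASCII stays one byte)
```

A byte-oriented tr sees those two bytes as two separate "characters", which is consistent with a single è coming back as two replacement letters.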

On Tuesday 22 April 2008 11:46:23 Max Williams wrote:

Max Williams wrote:

However, it's breaking for me: è is turned into "yy".

It works if you require 'jcode' first.

HTH,
Sebastian
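
A rough sketch of how that suggestion might be wired in, under a few assumptions: the version guard is my own addition so the same file also loads on Ruby 1.9+ (where jcode no longer exists), and as far as I know jcode relies on $KCODE being set to "u" for UTF-8 (Rails of that era set it already). Only the tr part of the original method is repeated here to keep it short.

```ruby
# encoding: utf-8

# Assumption: only Ruby 1.8 needs jcode; 1.9+ has character-aware tr built in.
if RUBY_VERSION < "1.9"
  $KCODE = "u"     # tell Ruby 1.8 to treat strings as UTF-8
  require "jcode"  # replaces String#tr with a multibyte-aware version
end

def strip_diacritics(s)
  # latin1 subset only (the gsub calls from the original are omitted here)
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy")
end

puts strip_diacritics("La Bohème")  # La Boheme
```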

Jan Dvorak wrote:

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8 and
its support for Unicode (or lack thereof), it might be quite a problem to
get it working.

Ah... I'm a bit scared to change our project over to Ruby 1.9 (I didn't
know there was a 1.9) to solve this problem. I ended up just picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty but it works.

Thanks a lot for the info!
max
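
The individual-gsub workaround could also be collapsed into one pass with a lookup table. This is only a sketch, assuming Ruby 1.9 or later; ACCENT_MAP, ACCENT_PATTERN and strip_common_accents are hypothetical names, and the table covers just a handful of accents.

```ruby
# encoding: utf-8

# Hypothetical table of commonly used accents; extend as needed.
ACCENT_MAP = {
  "à" => "a", "á" => "a", "â" => "a", "ä" => "a",
  "ç" => "c",
  "è" => "e", "é" => "e", "ê" => "e", "ë" => "e",
  "ì" => "i", "í" => "i", "î" => "i", "ï" => "i",
  "ñ" => "n",
  "ò" => "o", "ó" => "o", "ô" => "o", "ö" => "o",
  "ù" => "u", "ú" => "u", "û" => "u", "ü" => "u"
}.freeze

# One regexp that matches any key of the table.
ACCENT_PATTERN = Regexp.union(ACCENT_MAP.keys)

def strip_common_accents(s)
  # Replace each matched accent with its table entry in a single pass.
  s.gsub(ACCENT_PATTERN) { |ch| ACCENT_MAP[ch] }
end

puts strip_common_accents("La Bohème")  # La Boheme
```

Characters that are not in the table pass through unchanged, so the list can grow as new accents turn up.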

Sebastian Hungerecker wrote:

Max Williams wrote:

However, it's breaking for me: è is turned into "yy".

It works if you require 'jcode' first.

HTH,
Sebastian

Perfect, thanks! That's much more palatable than upgrading ruby.

cheers
max
