Accents and String#tr

I wrote this method

   def self.normalize_for_sorting(s)
     return nil unless s
     norm = s.downcase
     norm.tr!('ÁÉÍÓÚ', 'aeiou')
     norm.tr!('ÀÈÌÒÙ', 'aeiou')
     norm.tr!('ÄËÏÖÜ', 'aeiou')
     norm.tr!('ÂÊÎÔÛ', 'aeiou')
     norm.tr!('áéíóú', 'aeiou')
     norm.tr!('àèìòù', 'aeiou')
     norm.tr!('äëïöü', 'aeiou')
     norm.tr!('âêîôû', 'aeiou')
     norm
   end

to normalize strings for sorting. The script is UTF-8, everything in my application is UTF-8, and $KCODE is 'u'.

But it does not work, examples:

    Andrés -> andruos
    López -> luupez
    Pérez -> puorez

I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to no avail. Any ideas?

-- fxn

On Tuesday, 21 February 2006 at 00:28, Xavier Noria wrote:

But it does not work, examples:

    Andrés -> andruos
    López -> luupez
    Pérez -> puorez

I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to
no avail. Any ideas?

-- fxn

Apparently, not all String methods were created equal:

david@chello082119107152:~$ irb
irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'jcode'
=> true
irb(main):003:0> "Andrés".tr("áéíóú", "aeiou")
=> "Andres"

jcode to the rescue!

David Vallner

Xavier Noria wrote:

But it does not work, examples:

   Andrés -> andruos
   López -> luupez
   Pérez -> puorez

I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to no avail. Any ideas?

-- fxn

Hi,

My guess is that the "tr" method treats its arguments as a string of
bytes. And because characters with accents need more than 1 byte in
UTF-8, #tr doesn't do what you would expect it to. (It's not even tr's
fault, how is it supposed to know that two bytes actually represent a
single character?)
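To make the byte-level guess concrete, here is a minimal sketch that imitates a byte-oriented tr. It assumes a Ruby whose strings carry an encoding (String#b needs 2.0 or later) and a UTF-8 source file; bytewise_tr is a hypothetical helper, not something from the original code.

```ruby
# Hypothetical helper: run tr on raw bytes instead of characters.
def bytewise_tr(str, from, to)
  # String#b returns a binary (ASCII-8BIT) copy, so tr sees one byte at a time.
  str.b.tr(from.b, to.b).force_encoding('UTF-8')
end

p 'é'.bytes                                # => [195, 169]
p 'áéíóú'.bytesize                         # => 10
p bytewise_tr('andrés', 'áéíóú', 'aeiou')  # => "andruos"
```

As far as I can tell this is what happens to "Andrés": the "from" argument is really ten bytes, "aeiou" gets padded with its last character, the shared lead byte 0xC3 ends up mapped to "u", and the second byte of "é" (0xA9) sits in fourth position and maps to "o".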

The solution is not to use #tr!, but #gsub!. It isn't as short, but at
least it's right ;-)

   norm.gsub!('ä', 'a')
   norm.gsub!('ë', 'e')
   # and so on...

And because that is against DRY (Don't Repeat Yourself), I would
recommend storing the mapping as a hash:

   accents = { 'ä' => 'a', 'ë' => 'e' }  # ...and so on
   accents.each do |accent, replacement|
     norm.gsub!(accent, replacement)
   end
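
Putting the pieces together, the whole method could look roughly like the sketch below. It assumes Ruby 1.9 or later (String#each_char and Regexp.union with an array) and a UTF-8 source file; accent_map and its table are hypothetical and only cover the vowels from the original post.

```ruby
# Hypothetical helper: builds { 'á' => 'a', 'À' => 'a', ... } from grouped strings.
def accent_map
  groups = {
    'a' => 'áàäâÁÀÄÂ',
    'e' => 'éèëêÉÈËÊ',
    'i' => 'íìïîÍÌÏÎ',
    'o' => 'óòöôÓÒÖÔ',
    'u' => 'úùüûÚÙÜÛ'
  }
  map = {}
  groups.each do |plain, accented|
    accented.each_char { |ch| map[ch] = plain }
  end
  map
end

def normalize_for_sorting(s, map = accent_map)
  return nil unless s
  # One alternation of all accented characters, so the string is scanned once.
  pattern = Regexp.union(map.keys)
  # Uppercase accents are mapped straight to lowercase letters, so the final
  # downcase only has to deal with plain ASCII.
  s.gsub(pattern) { |match| map[match] }.downcase
end

p normalize_for_sorting('Andrés')  # => "andres"
p normalize_for_sorting('López')   # => "lopez"
```

The table is rebuilt on every call to keep the sketch short; in real code you would probably build it once and reuse it.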

Regards,
   Robin Stocker