Removing diacritical marks

Hello Rubyists,

I would like to remove the accent marks (a.k.a. diacritical marks) from a
String. Assuming "line" is a String, this gets most of them:

    line.gsub!(/[ÀÁÂÃÄ]/,"A")
    line.gsub!(/[âãäàá]/,"a")
    line.gsub!(/[ÈÉÊË]/,"E")
    line.gsub!(/[êëèé]/,"e")
    line.gsub!(/[ÌÍÎÏ]/,"I")
    line.gsub!(/[îïìí]/,"i")
    line.gsub!(/[ÒÓÔÕÖ]/,"O")
    line.gsub!(/[ôõöòó]/,"o")
    line.gsub!(/[ÙÚÛÜ]/,"U")
    line.gsub!(/[ûüùú]/,"u")
    line.gsub!(/Ý/,"Y")
    line.gsub!(/ý/,"y")
    line.gsub!(/ñ/,"n")

Is there an easier/better way to do this?

Paul Barry wrote:

I would like to remove the accent marks (a.k.a. diacritical marks) from a
String. Assuming "line" is a String, this gets most of them:

    line.gsub!(/[ÀÁÂÃÄ]/,"A")
...
Is there an easier/better way to do this?

Yes. There's a potential problem with your way: if the accented characters
are more than one byte (i.e. in a multibyte encoding such as UTF-8) and the
regexp is matching byte by byte, each byte will be replaced with an A:
"À" => "AA".
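
To make the byte-level failure concrete, here is a minimal sketch. It assumes
Ruby 2.0 or later (for String#b) and only simulates byte-oriented matching, by
taking a binary copy of the string and using a no-encoding (/n) regexp; the
variable names are illustrative, not from the original post.

```ruby
# "\u00C0" is À; encoded as UTF-8 it occupies two bytes, 0xC3 0x80.
a_grave    = "\u00C0"
byte_count = a_grave.bytesize

# Simulated byte-oriented matching: a binary copy of the string and a
# no-encoding (/n) regexp whose character class lists the two bytes.
# Each byte matches on its own, so each one is replaced.
bytewise = a_grave.b.gsub(/[\xC3\x80]/n, "A")

# Encoding-aware matching treats À as a single character.
charwise = a_grave.gsub(/[\u00C0]/, "A")

puts byte_count  # 2
puts bytewise    # AA
puts charwise    # A
```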

This is safer: line.gsub!(/À|Á|Â|Ã|Ä/,"A")
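
On an encoding-aware Ruby (1.9 or later, with UTF-8 source and strings), the
whole list of substitutions can probably be collapsed into one String#tr call.
The sketch below is one way to do that, not something from the original
thread; GROUPS, FROM, TO and strip_accents are hypothetical names, and the
table covers only the characters listed in the question plus Ñ.

```ruby
# Hypothetical lookup table: ASCII letter => the accented forms it replaces.
GROUPS = {
  "A" => "ÀÁÂÃÄ", "a" => "àáâãä",
  "E" => "ÈÉÊË",  "e" => "èéêë",
  "I" => "ÌÍÎÏ",  "i" => "ìíîï",
  "O" => "ÒÓÔÕÖ", "o" => "òóôõö",
  "U" => "ÙÚÛÜ",  "u" => "ùúûü",
  "Y" => "Ý",     "y" => "ý",
  "N" => "Ñ",     "n" => "ñ"
}

# Build the two tr arguments so they line up position by position.
FROM = GROUPS.values.join
TO   = GROUPS.map { |ascii, accented| ascii * accented.length }.join

# Returns a copy of line with the listed accented letters replaced.
def strip_accents(line)
  line.tr(FROM, TO)
end

puts strip_accents("ÀÁÂÃÄ")  # AAAAA
```

Characters that are not in the table, such as æ or ø, pass through unchanged.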

I translated a method to do this from PHP earlier this year:
http://tinyurl.com/q8hlg [Google Groups]

Cheers,
Dave

I translated a method to do this from PHP earlier this year:
http://tinyurl.com/q8hlg [Google Groups]

Here's a simpler version (hard-coded for UTF-8; it would need some
tweaking for other encodings). It has a side effect of transliterating
punctuation to ASCII as well, which may or may not be desirable.

Paul

----

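$KCODE = 'u'      # Ruby 1.8: treat strings and regexps as UTF-8
require 'iconv'   # standard library in Ruby 1.8/1.9; removed from it in 2.0

class String
  # Returns a copy with each non-ASCII character transliterated to ASCII.
  def strip_diacritics
    self.gsub(/[^\x20-\x7f]/){
      # Some iconv implementations render the accent as a leading
      # ^ ` ' " or ~ before the letter; the sub removes that mark.
      Iconv.iconv('us-ascii//IGNORE//TRANSLIT', 'utf-8', $&)[0].
        sub(/^[\^`'"~](?=[a-z])/i, '')
    }
  end
end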

require 'test/unit'
class TestStripDiacritics < Test::Unit::TestCase

  def test_upper_case
    assert_equal('AAAAA', 'ÀÁÂÃÄ'.strip_diacritics)
    assert_equal('EEEE', 'ÈÉÊË'.strip_diacritics)
    assert_equal('IIII', 'ÌÍÎÏ'.strip_diacritics)
    assert_equal('OOOOO', 'ÒÓÔÕÖ'.strip_diacritics)
    assert_equal('UUUU', 'ÙÚÛÜ'.strip_diacritics)
    assert_equal('Y', 'Ý'.strip_diacritics)
    assert_equal('N', 'Ñ'.strip_diacritics)
  end

  def test_lower_case
    assert_equal('aaaaa', 'âãäàá'.strip_diacritics)
    assert_equal('eeee', 'êëèé'.strip_diacritics)
    assert_equal('iiii', 'îïìí'.strip_diacritics)
    assert_equal('ooooo', 'ôõöòó'.strip_diacritics)
    assert_equal('uuuu', 'ûüùú'.strip_diacritics)
    assert_equal('y', 'ý'.strip_diacritics)
    assert_equal('n', 'ñ'.strip_diacritics)
  end

  def test_words
    assert_equal('Internationalizaetion',
                 'Iñtërnâtiônàlizætiøn'.strip_diacritics)
  end

  def test_punctuation
    assert_equal('-', '—'.strip_diacritics)
    assert_equal("''", "''".strip_diacritics)
  end
end
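
For Rubies where Iconv is no longer in the standard library, a different
technique may be worth considering: decompose the string with Unicode
normalization and delete the combining marks. The sketch below assumes Ruby
2.2 or later (for String#unicode_normalize); strip_marks is a hypothetical
name, and this is not the method posted above. Unlike the Iconv version it
does not transliterate, so characters with no decomposition (æ, ø, the em
dash) are left as they are.

```ruby
# Assumption: Ruby 2.2+, where String#unicode_normalize is available.
def strip_marks(text)
  # NFD splits each accented letter into a base letter followed by one or
  # more combining marks; \p{Mn} matches those non-spacing marks.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_marks("ÀÁÂÃÄ")  # AAAAA
puts strip_marks("ñ")      # n
```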