Alphabets Benchmarks - How many ways to unaccent a text string? Turn AÄÁaäá into AAAaaa. And the winner is...

Hello,

  Let's try out half a dozen ways to unaccent a text string. [1]

  The challenge - What's the fastest way to turn `AÄÁaäá EÉeé IÍiíï
NÑnñ OÖÓoöó Ssß UÜÚuüú`
  into `AAAaaa EEee IIiii NNnn OOOooo Ssss UUUuuu`?

  Let's benchmark. And the winner (so far) is... Spoiler: `gsub`.

    NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/   # use/try regex constant for speed-up
    def unaccent_gsub( text, mapping )
      text.gsub( NON_ALPHA_CHAR_REGEX ) do |ch|
        mapping[ch] || ch
      end
    end
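
For readers following along, here is a minimal sketch of how the function above might be called. `UNACCENT_SAMPLE` is a hypothetical, cut-down stand-in for the full mapping in the benchmark repo [1], which isn't reproduced in this message.

```ruby
# Hypothetical, cut-down mapping for illustration only
# (the real table lives in the benchmark repo).
UNACCENT_SAMPLE = {
  'Ä' => 'A', 'Á' => 'A', 'ä' => 'a', 'á' => 'a',
  'É' => 'E', 'é' => 'e',
  'Ñ' => 'N', 'ñ' => 'n',
  'ß' => 's',
}.freeze

NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/

def unaccent_gsub( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX ) do |ch|
    mapping[ch] || ch   # fall back to the original char if unmapped
  end
end

puts unaccent_gsub( 'AÄÁaäá EÉeé NÑnñ Ssß', UNACCENT_SAMPLE )
puts unaccent_gsub( 'olé!', UNACCENT_SAMPLE )   # '!' has no mapping and is kept
```

Note the `|| ch` fallback: anything the regex matches but the mapping doesn't know passes through unchanged.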

  Can you find a faster way? Show us.

  Happy data (and text) wrangling with ruby. Cheers. Prost.

[1]: https://github.com/sportdb/sport.db/tree/master/alphabets/benchmark

You missed that String#gsub can take a Hash as its second argument

def unaccent_gsub_v3( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX, mapping )
end

-Rob

P.S. Pull request in your repo.
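
One behavior worth checking with the Hash form: when a match has no key in the Hash, `gsub` substitutes an empty string, whereas the block version with `mapping[ch] || ch` keeps the original character. This makes no difference for the benchmark text (accented letters and spaces only), but it does for punctuation. A minimal sketch, with `SAMPLE` and `KEEPING` as hypothetical names and a Hash default proc as one possible workaround:

```ruby
NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/
SAMPLE = { 'é' => 'e', 'ñ' => 'n' }   # hypothetical mini mapping

def unaccent_gsub( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX ) { |ch| mapping[ch] || ch }
end

def unaccent_gsub_v3( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX, mapping )
end

text = 'olé, señor!'
p unaccent_gsub( text, SAMPLE )      # block form keeps ',' and '!'
p unaccent_gsub_v3( text, SAMPLE )   # hash form drops unmapped matches

# Possible workaround: a Hash whose default proc returns the key itself.
KEEPING = Hash.new { |_hash, ch| ch }.merge!( SAMPLE )
p unaccent_gsub_v3( text, KEEPING )
```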

On 2019-Aug-13, at 16:17, Gerald Bauer <gerald.bauer@gmail.com> wrote:

[...]

Gerald Bauer wrote:

  let's try out half a dozen ways to unaccent a text string? [1]

  The challenge - What's the fastest way to turn `AÄÁaäá EÉeé IÍiíï
NÑnñ OÖÓoöó Ssß UÜÚuüú`
  into `AAAaaa EEee IIiii NNnn OOOooo Ssss UUUuuu`?

[1]: https://github.com/sportdb/sport.db/tree/master/alphabets/benchmark

For the single-character mapping there's String#tr. Timings with `iconv`, `tr` and `tr_v2` added:

text=>AÄÁaäá ...:
                           user     system      total        real
each_char              1.219647   0.021196   1.240843 (  1.587247)
each_char_v2           1.054844   0.011123   1.065967 (  1.312583)
each_char_reduce       1.372809   0.010580   1.383389 (  1.839789)
each_char_reduce_v2    1.226152   0.003887   1.230039 (  1.644493)
gsub                   1.124067   0.005926   1.129993 (  1.399212)
gsub_v2                0.949538   0.003917   0.953455 (  1.158131)
gsub_v3                0.804060   0.009833   0.813893 (  1.009054)
scan                   1.879271   0.006998   1.886269 (  2.305612)
iconv                  0.192035   0.001944   0.193979 (  0.224324)
tr                     0.154944   0.000978   0.155922 (  0.224245)
tr_v2                  0.095632   0.002961   0.098593 (  0.120770)

------------

text=>Aa...:
                           user     system      total        real
each_char              0.332079   0.002956   0.335035 (  0.430095)
each_char_v2           0.336198   0.002921   0.339119 (  0.411377)
each_char_reduce       0.379635   0.003936   0.383571 (  0.474561)
each_char_reduce_v2    0.386494   0.003990   0.390484 (  0.488094)
gsub                   0.034031   0.000004   0.034035 (  0.039017)
gsub_v2                0.033728   0.000000   0.033728 (  0.037283)
gsub_v3                0.035162   0.000000   0.035162 (  0.058595)
scan                   0.566857   0.000904   0.567761 (  0.679017)
iconv                  0.032989   0.000842   0.033831 (  0.036642)
tr                     0.079383   0.000004   0.079387 (  0.135033)
tr_v2                  0.020683   0.000986   0.021669 (  0.023377)

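  require 'iconv'   # not in the standard library since Ruby 2.0; needs the iconv gem
  def unaccent_iconv( text, mapping )
    # mapping is unused here; Iconv.iconv returns an Array of strings
    Iconv.iconv('ascii//translit//ignore', 'utf-8', text)
  end
  #=> ["AAAaaa ... UUUuuu"]

  def unaccent_tr( text, mapping )
    # rebuilds both character lists on every call
    text.tr( UNACCENT.keys.join, UNACCENT.values.join )
  end
  #=> "AAAaaa ... UsuuUU"

  # build the character lists once, up front
  TR_KEYS = UNACCENT.keys .join
  TR_VALS = UNACCENT.values.join
  def unaccent_tr_v2( text, mapping )
    text.tr( TR_KEYS, TR_VALS )
  end
  #=> "AAAaaa ... UsuuUU"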

Hello,

  Great, thanks. Today I learned that String#gsub can take a Hash as
its second argument. I added your unaccent function.

  About tr - that's great too, and I guess that's as fast as you can
get - but tr only maps one character to one character, so unaccent will
not work with ligatures, e.g. 'æ' => 'ae', 'ß' => 'ss', or with German
umlaut transliteration, 'ä' => 'ae', 'ö' => 'oe', etc.
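
The limitation can be seen in a minimal sketch: `String#tr` pairs characters position by position, so a multi-character replacement shifts every value after it. `MULTI` is a hypothetical two-entry mapping, not the benchmark's `UNACCENT` table.

```ruby
MULTI = { 'ä' => 'ae', 'ß' => 'ss' }   # hypothetical mapping

keys = MULTI.keys.join     # "äß"
vals = MULTI.values.join   # "aess"

# tr pairs characters by position: 'ä' -> 'a', 'ß' -> 'e'
p 'Straße'.tr( keys, vals )

# gsub with a Hash replaces each match with the full value
p 'Straße'.gsub( Regexp.union( MULTI.keys ), MULTI )
```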

  Some more new examples, quoted from the updated readme [1]:

Samuel Williams writes in with one more optimization.
Why not replace the `NON_ALPHA_CHAR_REGEX`, that is, `/[^A-Za-z0-9 ]/`
with a regex matching only known accented chars?

UNACCENT_REGEX = Regexp.union( UNACCENT.keys )
def unaccent_gsub_v3b( text, mapping=UNACCENT, regex=UNACCENT_REGEX )
  text.gsub( regex, mapping )
end
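
As a rough illustration of what the narrower regex changes: `Regexp.union` builds an alternation of the mapping's keys, so `gsub` only stops at characters that actually have a replacement and leaves everything else (punctuation included) untouched. The `UNACCENT` hash below is a hypothetical three-entry stand-in for the real table.

```ruby
# Hypothetical three-entry stand-in for the real UNACCENT table.
UNACCENT = { 'ä' => 'a', 'é' => 'e', 'ñ' => 'n' }.freeze
UNACCENT_REGEX = Regexp.union( UNACCENT.keys )

def unaccent_gsub_v3b( text, mapping=UNACCENT, regex=UNACCENT_REGEX )
  text.gsub( regex, mapping )
end

p UNACCENT_REGEX.match?( 'é' )          # a known accented char matches
p UNACCENT_REGEX.match?( '!' )          # punctuation does not
p unaccent_gsub_v3b( 'olé, señor!' )    # so ',' and '!' survive
```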

Hold on. Let's add some more optimizations to the humble `each_char`
version too.
All 7-bit characters (code points 0x00 to 0x7F, the Unicode Basic Latin
block, also known as ASCII) never need a mapping. Let's try:

def unaccent_each_char_v2_7bit( text, mapping )
  buf = String.new
  text.each_char do |ch|
    buf <<   if ch.ord <= 0x7F   # ASCII: pass through, no lookup
               ch
             else
               mapping[ch] || ch
             end
  end
  buf
end

Maybe a mapping lookup by integer index into an array is faster
than a hash lookup by single-character string?
Let's try:

# index the replacements by code point, e.g. UNACCENT_FASTER[ 'ä'.ord ]
UNACCENT_FASTER = UNACCENT.each_with_object( [] ) do |(ch,value),ary|
  ary[ ch.ord ] = value
end

def unaccent_each_char_v2_7bit_faster( text, mapping_faster=UNACCENT_FASTER )
  buf = String.new
  text.each_char do |ch|
    code = ch.ord   # compute the code point once per char
    buf <<  if code <= 0x7F
               ch
            else
               mapping_faster[ code ] || ch
            end
  end
  buf
end
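
To make the array-index idea concrete, here is a small sketch with a hypothetical two-entry mapping (`SAMPLE`; the method name is shortened for the example). Assigning at `ch.ord` produces a sparse array padded with `nil`, so its length is the highest code point plus one. That stays small for Latin-1 accents but would grow for mappings with higher code points.

```ruby
SAMPLE = { 'ß' => 's', 'ä' => 'a' }   # hypothetical mini mapping

# 'ß'.ord is 223 and 'ä'.ord is 228, so slots 0..222 and 224..227 stay nil.
SAMPLE_FASTER = SAMPLE.each_with_object( [] ) do |(ch, value), ary|
  ary[ ch.ord ] = value
end

def unaccent_each_char_faster( text, mapping_faster )
  buf = +''   # unfrozen, UTF-8 string buffer
  text.each_char do |ch|
    code = ch.ord
    buf << (code <= 0x7F ? ch : (mapping_faster[code] || ch))
  end
  buf
end

p SAMPLE_FASTER.size
p unaccent_each_char_faster( 'Straße ä!', SAMPLE_FASTER )
```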

  Voilà. And the winner is... Can you find a faster way? Show us.
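
For anyone who wants to take up the challenge, here is a minimal harness sketch using the standard library's `Benchmark.bm`. The mapping, sample text, candidate names and iteration count are assumptions chosen for illustration; the actual benchmark script lives in the repo [1] and may be set up differently.

```ruby
require 'benchmark'

# Hypothetical cut-down mapping and sample text (single-char values only,
# so that tr stays comparable with the gsub variants).
UNACCENT = {
  'Ä' => 'A', 'Á' => 'A', 'ä' => 'a', 'á' => 'a',
  'É' => 'E', 'é' => 'e', 'ß' => 's',
}.freeze
TEXT = 'AÄÁaäá EÉeé Ssß'
N    = 10_000   # assumed iteration count

UNACCENT_REGEX = Regexp.union( UNACCENT.keys )
TR_KEYS = UNACCENT.keys.join
TR_VALS = UNACCENT.values.join

CANDIDATES = {
  'gsub_block' => ->( text ) { text.gsub( UNACCENT_REGEX ) { |ch| UNACCENT[ch] } },
  'gsub_hash'  => ->( text ) { text.gsub( UNACCENT_REGEX, UNACCENT ) },
  'tr'         => ->( text ) { text.tr( TR_KEYS, TR_VALS ) },
}

# Check that every candidate agrees before timing anything.
results = CANDIDATES.map { |_name, fn| fn.call( TEXT ) }.uniq
raise "candidates disagree: #{results.inspect}" unless results.size == 1

Benchmark.bm( 12 ) do |x|
  CANDIDATES.each do |name, fn|
    x.report( name ) { N.times { fn.call( TEXT ) } }
  end
end
```

Checking the outputs first is a cheap guard: a candidate that is fast but returns a different string isn't a real contender.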

   Happy data (and text) wrangling with ruby. Cheers. Prost.

[1]: https://github.com/sportdb/sport.db/tree/master/alphabets/benchmark
