sportdb/sport.db

Potential faster gsub for unaccenting

ioquatix opened this issue · 2 comments

UNACCENT = {
  'Ä'=>'A',  'ä'=>'a',
  'Á'=>'A',  'á'=>'a',
  'É'=>'E',  'é'=>'e',
  'Í'=>'I',  'í'=>'i',
             'ï'=>'i',
  'Ñ'=>'N',  'ñ'=>'n',
  'Ö'=>'O',  'ö'=>'o',
  'Ó'=>'O',  'ó'=>'o',
             'ß'=>'ss',
  'Ü'=>'U',  'ü'=>'u',
  'Ú'=>'U',  'ú'=>'u',
}

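# Build the alternation once, up front, from the constant mapping.
PATTERN = Regexp.union(UNACCENT.keys)

# Replace every accented character in a single gsub pass via hash lookup.
def unaccent_gsub(text, mapping)
  text.gsub(PATTERN, mapping)
end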

text = "Apples and AÄÁaäá EÉeé IÍiíï NÑnñ OÖÓoöó Ssß UÜÚuüú"

puts unaccent_gsub(text, UNACCENT)

mapping is passed in as an argument, but PATTERN is built once from UNACCENT, so the two can get out of sync if a different mapping is passed. In theory, building the pattern should probably move into the function. Whether that matters depends on whether mapping is actually constant.

If mapping isn't constant, I'd suggest a class instance that caches the PATTERN for its mapping.
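
A minimal sketch of that idea, assuming a hypothetical Unaccenter class (the class and method names are illustrative, not something that exists in sport.db). Each instance builds its pattern once from its own mapping, so the pattern and the mapping cannot drift apart:

```ruby
# Hypothetical helper, not part of sport.db: caches the pattern per mapping.
class Unaccenter
  def initialize(mapping)
    # Freeze a copy so later changes to the caller's hash cannot
    # invalidate the cached pattern.
    @mapping = mapping.dup.freeze
    @pattern = Regexp.union(@mapping.keys)
  end

  # Single gsub pass; each match is looked up in the mapping hash.
  def unaccent(text)
    text.gsub(@pattern, @mapping)
  end
end

unaccenter = Unaccenter.new('Ä' => 'A', 'ä' => 'a', 'ß' => 'ss')
puts unaccenter.unaccent('Äpfel, Straße')   # prints "Apfel, Strasse"
```

The cost of Regexp.union is paid once per instance rather than once per call, which is the point of the caching.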

Good point. I added your optimization in unaccent_gsub_3b and updated the benchmark and readme. Thanks. Cheers.