Potential faster gsub for unaccenting

Question

ioquatix opened this issue 5 years ago · 2 comments

UNACCENT = {
  'Ä'=>'A',  'ä'=>'a',
  'Á'=>'A',  'á'=>'a',
  'É'=>'E',  'é'=>'e',
  'Í'=>'I',  'í'=>'i',
             'ï'=>'i',
  'Ñ'=>'N',  'ñ'=>'n',
  'Ö'=>'O',  'ö'=>'o',
  'Ó'=>'O',  'ó'=>'o',
             'ß'=>'ss',
  'Ü'=>'U',  'ü'=>'u',
  'Ú'=>'U',  'ú'=>'u',
}

PATTERN = Regexp.union(UNACCENT.keys)
def unaccent_gsub(text, mapping)
  text.gsub(PATTERN, mapping)
end

text = "Apples and AÄÁaäá EÉeé IÍiíï NÑnñ OÖÓoöó Ssß UÜÚuüú"

puts unaccent_gsub(text, UNACCENT)

Answer 1 · 2019-08-13T22:08:51.000Z

mapping is passed in as an argument, but PATTERN is built once from UNACCENT, so the two can disagree if a different mapping is ever passed. In theory, building the pattern should probably move into the function. Whether that matters depends on whether mapping is actually constant or not.

If mapping isn't constant, I'd suggest a class instance that builds the PATTERN once for its mapping and caches it.
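
A minimal sketch of what I mean, assuming the mapping is a plain Hash of single characters to replacement strings. The class name Unaccenter and its call method are made up for illustration and are not taken from the benchmark:

```ruby
# Hypothetical wrapper: pairs a mapping with the pattern derived from it,
# so the two can never get out of sync.
class Unaccenter
  def initialize(mapping)
    # Copy and freeze so later changes to the caller's hash cannot
    # invalidate the cached pattern.
    @mapping = mapping.dup.freeze
    # Built once per instance and reused on every call.
    @pattern = Regexp.union(@mapping.keys)
  end

  def call(text)
    # String#gsub with a Hash replaces each match with the value
    # stored under the matched text.
    text.gsub(@pattern, @mapping)
  end
end

unaccenter = Unaccenter.new('Ä' => 'A', 'ä' => 'a', 'ß' => 'ss')
puts unaccenter.call("Äpfel und Straße") # prints "Apfel und Strasse"
```

The regexp is built once per instance, so repeated calls with the same mapping don't pay for Regexp.union again, and a different mapping just gets its own instance.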

Answer 2 · 2019-08-14T00:11:42.000Z

Good point. I added your optimization in unaccent_gsub_3b and updated the benchmark and readme. Thanks. Cheers.