tyxla/remove-accents

Eszett

Closed this issue · 12 comments

Hi, the Eszett character is incorrectly substituted for a single 's':

$ npm t

> remove-accents@0.4.1 test /var/www/remove-accents
> tape test.js

TAP version 13
# remove accents from string
ok 1 should be equivalent
# do not modify non-accented strings
not ok 2 should be equivalent
  ---
    operator: deepEqual
    expected: |-
      'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz123456789.,:;~`!@#$%^&*()-_=+[]{}\'"|\\<>?/ß'
    actual: |-
      'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz123456789.,:;~`!@#$%^&*()-_=+[]{}\'"|\\<>?/s'
    at: Test.<anonymous> (/var/www/remove-accents/test.js:18:4)
    stack: |-
      Error: should be equivalent
          at Test.assert [as _assert] (/var/www/remove-accents/node_modules/tape/lib/test.js:212:54)
          at Test.bound [as _assert] (/var/www/remove-accents/node_modules/tape/lib/test.js:64:32)
          at Test.deepEqual.Test.deepEquals.Test.isEquivalent.Test.same (/var/www/remove-accents/node_modules/tape/lib/test.js:380:10)
          at Test.bound [as same] (/var/www/remove-accents/node_modules/tape/lib/test.js:64:32)
          at Test.<anonymous> (/var/www/remove-accents/test.js:18:4)
          at Test.bound [as _cb] (/var/www/remove-accents/node_modules/tape/lib/test.js:64:32)
          at Test.run (/var/www/remove-accents/node_modules/tape/lib/test.js:83:10)
          at Test.bound [as run] (/var/www/remove-accents/node_modules/tape/lib/test.js:64:32)
          at Immediate.next [as _onImmediate] (/var/www/remove-accents/node_modules/tape/lib/results.js:71:15)
          at processImmediate [as _immediateCallback] (timers.js:396:17)
  ...

1..2
# tests 2
# pass  1
# fail  1

The expected behaviour should be either:

  • convert 'ß' -> 'ss'
  • leave 'ß' as 'ß'

I looked through the code and was also surprised by the behaviour, maybe it is an artifact of the regular expression?

tyxla commented

Hey there @missinglink! Thanks so much for chiming in.

I'm happy to fix that one - which of the suggestions solutions do you think would make more sense for German? Maybe just leaving it unchanged, because it's not exactly an accented character?

Thanks!

hey @tyxla I think you're right, it's better left unchanged :)

tyxla commented

👍 thanks for confirming, this one is on my TODO list now 😉

as an aside, I noticed in some other work I have been doing that the javascript regex engine doesn't deal with German (and presumably other languages) very well.

for example, using the \b (word boundary anchor) produces unexpected results like this:

'zwiestädter straße'.replace(/\b\w/g, function(l){ return l.toUpperCase(); });
"ZwiestäDter StraßE"

You may even be able to exploit this unusual behavior to detect accented characters.

tyxla commented

This one is addressed by #15. Thanks @missinglink!

Hi @tyxla, just a heads-up that at some point recently a regression was made to this feature.

Back in 2017 we discussed Eszett above and decided that it's best to leave the symbol unmodified rather than converting it to either a single or double s.

I still think that's ideal as Eszett doesn't contain diacritics, however at some point this behavior was changed.

ref: pelias/api#1644

tyxla commented

Hey @missinglink,

Yeah, although folks have been requesting this and in #31 we decided to land it. I wonder if it would be a solution to update tests on your side? Let me know what you think.

Heya, unfortunately it's not as easy as updating the tests.

We're using this lib in Pelias in order to normalize postal address data before ingestion into elasticsearch.

Pelias is widely adopted in the public & private sectors and has a bunch of stars on GitHub, so we need to ensure we don't change search behavior.

The main issue is we end up losing precision, since Straße==Strasse but Session!=Seßion, some information has been lost.

My 2c is that Eszett specifically shouldn't be modified by this lib as it doesn't contain diacritics and the name of the lib is remove-accents, IMO Eszett is not accented.

However I totally understand and respect if the community and maintainers would like to expand the functionality to cover all 'ASCII-folding' tasks (ie. attempts to transliterate input text into the ASCII range).

If that's the direction you'd like to go then we can probably just pin to an older version for now.

tyxla commented

Hey @missinglink, thanks for the extra context - it's really helpful!

Given the rationale you provided, I'd be happy to reconsider change. It's unfortunate that some folks need that and others don't, but additional context always helps us lean in one direction or the other.

Mind filing a PR to address it? I'm happy to ship it and release a new version with it.

Mind filing a PR to address it? I'm happy to ship it and release a new version with it.

I've done this in #44

Thanks, sorry I didn't find the time to get to this yet 🙇