simov/slugify

Remove option breaks main slugify functionality

khromov opened this issue · 7 comments

Reproduction script:

const slugify = require('slugify');

const fileName = decodeURIComponent('a%CC%8Aa%CC%88o%CC%88-123'); // åäö-123
const fileName2 = decodeURIComponent('%0A%C3%A5%C3%A4%C3%B6-123'); // åäö-123

const withoutRemove = slugify(fileName);

const withRemove = slugify(fileName, {
  remove: /[*+~.()'"!:@]/g,
});

const withRemoveAlternativeEncoding = slugify(fileName2, {
  remove: /[*+~.()'"!:@]/g,
});

console.log(withoutRemove);
console.log(withRemove);
console.log(withRemoveAlternativeEncoding);

Result

aao-123
åäö-123 # Broken
aao-123

I can't reproduce this when typing the "åäö" characters on my keyboard; however, if I paste them from my console or a browser URL, the behaviour can be reproduced. I suspect there is a problem with different code points, similar to this:
ipython/ipython#522

I've updated the reproduction script to show the alternative encoding that causes the issue. I believe it's Safari that causes this when it fetches URLs / uploads files.

Trott commented

I'm able to reproduce this all the way back to the introduction of the options object in 325bcaa so this isn't anything that was once working differently and then changed. Just in case that tidbit is helpful at all. (Not sure that it is.)

simov commented

Yes, it is helpful, no need to delete comments. Thanks.

simov commented

With encoding one you get something like this:

| a  | ̊  | a  | ̈  | o  | ̈  | -  | 1  | 2  | 3  |

And with encoding two you get the expected precomposed characters (plus a leading newline from the %0A):

| \n | å | ä | ö | - | 1 | 2 | 3 |

Note that å, ä, ö are part of the built-in charmap and so they get replaced correctly.

When using encoding one with the default remove regex, you are left with the impression that the module works correctly, but that's only because the default regex removes anything outside of \w, so in this case you are left with the correct characters by chance.

When using encoding one, slugify cannot replace the characters with the ones in the charmap, and when you add a custom remove regex on top that does not filter anything outside of \w, you end up with nothing being replaced.
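That interaction can be illustrated with plain regexes (a sketch only, standing in for slugify's behaviour, not the module's actual implementation):

```javascript
// Sketch only: plain regexes approximating slugify's behaviour,
// not the module's actual internals.
const decomposed = decodeURIComponent('a%CC%8Aa%CC%88o%CC%88-123'); // a+U+030A, a+U+0308, o+U+0308, -123

// The default behaviour strips anything outside \w (and the separator),
// which happens to drop the combining marks and leave plain "aao-123":
const defaultLike = decomposed.replace(/[^\w-]/g, '');
console.log(defaultLike); // 'aao-123'

// A custom remove regex such as /[*+~.()'"!:@]/g matches none of the
// combining marks, so the decomposed sequence passes through untouched:
const customLike = decomposed.replace(/[*+~.()'"!:@]/g, '');
console.log(customLike === decomposed); // true
```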

khromov commented

👋 @simov

I understand the problem. It appears this is called a "Decomposed character sequence" and is an alternate way of writing certain characters.

There is a good visual example here:
https://superpixel.ch/bugreport/safari-diacritic/

I saw some discussion regarding this on SO, pointing out Chrome / Safari as the culprit.

It feels like the current solution will work just fine in a majority of cases, since it separates the main character (which won't be filtered) from all the diacritics and such (that will be filtered). But it would be nice if it could somehow work with remove.

I noticed there's a String.prototype.normalize function that seems to do what we want. Maybe one option could be to apply it before any other code, but it would be hard to assess its implications for existing codebases.

Example:

Before normalization (å character only):

console.log(Array.from(decodeURIComponent('a%CC%8A'))
  .map((v) => v.codePointAt(0).toString(16))
  .map((hex) => "\\u" + "0000".substring(0, 4 - hex.length) + hex));

> Array [ "\\u0061", "\\u030a" ] // a,  ̊

After normalization:

console.log(Array.from(decodeURIComponent('a%CC%8A').normalize())
  .map((v) => v.codePointAt(0).toString(16))
  .map((hex) => "\\u" + "0000".substring(0, 4 - hex.length) + hex));

> Array [ "\\u00e5" ] // å - great success!
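Putting that together, a minimal sketch of the normalize-first idea (the slugifyNFC wrapper below is hypothetical, not part of the slugify API):

```javascript
// Sketch: compare the decomposed and precomposed forms, then normalize.
const decomposed = decodeURIComponent('a%CC%8Aa%CC%88o%CC%88-123'); // a + combining marks
const precomposed = '\u00e5\u00e4\u00f6-123';                       // å ä ö as single code points

console.log(decomposed === precomposed);                  // false: same rendering, different code points
console.log(decomposed.normalize('NFC') === precomposed); // true: NFC composes the pairs

// Hypothetical wrapper — normalize before handing the string to slugify:
// const slugify = require('slugify');
// const slugifyNFC = (s, opts) => slugify(s.normalize('NFC'), opts);
// slugifyNFC(decomposed, { remove: /[*+~.()'"!:@]/g }); // would yield 'aao-123'
```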
simov commented

Sounds good to me. Seems to be pretty safe to use since it was introduced around 6 years ago.

simov commented

Published in v1.5.0 2df63f9