LukeSmithxyz/lb

Needed: An elegant way to strip fancy Unicode characters from strings used for URLs

LukeSmithxyz opened this issue · 2 comments

Opening as a note to myself or an invitation to others.

Right now, URLs are created from literal titles as follows:

url="$(echo "$title" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr ' ' '-')"

That is: take the title, delete punctuation, lowercase everything, and turn spaces into -. Simple, and it works about 98% of the time.
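For illustration, the same pipeline can be wrapped in a small function. This is only a sketch: the name slugify_basic is hypothetical and not part of lb, and printf is swapped in for echo so that titles starting with a dash or containing backslashes are passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not part of lb): the pipeline above as a function.
slugify_basic() {
    # printf '%s' prints the title verbatim; echo may interpret
    # a leading "-n" or backslash escapes, depending on the shell.
    printf '%s' "$1" |
        tr -d '[:punct:]' |          # delete punctuation
        tr '[:upper:]' '[:lower:]' | # lowercase everything
        tr ' ' '-'                   # turn spaces into hyphens
}

slugify_basic "Hello, World!"   # prints: hello-world
```

Plain ASCII titles come out fine this way; emojis and accented letters are not touched by any of these steps.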

But there are some characters that will prevent proper RSS validation, such as emojis.

Beyond that, I'd like to replace accented characters with their non-accented equivalents so that URLs are universally typeable.

Hi Luke,

Something like this might help solve your issue:

url="$(echo "$title" | iconv -cf UTF-8 -t ASCII//TRANSLIT | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr ' ' '-')"

When I give it the following text:

"c'est la fête à mémé et la señora n'est plus là! 🙃"

It gives me back:

"cest-la-fete-a-meme-et-la-senora-nest-plus-la-"
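Note the trailing - in that result: the space left in front of the emoji is still turned into a hyphen. If that is unwanted, one possible variant squeezes runs of spaces into a single hyphen and trims hyphens from both ends. This is a minimal sketch, assuming an iconv that supports //TRANSLIT (glibc's does); the function name slugify_ascii is hypothetical, and the exact transliteration may vary with the iconv implementation and the active locale.

```shell
#!/bin/sh
# Hypothetical helper: the iconv pipeline plus hyphen cleanup.
# Assumes iconv supports the //TRANSLIT suffix (e.g. glibc).
slugify_ascii() {
    printf '%s' "$1" |
        iconv -cf UTF-8 -t ASCII//TRANSLIT | # transliterate; -c drops what cannot be converted
        tr -d '[:punct:]' |                  # delete punctuation
        tr '[:upper:]' '[:lower:]' |         # lowercase everything
        tr -s ' ' '-' |                      # spaces to hyphens, squeezing repeats
        sed 's/^-//; s/-$//'                 # trim a leading or trailing hyphen
}

slugify_ascii "c'est la fête à mémé et la señora n'est plus là! 🙃"
```

The -s flag makes tr collapse consecutive hyphens after translation, so at most one hyphen can remain at either end for sed to remove.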

Hope it helps a bit ;)

Thanks, that looks really good. I'll play around with it for a bit to make sure it's doing what I need.