Needed: an easy, elegant way to remove fancy Unicode characters from strings used in URLs
LukeSmithxyz opened this issue · 2 comments
Opening as a note to myself or an invitation to others.
Right now, URLs are created from literal titles as follows:
url="$(echo "$title" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr ' ' '-')"
I.e. take the title, delete punctuation, lowercase everything, and turn spaces into -. Simple. Works 98% of the time.
But some characters, such as emojis, will prevent proper RSS validation.
Beyond that, I'd like to replace accented characters with their unaccented equivalents so that URLs are universally typeable.
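For anyone who wants to see the pipeline in action, here is a minimal sketch. The function name title_to_url and the sample titles are made up for illustration; only the tr pipeline itself comes from the line above.

```shell
# Hypothetical wrapper around the existing pipeline, for experimenting only.
title_to_url() {
    echo "$1" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr ' ' '-'
}

# Made-up sample titles (ASCII only).
title_to_url "Hello, World! It's 2021"   # prints hello-world-its-2021
title_to_url "Rust & Go"                 # prints rust--go
```

The second call shows one of the edge cases: deleting the & leaves two spaces behind, and each one becomes its own hyphen.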
Hi Luke,
Something like this might help solve your issue:
url="$(echo "$title" | iconv -cf UTF-8 -t ASCII//TRANSLIT | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr ' ' '-')"
When I give it the following text:
"c'est la fête à mémé et la señora n'est plus là! 🙃"
It gives me back:
"cest-la-fete-a-meme-et-la-senora-nest-plus-la-"
Hope it helps a bit ;)
Thanks, that looks really good. I'll play around with it for a bit to make sure it's doing what I need.
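One thing worth checking while testing: the sample output above ends in a stray hyphen, because the space before the emoji survives after the emoji itself is dropped. A possible variation, purely as a sketch, squeezes runs of whitespace into a single hyphen and trims hyphens from both ends. The function name slugify is hypothetical, and the tr -s and sed steps are additions that are not part of the suggested one-liner. What iconv does with accented characters under //TRANSLIT depends on the iconv implementation and the current locale, so the accented cases should be verified on the target system.

```shell
# Hypothetical helper; a sketch, not the script's actual implementation.
slugify() {
    printf '%s' "$1" |
        iconv -c -f UTF-8 -t ASCII//TRANSLIT |  # transliterate, drop what cannot be converted
        tr -d '[:punct:]' |                     # delete punctuation
        tr '[:upper:]' '[:lower:]' |            # lowercase
        tr -s '[:space:]' '-' |                 # whitespace runs become one hyphen
        sed 's/^-*//; s/-*$//'                  # trim leading and trailing hyphens
}

# Made-up ASCII sample titles.
url="$(slugify "  Hello,  World! ")"
echo "$url"    # prints hello-world
url="$(slugify "Rust & Go")"
echo "$url"    # prints rust-go
```

Note that tr -d '[:punct:]' also deletes hyphens already present in a title, exactly as the original pipeline does.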