sindresorhus/transliterate

Support languages like Chinese, Japanese, Thai, etc.

saginadir opened this issue · 19 comments

It's a cool library, but I'm fearful that it won't slugify everything.

Chinese characters are just deleted.

slugify('你好'); // results in an empty string

I'm curious, what would be the preferable result in this case?

Reading that answer, it appears that there is no single way to slugify Chinese characters. Even converting them to Pinyin, it would be very hard to provide the correct conversion, as the last answer in the question you linked to points out.

If you have the translations handy, you can add them to your project and then slugify the translation. That would probably be easier than asking slugify to also convert from one language to the other. I believe that's outside the scope of what the library was designed to do.

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

Wikipedia URLs contain unicode characters in their paths, so I figured that was OK and I was looking for a lib to do the same for my non-English site.

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

PR welcome for an opt-in option for it.
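Such an opt-in could be sketched roughly like this. This is not the library's actual API — the function names and the choice of preserved ranges (CJK Unified Ideographs plus Hiragana/Katakana) are assumptions for illustration:

```javascript
// Stand-in for the existing behavior: anything outside [a-z0-9] is stripped.
const slugifyBasic = string =>
	string
		.toLowerCase()
		.replace(/[^a-z0-9]+/g, '-')
		.replace(/^-+|-+$/g, '');

// Hypothetical opt-in variant that keeps CJK characters verbatim.
const slugifyPreserveCjk = string =>
	string
		.toLowerCase()
		// Insert a separator at Latin↔CJK boundaries.
		.replace(/([a-z0-9])([\u4E00-\u9FFF\u3040-\u30FF])/g, '$1-$2')
		.replace(/([\u4E00-\u9FFF\u3040-\u30FF])([a-z0-9])/g, '$1-$2')
		// Collapse everything else into dashes, but keep the CJK ranges.
		.replace(/[^a-z0-9\u4E00-\u9FFF\u3040-\u30FF]+/g, '-')
		.replace(/^-+|-+$/g, '');

slugifyBasic('Hello你好'); // 'hello'
slugifyPreserveCjk('Hello你好'); // 'hello-你好'
```

A real implementation would likely take the preserved ranges as an option rather than hardcoding them.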

mark

@sindresorhus Yeah, "Ignores Chinese" is a bad title.

The Japanese get no love either. 残念 (too bad)…

@brandonpittman I definitely intend to support languages like Chinese, Japanese, Thai, etc, but it's more work and will take some time. Help is always welcome though.

If anyone wants to work on this, see the feedback given in sindresorhus/slugify#30.

We are currently using https://www.npmjs.com/package/transliteration but I'd love to use this library instead. Even basic/minimal support for Chinese/Japanese characters would be good enough for what we need.

A little tip about the idea of converting Chinese to Pinyin like 你好 to Nihao:

Conversion to Pinyin can never be 100% accurate, but for most cases the results are totally fine to use as slugs.

But if the generated slugs are expected to be unique, then Pinyin is not a good idea, because it's highly possible that completely different Chinese characters get converted to the same Pinyin. For example, many distinct characters all romanize to Ni, resulting in the same slug.
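The collision can be shown with a tiny hand-rolled table (a real converter would use a full dictionary; the table and function name here are purely illustrative — 你, 尼, and 泥 all romanize to ni):

```javascript
// Minimal character → Pinyin lookup, just enough to demonstrate collisions.
const toPinyin = {你: 'ni', 尼: 'ni', 泥: 'ni', 好: 'hao'};

const slugifyPinyin = string =>
	[...string].map(char => toPinyin[char] ?? char).join('');

slugifyPinyin('你好'); // 'nihao'
slugifyPinyin('尼好'); // 'nihao' — a different title, same slug
```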


as the original author of this issue, this popped up in my email. I read & write in basic Chinese.

I have 2 thoughts about this:

  1. Who said a slug has to be unique? Most of the time, slugs are an additional way to represent the text in an ASCII-only system, and they don't necessarily have to be reversible back to the UTF-8 form.

  2. I can float some ideas for making unique slugs if they are needed. For example, 你 could be changed into ni3, which is Pinyin + tone; 你好 becomes ni3hao3, still a perfectly valid slug. Another way to make it more unique is to use the stroke count, for example: 你好 would be ni7hao6. Still not unique enough? How about a mix of the two: ni37hao36. Even with this format I still can't guarantee uniqueness, because my input can be the same from two different sources, but it will be better than a plain nihao slug.
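The tone / stroke-count scheme above can be sketched with a tiny hardcoded table. The table and function names are assumptions for illustration; a real implementation would need a full character dictionary (你 is 7 strokes, 好 is 6):

```javascript
// Per-character data: Pinyin syllable, tone (1-4), and stroke count.
const table = {
	你: {pinyin: 'ni', tone: 3, strokes: 7},
	好: {pinyin: 'hao', tone: 3, strokes: 6},
};

// Pinyin + tone, e.g. 你好 → ni3hao3.
const slugWithTone = string =>
	[...string].map(c => table[c].pinyin + table[c].tone).join('');

// Pinyin + tone + stroke count, e.g. 你好 → ni37hao36.
const slugWithToneAndStrokes = string =>
	[...string]
		.map(c => table[c].pinyin + table[c].tone + table[c].strokes)
		.join('');

slugWithTone('你好'); // 'ni3hao3'
slugWithToneAndStrokes('你好'); // 'ni37hao36'
```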

  1. Who said a slug has to be unique?

I didn't. I mean that for most cases they are totally fine to be used as slugs, unless uniqueness is required, which totally depends on the actual use case. The reason I mentioned this is that I noticed the current slugify process for supported languages produces unique slugs, though that might just be an unintended side effect.

Another way to make it unique is to use stroke number

That's a good idea to reduce the chance of collisions.


What I can say is that I needed slugs for URLs. For example, someone writes a post titled “我的冬季” (“My Winter”) or something like that. So instead of having a URL with an ID like this: mywebsite.com/post/421321812131, you can make it nicer + better for SEO like this: mywebsite.com/post/wo-de-dong-ji. Uniqueness can be solved by appending the ID: mywebsite.com/post/wo-de-dong-ji-421321812131
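The ID-appending trick is trivial to sketch (the helper name is hypothetical):

```javascript
// Make a readable Pinyin slug unique by appending the post's database ID.
const uniqueSlug = (slug, id) => `${slug}-${id}`;

uniqueSlug('wo-de-dong-ji', '421321812131');
// 'wo-de-dong-ji-421321812131'
```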

I guess everyone will have a different use case.

I've already started looking into developing a uniqueness solution with strokes and tones. But this will be just for fun, and it will be a heavy library that most likely won't be front-end friendly.

Can we add other languages like https://en.wikipedia.org/wiki/Tifinagh (for Berber languages) to this issue, or is it only related to Asian languages?

The solution to allow for some untouched unicode ranges (provided in pull request sindresorhus/slugify#30 that was closed) would be enough for my needs, but I understand it can be a bit difficult to use.

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/
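The "untouched Unicode ranges" idea from that pull request could be sketched like this for the Tifinagh block (U+2D30–U+2D7F); the function name and the sample word are assumptions for illustration:

```javascript
// Keep the Tifinagh block verbatim while slugifying everything else.
const slugifyKeepingRanges = string =>
	string
		.toLowerCase()
		.replace(/[^a-z0-9\u2D30-\u2D7F]+/g, '-')
		.replace(/^-+|-+$/g, '');

slugifyKeepingRanges('Azul ⵜⵉⴼⵉⵏⴰⵖ'); // 'azul-ⵜⵉⴼⵉⵏⴰⵖ'
```

A general option would take the ranges as a parameter instead of hardcoding one block.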

Hey, it's 2024 and I think a bit of extra tech can be used.

I made a GPT for slugify-ing any Chinese text for my blog: https://chat.openai.com/g/g-1jvs433lo-slugifyzhuan-jia

Example: (screenshot)

I've posted the prompt as a gist here so everyone can reproduce and edit it.

Hope this helps in some way.


It's an interesting idea indeed :-)

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/

This URL has changed to https://symbl.cc/