sindresorhus/transliterate

Support languages like Chinese, Japanese, Thai, etc.

saginadir opened this issue · 19 comments

It's a cool library, but I'm fearful that it won't slugify everything.

Chinese characters are just deleted.

slugify('你好'); // results in an empty string

I'm curious, what would be the preferable result in this case?

Reading that answer, it appears that there is no single way to slugify Chinese characters. Even converting them to Pinyin, it would be very hard to provide the correct conversion, as the last answer in the question you linked to points out.

If you have the translations handy, you can add them to your project and then slugify the translation. That would probably be easier than asking slugify to also convert from one language to the other. I believe that's outside the scope of what the library was designed to do.

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

Wikipedia URLs contain unicode characters in their paths, so I figured that was OK and I was looking for a lib to do the same for my non-English site.

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

PR welcome for an opt-in option for it.
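Such an opt-in could be sketched roughly like this. This is not the library's actual API — the function names and the choice of preserved ranges (CJK Unified Ideographs plus Hiragana/Katakana) are assumptions for illustration:

```javascript
// Stand-in for the existing behavior: anything outside [a-z0-9] is stripped.
const slugifyBasic = string =>
	string
		.toLowerCase()
		.replace(/[^a-z0-9]+/g, '-')
		.replace(/^-+|-+$/g, '');

// Hypothetical opt-in variant that keeps CJK characters verbatim.
const slugifyPreserveCjk = string =>
	string
		.toLowerCase()
		// Insert a separator at Latin↔CJK boundaries.
		.replace(/([a-z0-9])([\u4E00-\u9FFF\u3040-\u30FF])/g, '$1-$2')
		.replace(/([\u4E00-\u9FFF\u3040-\u30FF])([a-z0-9])/g, '$1-$2')
		// Collapse everything else into dashes, but keep the CJK ranges.
		.replace(/[^a-z0-9\u4E00-\u9FFF\u3040-\u30FF]+/g, '-')
		.replace(/^-+|-+$/g, '');

slugifyBasic('Hello你好'); // 'hello'
slugifyPreserveCjk('Hello你好'); // 'hello-你好'
```

A real implementation would likely take the preserved ranges as an option rather than hardcoding them.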

mark

@sindresorhus Yeah, "Ignores Chinese" is a bad title.

The Japanese get no love either. 残念 (too bad)…

@brandonpittman I definitely intend to support languages like Chinese, Japanese, Thai, etc, but it's more work and will take some time. Help is always welcome though.

If anyone wants to work on this, see the feedback given in sindresorhus/slugify#30.

We are currently using https://www.npmjs.com/package/transliteration but I'd love to use this library instead. Even basic/minimal support for Chinese/Japanese characters would be good enough for what we need.

A little tip about the idea of converting Chinese to Pinyin like 你好 to Nihao:

Conversion to Pinyin can never be 100% accurate, but for most cases the results are totally fine to use as slugs.

But if the generated slugs are expected to be unique, then Pinyin is not a good idea, because it's highly possible that completely different Chinese characters get converted to the same Pinyin. For example, many distinct characters all romanize to Ni, resulting in the same slug.
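The collision can be shown with a tiny hand-rolled table (a real converter would use a full dictionary; the table and function name here are purely illustrative — 你, 尼, and 泥 all romanize to ni):

```javascript
// Minimal character → Pinyin lookup, just enough to demonstrate collisions.
const toPinyin = {你: 'ni', 尼: 'ni', 泥: 'ni', 好: 'hao'};

const slugifyPinyin = string =>
	[...string].map(char => toPinyin[char] ?? char).join('');

slugifyPinyin('你好'); // 'nihao'
slugifyPinyin('尼好'); // 'nihao' — a different title, same slug
```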


as the original author of this issue, this popped up in my email. I read & write in basic Chinese.

I have 2 thoughts about this:

  1. Who said a slug has to be unique? Most of the time, slugs are an additional way to represent the text in an ASCII-only system, and they don't necessarily have to be reversible back to the UTF-8 form.

  2. I can float some ideas for making unique slugs if they are needed. For example, 你 could be changed into ni3, which is Pinyin + tone; 你好 becomes ni3hao3, still a perfectly valid slug. Another way to make it more unique is to use the stroke count, for example: 你好 would be ni7hao6. Still not unique enough? How about a mix of the two: ni37hao36. Even with this format I still can't guarantee uniqueness, because my input can be the same from two different sources, but it will be better than a plain nihao slug.
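The tone / stroke-count scheme above can be sketched with a tiny hardcoded table. The table and function names are assumptions for illustration; a real implementation would need a full character dictionary (你 is 7 strokes, 好 is 6):

```javascript
// Per-character data: Pinyin syllable, tone (1-4), and stroke count.
const table = {
	你: {pinyin: 'ni', tone: 3, strokes: 7},
	好: {pinyin: 'hao', tone: 3, strokes: 6},
};

// Pinyin + tone, e.g. 你好 → ni3hao3.
const slugWithTone = string =>
	[...string].map(c => table[c].pinyin + table[c].tone).join('');

// Pinyin + tone + stroke count, e.g. 你好 → ni37hao36.
const slugWithToneAndStrokes = string =>
	[...string]
		.map(c => table[c].pinyin + table[c].tone + table[c].strokes)
		.join('');

slugWithTone('你好'); // 'ni3hao3'
slugWithToneAndStrokes('你好'); // 'ni37hao36'
```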

  1. Who said a slug has to be unique?

I didn't. I mean that for most cases they are totally fine to be used as slugs, unless uniqueness is required, which totally depends on the actual use case. The reason I mentioned this is that I noticed the current slugify process for supported languages produces unique slugs, though that might just be an unintended side effect.

Another way to make it unique is to use stroke number

That's a good idea to reduce the chance of collisions.


What I can say is that I needed slugs for URLs. For example, someone writes a post titled “我的冬季” (“My Winter”) or something like that. So instead of having a URL with an ID like this: mywebsite.com/post/421321812131, you can make it nicer + better for SEO like this: mywebsite.com/post/wo-de-dong-ji. Uniqueness can be solved by appending the ID: mywebsite.com/post/wo-de-dong-ji-421321812131
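The ID-appending trick is trivial to sketch (the helper name is hypothetical):

```javascript
// Make a readable Pinyin slug unique by appending the post's database ID.
const uniqueSlug = (slug, id) => `${slug}-${id}`;

uniqueSlug('wo-de-dong-ji', '421321812131');
// 'wo-de-dong-ji-421321812131'
```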

I guess everyone will have a different use case.

I've already started looking into developing a uniqueness solution with strokes and tones. But this will be just for fun, and it will be a heavy library that most likely won't be front-end friendly.

Can we add other languages like https://en.wikipedia.org/wiki/Tifinagh (for Berber languages) to this issue, or is it only related to Asian languages?

The solution to allow for some untouched unicode ranges (provided in pull request sindresorhus/slugify#30 that was closed) would be enough for my needs, but I understand it can be a bit difficult to use.

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/
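The "untouched Unicode ranges" idea from that pull request could be sketched like this for the Tifinagh block (U+2D30–U+2D7F); the function name and the sample word are assumptions for illustration:

```javascript
// Keep the Tifinagh block verbatim while slugifying everything else.
const slugifyKeepingRanges = string =>
	string
		.toLowerCase()
		.replace(/[^a-z0-9\u2D30-\u2D7F]+/g, '-')
		.replace(/^-+|-+$/g, '');

slugifyKeepingRanges('Azul ⵜⵉⴼⵉⵏⴰⵖ'); // 'azul-ⵜⵉⴼⵉⵏⴰⵖ'
```

A general option would take the ranges as a parameter instead of hardcoding one block.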

Hey, it's 2024 and I think a bit of extra tech can be used.

I made a GPT for slugify-ing any Chinese text for my blog: https://chat.openai.com/g/g-1jvs433lo-slugifyzhuan-jia

Example: (screenshot)

I've posted the prompt as a gist here so everyone can reproduce and edit it.

Hope this helps in some way.


It's an interesting idea indeed :-)

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/

This URL has changed to https://symbl.cc/