Searching does not work properly for CJK ideographs

Question

nevikw39 opened this issue 5 years ago · 9 comments

Hello,

Telegram's search works poorly with Chinese-Japanese-Korean (CJK) ideographs, which makes it difficult to promote Telegram in Taiwan.

I tried to find the cause. I took a look at MessagesDb.cpp and found that Telegram uses SQLite to store messages and the FTS5 module to build a search table.

And that is the point. FTS5 splits a string into tokens and puts them into a hash table. Suppose there is a text "Telegram search". Only "Telegram" and "search" would match the text, whereas neither "Tele" nor "a" would return any result. Unfortunately, Chinese characters are all categorized as "Letter", so each one is treated as a token character. Hence, a Chinese text like "我好想要中文搜尋", which consists of consecutive Chinese characters without any delimiter, is viewed as a single token. That is, none of "想要", "中文" or "搜尋" would match it.
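The behaviour described above can be reproduced outside Telegram with a few lines of Python. This is only a minimal sketch: it assumes the Python build's SQLite has FTS5 compiled in, it uses FTS5's default `unicode61` tokenizer rather than whatever options MessagesDb.cpp actually passes, and the table name `msgs` and the helper `matches` are made up for illustration.

```python
import sqlite3


def matches(text: str, query: str) -> bool:
    """Return True if an FTS5 MATCH for `query` finds a row containing `text`."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; unicode61 is the default FTS5 tokenizer (an assumption
    # here, not necessarily the configuration Telegram uses).
    conn.execute("CREATE VIRTUAL TABLE msgs USING fts5(body, tokenize='unicode61')")
    conn.execute("INSERT INTO msgs(body) VALUES (?)", (text,))
    # Wrap the query in double quotes so it is read as a plain string,
    # not as FTS5 query syntax.
    quoted = '"' + query.replace('"', '""') + '"'
    row = conn.execute(
        "SELECT count(*) FROM msgs WHERE msgs MATCH ?", (quoted,)
    ).fetchone()
    conn.close()
    return row[0] > 0


if __name__ == "__main__":
    print(matches("Telegram search", "Telegram"))  # whole token
    print(matches("Telegram search", "Tele"))      # partial token
    print(matches("我好想要中文搜尋", "中文"))          # substring of the CJK run
```

The whole-token query is found, while the partial token and the CJK substring are not, because the entire run of ideographs is indexed as one token.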

I have two ideas. The simple one is to insert an invisible separator such as '\a' between every pair of Chinese characters. The other is to implement a custom tokenizer.
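A rough sketch of the first idea, in Python for brevity. The function names and the code point ranges are my own assumptions (the ranges cover only the main CJK ideograph blocks), and a plain space is used as the separator instead of '\a' so the effect is easy to see.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """Rough check covering the main CJK ideograph blocks (not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0x20000 <= cp <= 0x2A6DF   # Extension B
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
    )


def separate_cjk(text: str, sep: str = " ") -> str:
    """Surround every CJK ideograph with `sep` so each becomes its own token."""
    out = []
    for ch in text:
        if is_cjk_ideograph(ch):
            out.append(sep + ch + sep)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Each ideograph becomes a separate whitespace-delimited piece.
    print(separate_cjk("我好想要中文搜尋").split())
    # Non-CJK text is left untouched.
    print(separate_cjk("Telegram search"))
```

The same transformation would have to be applied both to the text being indexed and to the search query. A query such as "中文" then becomes the phrase "中 文", and an FTS5 phrase query matches consecutive tokens, so it would find the message.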

Nonetheless, I can hardly understand how MessagesDb.cpp works. In fact, I don't know how Telegram performs search tasks or how search_id is generated.

So, how can we solve this problem? I would like to contribute to Telegram.

Thanks.

levlam commented 5 years ago

No.


Answer 1 · 2020-04-14T16:48:39.000Z

You have found a client-side search, which is enabled only for secret chat messages. The best way to improve it is to contribute directly to SQLite's FTS extension.
Search for messages in all other chats is done server-side, so there is no way to improve it on TDLib's side.

Answer 2 · 2020-04-15T00:21:17.000Z

OK I see.

So, there is no way to check out Telegram server side code?

Answer 3 · 2020-06-13T23:54:33.000Z

Telegram's search is based on "words", and words are delimited by punctuation or spaces.
This is an English-based search method, which is very convenient for English search. For example, "hello" can't be found by searching for "he"; the full word "hello" must be used. This fits the English context: when I want to find messages containing "he", I don't want to see messages containing "hello". But this approach is not convenient for Chinese and other languages. Chinese is based on Chinese character

https://congcong0806.github.io/2019/11/04/TelegramSearch/

Answer 4 · 2022-02-11T22:32:48.000Z

Any updates on this? Not being able to search CJK characters effectively is a huge pain when using Telegram.

Answer 5 · 2022-04-09T17:43:19.000Z

This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. I don't know why. Is it because Telegram users hardly use CJK languages, or because it is technically difficult to achieve?

Answer 6 · 2022-10-03T03:50:07.000Z

This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. I don't know why. Is it because Telegram users hardly use CJK languages, or because it is technically difficult to achieve?

Actually, I have found lots of CJK users on Telegram. But the search issue is keeping that number from growing.

Answer 7 · 2024-02-07T22:17:14.000Z

Still waiting for a fix... this is important

Answer 8 · 2024-05-18T13:36:05.000Z

I was eager to have this feature before.

Now I eventually switched to Discord with my friends.