bluesky-social/indigo

Improve the search quality especially for CJK queries

hkurokawa opened this issue · 9 comments

Hello,

I am not sure whether this is the right place to report an issue about the search quality of bsky.app. If it is not, please feel free to point me to the right point of contact.

I found that the full-text search quality is poor, especially for CJK (Chinese, Japanese, and Korean) queries. I guess this is because Elasticsearch is using its default analyzer, which tokenizes CJK text in a unigram-like way (roughly one token per character). CJK languages do not use whitespace to separate words, so full-text search on them needs language-aware tokenization. Unigram tokenization is the most naive approach and is not very useful most of the time.

Steps to reproduce

  1. Run curl 'https://search.bsky.social/search/posts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6' (The decoded query is 熱力学, which means "thermodynamics" in Japanese.)

Result

At step 1, the response includes posts that do not contain "熱力学" but do contain "熱" (heat), "力" (power), and "学" (study) separately.

For example, a post such as "学校から帰って熱いお風呂に入ったら力一杯がんばる" (which means "I will do my best after coming back from school and taking a hot bath") would be included in the response.
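The mismatch described above can be reproduced with a small sketch. It assumes a simplified model in which each non-whitespace character becomes one token and every query token must appear somewhere in the post; the real analyzer and query settings of the search service are not known here, and the function names are hypothetical.

```python
def unigram_tokens(text: str) -> set[str]:
    """Split text into single-character tokens, ignoring whitespace."""
    return {ch for ch in text if not ch.isspace()}


def unigram_match(query: str, document: str) -> bool:
    """Return True when every query character occurs somewhere in the document.

    This is a simplified stand-in for a unigram-based full-text match,
    not the actual Elasticsearch scoring logic.
    """
    return unigram_tokens(query) <= unigram_tokens(document)


query = "熱力学"
post = "学校から帰って熱いお風呂に入ったら力一杯がんばる"

# The post contains 熱, 力, and 学 separately, so the unigram match succeeds...
print(unigram_match(query, post))
# ...even though the word 熱力学 itself never appears in the post.
print(query in post)
```

Under these assumptions the unigram match succeeds while the plain substring check fails, which is the false positive reported in this issue.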

Expected result

At step 1, posts that do not contain "熱力学" are not included in the response.

Remarks

Someone may think that a phrase search (e.g., a phrase surrounded by double quotes) would solve the issue. It may help to some extent, but it would not solve the issue entirely. For example, "東京都" (Tokyo) and "京都" (Kyoto) are completely separate words in Japanese, so if a search for "京都" returns a post containing "東京都", the user would think that the search is useless.
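A rough sketch of why a phrase search over single-character tokens still produces this false positive is shown below. The phrase matcher is a simplified model, and the word-level segmentation is written by hand for illustration only; it is not the output of any real analyzer.

```python
def phrase_match_unigrams(query: str, document: str) -> bool:
    """Phrase match over single-character tokens.

    Succeeds when the query characters appear consecutively and in order,
    which is equivalent to a substring match.
    """
    q, d = list(query), list(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


post = "東京都に住んでいます"  # "I live in Tokyo"

# With unigram tokens, the phrase 京都 (Kyoto) matches inside 東京都 (Tokyo).
print(phrase_match_unigrams("京都", post))

# Hand-written, hypothetical word segmentation of the same post.
# A word-aware tokenizer could produce something along these lines.
word_tokens = ["東京都", "に", "住ん", "で", "い", "ます"]

# At the word level, 京都 is not one of the tokens, so it does not match.
print("京都" in word_tokens)
```

In this model the unigram phrase match succeeds for the query 京都 even though the post is about Tokyo, while the word-level lookup correctly finds nothing.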

I would suggest configuring and using a better Elasticsearch analyzer for certain languages such as CJK. Please feel free to ask me if you have any questions. Thanks!

Hi @hkurokawa, thanks for the detailed ticket!

We have a work-in-progress branch that changes how we use OpenSearch (a fork of Elasticsearch, based on Lucene), including using the ICU plugin's tokenization, normalization, and folding rules:

https://github.com/bluesky-social/indigo/pull/263/files#diff-a7cd828df6438861fe3ec63c63ca68be113cd0f7d670d0f52d371c27f3bea81e
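For readers unfamiliar with the ICU analysis plugin, index settings that use its tokenizer, normalizer, and folding filter might look roughly like the sketch below. This is an illustration only and is not taken from the linked pull request: the analyzer name `text_icu` and the field name `text` are hypothetical, while `icu_tokenizer`, `icu_normalizer`, and `icu_folding` are component names provided by the plugin.

```python
import json

# Hypothetical index settings; not the configuration from the linked branch.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                # "text_icu" is an assumed name chosen for this example.
                "text_icu": {
                    "type": "custom",
                    # Dictionary-based word segmentation from the ICU plugin.
                    "tokenizer": "icu_tokenizer",
                    # Unicode normalization, then case and accent folding.
                    "filter": ["icu_normalizer", "icu_folding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # "text" is an assumed field name for the post body.
            "text": {"type": "text", "analyzer": "text_icu"}
        }
    },
}

# Serialize to JSON, the form in which such settings would be sent
# when creating an index.
print(json.dumps(index_settings, indent=2, ensure_ascii=False))
```

The ICU components require the analysis-icu plugin to be installed on the cluster; whether this setup fully resolves the cases in this issue would still need testing against real queries.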

I'm not positive this will resolve your specific issues, but it may, and we can do some testing with the examples you give.

Thank you for the update, @bnewbold. Sure, let's revisit this after your change lands. Please feel free to close the issue or keep it around if you want to track this somewhere. Up to you. Thanks!

I haven't tested deeply, but for the specific examples you give I think the new index config should work (#263).

Hello,

Really sorry for the delay in my response. I somehow missed the notification email. I ran the curl command today and I am afraid the issue is not fixed yet.

Please try running the command and check whether all the returned posts contain "熱力学" in their text. I understand that it may be hard to recognize the Japanese text, so you could just grep the text for the term.
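As an alternative to grep, the check can be sketched in a few lines of Python. The sample texts below are made up for illustration; in practice they would be the post texts extracted from the search response, whose exact JSON shape is not assumed here.

```python
def posts_missing_term(texts: list[str], term: str) -> list[str]:
    """Return the post texts that do not contain the search term."""
    return [text for text in texts if term not in text]


# Made-up sample data standing in for texts taken from a search response.
sample_texts = [
    "熱力学の講義が面白かった",
    "学校から帰って熱いお風呂に入ったら力一杯がんばる",
]

missing = posts_missing_term(sample_texts, "熱力学")
print(f"{len(missing)} of {len(sample_texts)} posts do not contain the term")
for text in missing:
    print(text)
```

Any post listed by this check is a false positive of the kind reported in this issue.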

Please feel free to let me know if there is anything I can help you to address the issue. Thanks!

The new search endpoint is:

http get 'https://api.bsky.app/xrpc/app.bsky.feed.searchPosts?q=%E7%86%B1%E5%8A%9B%E5%AD%A6'

Unfortunately, this new version of search still hasn't arrived in the app for post search (it is being used for profile search).

Thanks for the prompt update! Got it. I confirmed that the new endpoint returns much better results. Great job!

Ok, we finally shipped these changes in the app. You may need to refresh the web app, or wait a bit for mobile app updates.

I did a bit of testing and I don't think we have "solved" this issue yet. I hope it is at least a bit better? But curious for your feedback.

Thank you so much for your hard work. I tried some queries and the results seem much better than before. Specifically, my original issue is resolved, so I am going to close the issue.

I will test other queries going forward and will file another issue if I find anything. So far, it works really well. Great job and much appreciated. Thanks!

Thank you for your excellent original report, patience, and kind words!