Cyrillic letters

Question

yoyurec opened this issue 2 years ago · 13 comments

A search word in Cyrillic gets split into individual letters (((

Answer 1 · 2022-11-10T15:58:39.000Z

This is a bug, but I'm not sure where it happens yet.

Let me explain the current logic:
Step 1, fireSeqSearch reads all your notes and feeds them to tantivy https://docs.rs/tantivy/latest/tantivy/ , and tantivy does the search, including ranking the hits
Step 2, fireSeqSearch adds highlights to the hits with a very naive algorithm. AFAIK tantivy doesn't tell us how it makes its decisions.
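
To illustrate how a naive highlighter in Step 2 could produce per-letter highlights, here is a minimal sketch. It is not fireSeqSearch's actual code: the function names `tokens_per_char`, `tokens_per_word`, and `highlight` are hypothetical, and the `<mark>` tag is an assumed output format. It only shows that if the query reaches the highlighter as single characters, every matching letter in the note gets wrapped.

```rust
/// Hypothetical failure mode: the query is split into single characters.
fn tokens_per_char(query: &str) -> Vec<String> {
    query
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_string())
        .collect()
}

/// Expected behaviour for space-separated scripts: keep whole words.
fn tokens_per_word(query: &str) -> Vec<String> {
    query.split_whitespace().map(str::to_string).collect()
}

/// Naive single-pass highlighter (assumed output format: <mark>...</mark>).
/// At each position it wraps the longest token that matches there.
fn highlight(text: &str, tokens: &[String]) -> String {
    let mut out = String::new();
    let mut rest = text;
    while !rest.is_empty() {
        let hit = tokens
            .iter()
            .filter(|t| !t.is_empty() && rest.starts_with(t.as_str()))
            .max_by_key(|t| t.len());
        match hit {
            Some(t) => {
                out.push_str("<mark>");
                out.push_str(t);
                out.push_str("</mark>");
                // `starts_with` guarantees this is a valid char boundary.
                rest = &rest[t.len()..];
            }
            None => {
                // No token matches here: copy one character and move on.
                let ch = rest.chars().next().unwrap();
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

With whole-word tokens only the word itself is wrapped; with per-character tokens every occurrence of any letter from the query is wrapped separately, which matches the symptom reported in this issue.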

Therefore, although the highlighting in this case is terrible, I have a question for you. Do you think the top hit in this case is a real hit, or a false positive?

Answer 2 · 2022-11-10T18:49:35.000Z

Do you think the top hit in this case is a real hit, or a false positive?

yes, the page title contains the search word

but the highlighting is done letter by letter (((

Answer 3 · 2022-11-10T19:29:06.000Z

yes, page titles contains search word

Thank you, that confirms my first guess. I'll try to fix that part.

Could you please provide some articles[1], so I can run some tests on them?

Thank you

[1]: under an open license such as CC, or ones you hold the copyright to

Answer 4 · 2022-11-11T07:20:36.000Z

search for the word "статья" - https://www.google.com/search?q=%D1%81%D1%82%D0%B0%D1%82%D1%8C%D1%8F&oq=%D1%81%D1%82%D0%B0%D1%82%D1%8C%D1%8F&sourceid=chrome&ie=UTF-8

demo file: Тестовая статья.md

Answer 5 · 2022-11-12T01:09:53.000Z

Hi @yoyurec

Do you have a Rust environment on your workstation?

If so, could you try my PR 5223b0a?

If not, I can provide a pre-built binary for it (Win 64)

Answer 6 · 2022-11-12T03:36:11.000Z

no rust (((
binary would be awesome! tnx

Answer 7 · 2022-11-12T04:01:34.000Z

Hi, you can download the zip file at https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59 . It was compiled with MSYS2 (so it's a bit too big)

The Windows binary should be fine if you run it from any Windows terminal. If not, I'll provide an MSVC binary tomorrow (GitHub Actions is currently working on it)

Answer 8 · 2022-11-12T09:40:07.000Z

same result, the letters are still highlighted wrongly - every single letter (not ok) + the whole word (ok) ((
is the monkeyscript the same, or should it be updated too?

Answer 9 · 2022-11-12T15:29:16.000Z

monkeyscript the same or should be updated also?

Nope. I'm 100% sure this is a bug on the server side

It seems there were two bugs in my previous code, and I had only fixed one :)
I added a mitigation so that the tokenizer is only applied to Chinese text.
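
A rough sketch of what such a mitigation could look like is shown below. This is an assumption for illustration, not the code from the actual fix: `contains_chinese`, `segment_chinese_stub`, and `tokenize_query` are hypothetical names, "Chinese" is approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the segmenter is a placeholder that emits one token per ideograph instead of calling a real word segmenter.

```rust
/// Assumption: treat a string as Chinese if it contains at least one
/// character from the CJK Unified Ideographs block.
fn contains_chinese(s: &str) -> bool {
    s.chars().any(|c| ('\u{4E00}'..='\u{9FFF}').contains(&c))
}

/// Placeholder for a real Chinese word segmenter (hypothetical):
/// here every non-whitespace character becomes its own token.
fn segment_chinese_stub(s: &str) -> Vec<String> {
    s.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_string())
        .collect()
}

/// Only route Chinese text through the segmenter; everything else
/// (Cyrillic, Latin, ...) is split on whitespace and kept as whole words.
fn tokenize_query(query: &str) -> Vec<String> {
    if contains_chinese(query) {
        segment_chinese_stub(query)
    } else {
        query.split_whitespace().map(str::to_string).collect()
    }
}
```

Under these assumptions a Cyrillic query such as "статья" stays a single token, so a highlighter downstream has no per-letter tokens to match.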

Please go to https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59 and try the v2 binary.
Thanks

Answer 10 · 2022-11-12T16:17:05.000Z

Please go to https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59 and try the v2 binary.

Sorry, I found a bug that I had just introduced

Please try the v3 binary. Thanks

Answer 11 · 2022-11-14T12:41:48.000Z

v3 same (((

Answer 12 · 2022-11-14T15:20:11.000Z

It's weird, it worked fine on my computer

I've merged all the changes into the master branch, and uploaded v4 to https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59

This time I didn't compile it myself; it was built with GitHub Actions, the same way as the public releases.

If it still fails, could you try running the server with
RUST_BACKTRACE=1 RUST_LOG=debug

Sorry for making you test so many times :(

Answer 13 · 2022-11-14T17:25:44.000Z

maybe something went wrong with testing previously...

now works! 🎉💪🔥🤗
tnx for fixing!

Sorry for letting you test so many times :(

np!