Endle/fireSeqSearch

Cyrillic letters

yoyurec opened this issue · 13 comments

A search word in Cyrillic gets split into individual letters (((

image

Endle commented

This is a bug, but I'm not sure where it happens yet.

Let me explain the current logic:
Step 1, fireSeqSearch reads all your notes and feeds them to tantivy https://docs.rs/tantivy/latest/tantivy/ , and tantivy does the search, including ranking the hits
Step 2, fireSeqSearch adds highlights to the hits with a very naive algorithm. AFAIK tantivy doesn't tell us how it makes its decisions.
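
To make the symptom concrete, here is a minimal sketch of how a naive highlighter could behave. It is a hypothetical illustration only, not fireSeqSearch's actual code: the function names and the `<mark>` markup are assumptions. The point is that if the query reaches the highlighter as one token per letter, every matching letter in the hit gets highlighted, whereas a whole-word token highlights only the word.

```rust
/// Hypothetical: split a query into one token per character,
/// which is what a character-level tokenizer would do to a Cyrillic word.
fn per_char_tokens(query: &str) -> Vec<String> {
    query.chars().map(|c| c.to_string()).collect()
}

/// Hypothetical naive highlighter: scan the text once and wrap the longest
/// token that matches at the current position in <mark>...</mark>.
fn highlight(text: &str, tokens: &[String]) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(ch) = rest.chars().next() {
        // Longest non-empty token that starts at the current position, if any.
        let hit = tokens
            .iter()
            .filter(|t| !t.is_empty() && rest.starts_with(t.as_str()))
            .max_by_key(|t| t.len());
        match hit {
            Some(t) => {
                out.push_str("<mark>");
                out.push_str(t);
                out.push_str("</mark>");
                // Advance past the matched token (byte length, always a char boundary).
                rest = &rest[t.len()..];
            }
            None => {
                // No token matches here: copy one character and move on.
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

With the text "кот и кит" and the query "кот", per-letter tokens mark every "к", "о" and "т" in the text, while the single token "кот" marks only the first word.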

Therefore, although the highlighting in this case is terrible, I have a question for you. Do you think the top hit in this case is a real hit, or a false positive?

> Do you think the top hit in this case is a real hit, or a false positive?

yes, the page title contains the search word

image

but the highlighting is done letter by letter (((

Endle commented

> yes, the page title contains the search word

Thank you, that confirms my first guess. I'll try to fix that part.

Could you please provide some articles[1], so I can run some tests on them?

Thank you

[1]: under an open license such as CC, or ones you hold the copyright to

Endle commented

Hi @yoyurec

Do you have a Rust environment on your workstation?

If so, could you try my PR 5223b0a?

If not, I can provide you with a pre-built binary for it (Win 64)

no rust (((
binary would be awesome! tnx

Endle commented

Hi, you can download the zip file at https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59 . It was compiled with MSYS2 (so it's a bit too big)

The Windows binary should work if you run it from any Windows terminal. If not, I'll provide an MSVC binary tomorrow (GitHub Actions is currently building it)

same result, the same letters are wrongly highlighted - every single letter (not ok) + the whole word (ok) ((
monkeyscript the same or should be updated also?

Endle commented

> monkeyscript the same or should be updated also?

Nope. I'm 100% sure this is a bug on the server side

It seems there were two bugs in my previous code, and I had only fixed one :)
I added a mitigation so that the tokenizer is only applied to Chinese text.
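
As a rough sketch of what such a mitigation could look like (an assumption for illustration, not the code in the actual fix): detect whether the text contains CJK ideographs, and only then take the Chinese tokenization path; otherwise split on whitespace so Cyrillic and Latin words stay whole. The function names are hypothetical, and the per-character split merely stands in for a real Chinese word segmenter.

```rust
/// Hypothetical check: is this character a CJK ideograph?
/// Only a few common Unicode blocks are covered here.
fn is_cjk(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'        // CJK Unified Ideographs
            | '\u{3400}'..='\u{4DBF}'  // CJK Unified Ideographs Extension A
            | '\u{20000}'..='\u{2A6DF}' // CJK Unified Ideographs Extension B
    )
}

/// Hypothetical tokenizer dispatch: use the Chinese path only when the
/// text actually contains Chinese characters.
fn tokenize(text: &str) -> Vec<String> {
    if text.chars().any(is_cjk) {
        // Stand-in for a real Chinese word segmenter: one token per character.
        text.chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect()
    } else {
        // Everything else (Cyrillic, Latin, ...) is kept as whole words.
        text.split_whitespace().map(|w| w.to_string()).collect()
    }
}
```

Note that in this sketch a text mixing Chinese and Cyrillic would still go through the Chinese path as a whole; a finer-grained version would have to switch per run of characters.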

Please go to https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59 and try the v2 binary.
Thanks

Endle commented

> Please go to https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59 and try the v2 binary.

Sorry, I just found a bug that I had just introduced

Please try the v3 binary. Thanks

v3 same (((
image

Endle commented

That's weird, it worked fine on my computer

image

I've merged all the changes into the master branch, and uploaded v4 to https://github.com/Endle/fireSeqSearch/releases/tag/dev_issue59

This time I didn't compile it myself; it was built by GitHub Actions, the same way as the public releases.

If it still fails, can you try running the server with these environment variables set:
RUST_BACKTRACE=1 RUST_LOG=debug

Sorry for making you test so many times :(

maybe something went wrong with testing previously...

now works! 🎉💪🔥🤗
tnx for fixing!

image

> Sorry for making you test so many times :(

np!