grapheo12/iqps

Improve Search Algorithm

Opened this issue · 23 comments

[Sub-issues are ordered by their priorities]

  1. Currently, we use SOUNDEX algorithm as a fuzzy searching mechanism to fetch question papers. While it is robust to some minor spelling mistakes, it is not effective for longer queries.
    Hence we need to improve the searching algorithm. A good choice seems to be BITAP. Sadly, this is not part of the MySQL standard as SOUNDEX is, which means we need to code it ourselves either in MySQL functions or in raw C/C++ and integrate it with the database we are using. Any better suggestion is equally welcome.

  2. Another area in which the search can be improved is field coverage. Currently, only search by subject title is implemented. We can have some commonly known metadata and tags about the subject through which we can search. Also, search by department name and other pre-existing fields can be enabled.

  3. A third area would be to reduce network calls while searching. Currently, a rudimentary version of Debouncing is implemented in the frontend, with a large scope of improvement.

@munditva @shubhraneel @Debraj8301 @swarup3204 You are assigned to this repo. The issue is already divided up into 3 sections, which 2 of you must have a clash. Each one picks one. Let's brainstorm in this thread.
Remember, since this is open-ended, your efforts will be documented and that documentation is also worth 1 commit.

Please self-assign yourself so that I get to know you have started working.

Can I start working on implementing BITAP, the first subissue?

Go ahead

@grapheo12 I am a little unclear about subissue 2. Is search by tags going to be completely exclusive of search by title?
Also, for subissue 3, I found that MFQP returns the results quite fast, so what kind of improvement is needed?

@munditva Search by tags is not currently used anymore. The code is just there. You'll not see the tag filter activated even.
I want something that, you don't need the filters explicitly anymore. They will still be there. But "Compilers 2018" must return papers of the Compilers subject of the year 2018, without needing to set the filters.

At least this should work for the year-filter, we don't need it for other filters, at least for now.

Are you sure you can handle this sub-issue, given your academic pressure? If not you can consult with @shubhraneel and can document his work. That will be much lightweight.

You can also look into this: #24 as a part of this issue.

@grapheo12 I dont have much academic pressure, someone else said that ig.
I would like to work on the third subissue but am not sure how to proceed

@munditva My bad! Sub-issue 3 is open-ended. I'd suggest you work with #24. Then come here.
You can read about Debouncing from here: https://www.youtube.com/watch?v=Zo-6_qx8uxg&ab_channel=AkshaySaini

@swarup3204 Hey! I understand your academic pressure! For you, the best course of action would be to help @shubhraneel in writing and documenting the feature.

@shubhraneel Please talk to him and help.

As I said before, this feature requires a lot of experimentation and fine-tuning. All of this must also be documented. We use Sphinx to document the code.

@grapheo12 I find the code given in Wikipedia is enough if only substitution edit is considered and for deletion and insertion I read up and found this: Fuzzy String search I will need to modify the code in Wikipedia a little according to this and it shall work.

@grapheo12 I'm not getting how to use the C function in MySQL. The examples given run them in the mysql terminal. I'm not getting how do I use the CREATE FUNCTION in search/views.py. I tried searching a lot but got nothing.

@grapheo12 I find the code given in Wikipedia is enough if only substitution edit is considered and for deletion and insertion I read up and found this: Fuzzy String search I will need to modify the code in Wikipedia a little according to this and it shall work.

Yes go ahead

@grapheo12 I'm not getting how to use the C function in MySQL. The examples given run them in the mysql terminal. I'm not getting how do I use the CREATE FUNCTION in search/views.py. I tried searching a lot but got nothing.

You need to create a separate Dockerfile for Mysql. Just the base image won't work I guess. Don't worry about that now. Do one thing, set up Mariadb separately in your computer (native installation, not the docker one).I'll send you a dummy database to test your queries on.

I created https://github.com/grapheo12/mariadb-udf-docker to replace the db image we use. Can someone (except @shubhraneel :-) ) take up the task of integrating it into this repo? Ping here in case of doubt.

I was eager to test it out, so I built the sql new image and all that. It is working. It is applying the bitap algorithms. I will update the pull request. Also, I think I should add make -C ./mariadb-udf-bitap in install.sh, and change the template to use this image. I should do that, right?

@grapheo12 It's perfectly applying the bitap algorithm without much problems but I think the algorithm needs to be modified a little bit to suit this particular application.

Other than preprocessing the text to ignore hyphens or stuff, I observed that searching 'Algoryhms', shows 'graph theory and algorithms" much higher than needed. I should prioritize first letter matching, I guess.

@grapheo12 I have updated the Pull request, it is supposed to run properly with BITAP algorithms on searching in the search bar. There are a few minor modifications that may be needed to be made as it still uses the actual fuzzy BITAP algorithm with support for substitution, deletion, insertion. I may need to do some preprocessing of the text. I have tested the algorithm and it works better than SOUNDEX, I do not know if it is fruitful to increase the performance (accuracy, not speed) further as it feels for the scale of this application, the results are quite sufficient.

@swarup3204 This isn't the place to comment it. Isn't it?