search_score

Experimenting with String Matching approaches in C


Searching text with C

How it works

Pull out all the ordered pairs of words, then use TF-IDF weighting on those; this way there is no stemming or anything else to worry about.
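A minimal sketch of the pair-extraction step, assuming whitespace-delimited tokens that are lower-cased before pairing (the real tokenizer may differ):

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    #define MAX_TOKEN 64

    /* Read whitespace-delimited words from a stream and print every
     * ordered pair of adjacent words, one pair per line.  These pairs
     * are the terms that TF-IDF weighting is later applied to. */
    static void print_word_pairs(FILE *in)
    {
        char prev[MAX_TOKEN] = "";
        char curr[MAX_TOKEN];

        while (fscanf(in, "%63s", curr) == 1) {
            /* normalise to lower case so "The cat" and "the cat" match */
            for (char *p = curr; *p; p++)
                *p = (char)tolower((unsigned char)*p);

            if (prev[0] != '\0')
                printf("%s %s\n", prev, curr);

            strcpy(prev, curr);
        }
    }

    int main(void)
    {
        print_word_pairs(stdin);
        return 0;
    }

For example, echo 'the damped oscillator resonates' piped into this would print "the damped", "damped oscillator" and "oscillator resonates".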

Why

It means that a search query does not need to contain the exact words of a note, unlike Recoll. Unlike grep, this approach also works over multiple lines; grep only finds matches and has no logic for handling multiple terms. For example, .*damp.* and .*reson.* could be used one after the other with a pipe |, but in that case the absence of one term removes a candidate entirely rather than merely penalizing it.

This approach aims to be a compromise between the two.

Not needing an index would be really cool as well, but optionally having one would be really convenient to speed things up.

Performance Considerations

With Python, calling the program takes 1.7 times longer than the program itself takes to run; there is overhead in calling a program. This will be far less for C, but the loop and the sorting should be incorporated inside C for the best performance.

                  C          Python
Execution Time    1*10^-3    0.3
Evaluation Time   4*10^-5    0.5
Diff              1 ms       0.2 s
time echo 'some words here' | ./search.py 'some words'   
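A rough sketch of what incorporating the loop and the sorting inside a single C process could look like. score_file is a hypothetical stand-in for whatever scoring the real program does (shown here as a plain word count, not the pair-based TF-IDF scoring):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct result {
        const char *path;
        double      score;
    };

    /* Placeholder scorer: counts exact occurrences of a single-word query.
     * The real program would use pair-based TF-IDF scoring instead. */
    static double score_file(const char *path, const char *query)
    {
        FILE *f = fopen(path, "r");
        if (!f)
            return 0.0;
        char tok[256];
        double hits = 0.0;
        while (fscanf(f, "%255s", tok) == 1)
            if (strcmp(tok, query) == 0)
                hits += 1.0;
        fclose(f);
        return hits;
    }

    /* qsort comparator: best match first. */
    static int by_score_desc(const void *a, const void *b)
    {
        const struct result *ra = a, *rb = b;
        if (ra->score < rb->score) return 1;
        if (ra->score > rb->score) return -1;
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s query file...\n", argv[0]);
            return 1;
        }

        int nfiles = argc - 2;
        struct result *results = malloc((size_t)nfiles * sizeof *results);
        if (!results)
            return 1;

        /* One process scores every file, so there is no per-file
         * overhead from spawning the program again and again. */
        for (int i = 0; i < nfiles; i++) {
            results[i].path  = argv[i + 2];
            results[i].score = score_file(argv[i + 2], argv[1]);
        }

        qsort(results, (size_t)nfiles, sizeof *results, by_score_desc);

        for (int i = 0; i < nfiles; i++)
            printf("%8.2f  %s\n", results[i].score, results[i].path);

        free(results);
        return 0;
    }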

Note Size

There seems to be a difference in performance depending on the length of a note. Wikipedia appears to have articles that are on average 400-1000 characters long, so that is a reasonable length to target with my notes.

Here are some notes of varying length:

Words   Name                                 Performance
27000   Statistics All Export.org            quite good actually
  950   Comparison of Markdown Editors
  500   cosine similarity
  100   Create node vertex graphs in latex

Could this be because the search query needs to be closer to the note in length?

Actually, by looking at the number of meaningful words in longer documents, it becomes clear that the reason my search algorithm isn't useful is the number of words unrelated to the search, particularly in org buffers, where words like PROPERTIES, ID and CUSTOM are dominant.

This can be inspected by doing:

cat AbstractAlgebraNotes.org | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -30

Experiments searching for the most frequent words indicate that the search algorithm works very well in that case, so the solution is definitely to implement some form of TF-IDF weighting to account for words that are not important.
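As a sketch of what that weighting could look like (the exact TF and IDF variants are still to be decided), the idea is tf * log(N / df), so that words appearing in nearly every note, like PROPERTIES or ID, are driven towards zero weight:

    #include <stdio.h>
    #include <math.h>

    /* TF-IDF weight for one term in one document, assuming the counts
     * have already been gathered elsewhere:
     *   term_count - occurrences of the term in this document
     *   doc_length - total terms in this document
     *   doc_freq   - number of documents containing the term
     *   total_docs - number of documents in the collection
     * A term present in almost every document has doc_freq close to
     * total_docs, so log() pushes its weight towards zero. */
    static double tf_idf(unsigned term_count, unsigned doc_length,
                         unsigned doc_freq, unsigned total_docs)
    {
        if (term_count == 0 || doc_freq == 0 || doc_length == 0)
            return 0.0;
        double tf  = (double)term_count / (double)doc_length;
        double idf = log((double)total_docs / (double)doc_freq);
        return tf * idf;
    }

    int main(void)
    {
        /* same in-document frequency, very different collection frequency */
        printf("common word: %f\n", tf_idf(10, 500, 98, 100));
        printf("rare word:   %f\n", tf_idf(10, 500, 3, 100));
        return 0;
    }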

This might be hard in C; perhaps I should consider Go?

From experimentation, there are about 50,000 unique 4-tuples in a document, and this is just about the upper bound of what is feasible for this approach, so using 4-tuples is not the ideal solution.

Implement TF-IDF Weighting with C.

Test using norm2 scaling instead of norm1.
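For reference, a small sketch of the two scalings applied to a weight vector: norm1 divides by the sum of absolute values, norm2 by the Euclidean length (compile with -lm):

    #include <stdio.h>
    #include <math.h>

    /* Scale a weight vector so its L1 norm (sum of absolute values) is 1. */
    static void scale_norm1(double *v, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += fabs(v[i]);
        if (sum > 0.0)
            for (int i = 0; i < n; i++) v[i] /= sum;
    }

    /* Scale a weight vector so its L2 norm (Euclidean length) is 1. */
    static void scale_norm2(double *v, int n)
    {
        double sumsq = 0.0;
        for (int i = 0; i < n; i++) sumsq += v[i] * v[i];
        if (sumsq > 0.0) {
            double len = sqrt(sumsq);
            for (int i = 0; i < n; i++) v[i] /= len;
        }
    }

    int main(void)
    {
        double a[] = { 3.0, 4.0 };
        double b[] = { 3.0, 4.0 };
        scale_norm1(a, 2);
        scale_norm2(b, 2);
        printf("norm1: %.3f %.3f\n", a[0], a[1]);  /* 0.429 0.571 */
        printf("norm2: %.3f %.3f\n", b[0], b[1]);  /* 0.600 0.800 */
        return 0;
    }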

I wonder if this could be wired in with org-roam somehow?

Large Array issues

A large array cannot be allocated on the stack (see here); this means that large arrays need to be declared outside of main or as static. Static arrays cannot be given a variable size (see here), so we simply need to decide in advance the upper limit on the number of files that will be tolerated.
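A minimal sketch of that workaround, with a hypothetical MAX_FILES limit chosen in advance:

    #include <stdio.h>

    /* Decided in advance: the upper limit of files that will be tolerated.
     * The value here is only illustrative. */
    #define MAX_FILES 100000

    /* Declared at file scope (static storage), so it does not live on the
     * stack and cannot overflow it the way a large local array would. */
    static double scores[MAX_FILES];

    int main(void)
    {
        int nfiles = 3;  /* would come from the actual file list */
        if (nfiles > MAX_FILES) {
            fprintf(stderr, "too many files (limit %d)\n", MAX_FILES);
            return 1;
        }
        for (int i = 0; i < nfiles; i++)
            scores[i] = 0.0;
        printf("room for up to %d files\n", MAX_FILES);
        return 0;
    }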