Searching text with C
How it works
Pull out all the ordered pair of words, then use TF-IDF weighting on those, this way there’s no stemming or anything to worry about.
Why
It means that a search query will not need to contain exact words
when conducting a search unlike Recoll. Unlike grep
however this
approach will also work over multiple lines whereas grep only finds
matches, there is no logic for handling multiple terms,
e.g. .*damp.*
, .*reson.*
could be used one after the other with a
pipe |
, but in that case the absence of one will remove a candidate
rather than merely penalize a term.
This approach aims to be a compromise between the two.
Not needing an index would be really cool as well, but optionally having one would be really convenient to speed things up.
Performance Considerations
With python calling the program takes 1.7 times longer than it takes to run, there are overheads to calling a program, this will be far far less for C, but, I should incorporate the loop and the sorting inside C for the best performance.
C | Python | |
---|---|---|
Execution Time | 1*10^-3 | 0.3 |
Evaluation Time | 4*10^-5 | 0.5 |
Diff | 1 ms | 0.2 s |
time echo 'some words here' | ./search.py 'some words'
Note Size
There seems to be a difference in performance given the length of a note, given that Wikipedia appears to have articles that are on average 400-1000 characters long, that is what is reasonable to target with my notes.
Here are some notes of varying length:
Words | Name | Performance |
---|---|---|
27000 | Statistics All Export.org | quite good actually |
950 | Comparison of Markdown Editors | |
500 | cosine similarity | |
100 | Create node vertex graphs in latex |
This could be because the search query needs to be closer in length?
Actually by looking at the number of words that are meaningful in
longer documents, it becomes clear that the reason my searching
algorithm isn’t useful is because of the amount of words not related
to the search, particularly in org
buffers where words like PROPERTIES
and ID
and CUSTOM
are dominant.
this can be inspected by doing:
cat AbstractAlgebraNotes.org | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -30
Experiments searching for the most frequent words indicates that the search algorithm works very well in that case, so the solution to this is definitely to implement some form of TF-IDF weighting to account for words that are not important.
This might be hard in C, perhaps I should consider Go?
From experimentation, there are about 50, 000 unique 4-tuples in a document, and this just about represents the upper bounds of what is feasible for this approach, so, using 4-tuples, is not the ideal solution.
Implement TF-IDF Weighting with C.
Test using norm2 scaling instead of norm1.
I wonder if this could be wired in with org-roam somehow?
Large Array issues
A large array cannot be allocated on the heap (see here) this means that large arrays need to be declared outside of main or as static. static arrays cannot be given a variable size (see here) so this means we simply need to decide in advance the upper limit of files that will be tolerated.