Figure out a way to destem terms more quickly
Closed this issue · 1 comment
dmarklein commented
With a large document set, the current destemming process takes a LONG time. There has to be a more efficient algorithm.
dmarklein commented
Some ideas:
(1) Search only until one version of the stemmed term stands out, i.e. until it accounts for a certain percentage of the matches once we have reviewed a certain amount of our body of text.
    Note: I could use a list comprehension to find all matches in a given doc, instead of explicitly iterating over each term in the doc.
(2) Find some module/library that is really fast at searching lists... I'm still searching.
(3) Use threading -- I don't think this is a good idea.
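
A rough sketch of idea (1), in case it helps pin down what "stands out" could mean. Everything named here is an assumption for illustration: `toy_stem` is a hypothetical stand-in for whatever stemmer the project really uses, `destem`, `min_docs` and `dominance` are made-up names, and "a certain amount of our body of text" is interpreted as a minimum number of docs reviewed.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for the real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def destem(stemmed_term, docs, stem=toy_stem, min_docs=2, dominance=0.6):
    """Return the most common surface form of stemmed_term in docs.

    docs is an iterable of token lists. The search stops early once at
    least min_docs docs have been reviewed and a single form accounts for
    at least `dominance` of all matches seen so far (assumed thresholds).
    """
    counts = Counter()
    for seen, doc in enumerate(docs, start=1):
        # List comprehension collects every match in this doc in one go.
        counts.update([w for w in doc if stem(w) == stemmed_term])
        total = sum(counts.values())
        if seen >= min_docs and total:
            form, n = counts.most_common(1)[0]
            if n / total >= dominance:
                return form  # a clear winner has emerged, stop searching
    # No early winner: fall back to the most common form seen, if any.
    return counts.most_common(1)[0][0] if counts else stemmed_term


docs = [
    ["walking", "walked", "walking"],
    ["walks", "walking"],
    ["talked", "walked", "walked", "walked"],
]
best = destem("walk", docs, min_docs=2, dominance=0.5)
```

With these toy docs the search stops after the second doc and never reads the third, which is where the time saving would come from on a large document set. The trade-off is that an early winner can differ from the true overall winner, so the thresholds would need tuning against real data.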