Implement multiprocessing speedup for `suggest` in absence of python-Levenshtein
bskinn opened this issue · 7 comments
NOTE: With the deprecation of the `python-Levenshtein` speedup for the `suggest` functionality (see #211 & #218), identifying other methods to increase performance is a priority. This multiprocessing-based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.
Very early POC on suggest-multiproc branch.
Suggests some speedup is possible, but it's not a slam dunk that it's worth it, given the sub-2s processing times for most inventories. Properly exploiting `difflib.SequenceMatcher`'s caching behavior may change this, however.
For comparison, python-Levenshtein is a better speedup for less internal complexity, and doesn't appear to benefit at all from multiproc.
Latitude laptop (4 cores), yt (17420 objects)
Quick, rough timing:
>>> import sphobjinv as soi
>>> import timeit
>>> timeit.timeit("soi.Inventory('tests/resource/objects_yt.inv').suggest('ndarray')", globals=globals(), number=1)
NO mp, WITH lev: 2.177 s
WITH mp(4), WITH lev: 2.396 s
WITH mp(6), WITH lev: 2.355 s
NO mp, NO lev: 11.795 s
WITH mp(2), NO lev: 10.361 s
WITH mp(3), NO lev: 8.471 s
WITH mp(4), NO lev: 7.583 s (~35% speedup)
WITH mp(5), NO lev: 7.399 s (oversubscribed)
WITH mp(6), NO lev: 8.372 s (oversubscribed)
Notes
- Probably can switch to `pool.map()`, without the context manager
- Looks like using `multiprocessing.cpu_count()` would be a reasonable default pool size
- Want user to be able to set max of process count (probably refuse to oversubscribe??)
  - API -- arg to `Inventory.suggest()`, likely
  - CLI -- new parameter for `suggest` subparser
- If `nproc == 1` then skip `multiprocessing` entirely
- Probably need to check at `sphobjinv` import-time whether `multiprocessing` is available
  - Doing that in a way that doesn't further slow down the import time may be tricky
- Per the `difflib` docs, implementing a bespoke scoring function directly with `difflib.SequenceMatcher` may allow significant speed gains, due to the ability to cache the `suggest` search term
- For v2.3, perhaps keep the current single-process fuzzywuzzy method as the default, but implement with a feature flag to switch to the mp-accelerated version?
- A naive, direct use of `difflib.SequenceMatcher` does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default, `WRatio`-based partial-matcher in `fuzzywuzzy`
  - Going to want to port this fuzzywuzzy method directly into the codebase, adapted to allow for multiprocess, I think.
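A rough sketch of how several of the notes above could fit together: a pool sized from `multiprocessing.cpu_count()`, no oversubscription, a bypass when `nproc == 1`, and one `SequenceMatcher` per chunk so the search term is cached via `set_seq2()`. The names `score_chunk`, `split_chunks` and `suggest_scores` are hypothetical and not part of sphobjinv, and the plain `ratio()` score is only a stand-in for the real scorer.

```python
import difflib
import multiprocessing as mp
from functools import partial


def score_chunk(search, names):
    """Score every name in `names` against `search`; returns (name, score) pairs."""
    matcher = difflib.SequenceMatcher(autojunk=False)
    # difflib caches its analysis of the *second* sequence, so set the
    # search term once and only swap the first sequence in the loop.
    matcher.set_seq2(search)
    results = []
    for name in names:
        matcher.set_seq1(name)
        results.append((name, round(100 * matcher.ratio())))
    return results


def split_chunks(items, n):
    """Split `items` into at most `n` contiguous, roughly equal chunks."""
    n = max(1, min(n, len(items)))
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def suggest_scores(search, names, nproc=None):
    """Score `names` against `search`, in parallel when nproc > 1."""
    cpu = mp.cpu_count()
    # Default to one worker per core; never oversubscribe (an assumption
    # taken from the note above, not a settled design decision).
    nproc = cpu if nproc is None else max(1, min(nproc, cpu))
    if nproc == 1:
        # Skip multiprocessing entirely for the single-process case.
        return score_chunk(search, names)
    with mp.Pool(nproc) as pool:
        parts = pool.map(partial(score_chunk, search), split_chunks(names, nproc))
    # pool.map preserves chunk order, so flattening keeps the input order.
    return [pair for part in parts for pair in part]
```

On platforms that start workers with `spawn` (Windows, recent macOS), a call with `nproc > 1` has to sit under an `if __name__ == "__main__":` guard.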
If I'm going to have to re-implement `fuzzywuzzy`'s `WRatio` as part of this, I might as well do it in a way that exposes the internal weighting values inside `WRatio`, so that if someone wanted to they could adjust those values. Have to think about whether there's a way to do that ~cleanly in a plugin system, though...
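To make the "exposed weights" idea concrete, here is a minimal sketch of a weighted scorer whose tuning values live in a small config object instead of being hard-coded. This is not a port of `WRatio`: it is a much-simplified, `difflib`-only stand-in, and `WeightConfig`, `weighted_ratio` and the default numbers are all hypothetical, chosen purely for illustration.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightConfig:
    """Hypothetical tuning knobs; default values are illustrative only."""

    partial_scale: float = 0.9  # down-weighting applied to partial matches
    partial_min_len_ratio: float = 1.5  # length ratio at which partials are tried


def simple_ratio(a, b):
    """Plain difflib similarity on a 0-100 scale."""
    return 100 * difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def partial_ratio(a, b):
    """Best simple_ratio of the shorter string against same-length windows of the longer."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    width = len(short)
    return max(
        simple_ratio(short, long_[i : i + width])
        for i in range(len(long_) - width + 1)
    )


def weighted_ratio(a, b, weights=WeightConfig()):
    """Combine full and partial ratios using the caller-adjustable weights."""
    if not a or not b:
        return 0
    base = simple_ratio(a, b)
    len_ratio = max(len(a), len(b)) / min(len(a), len(b))
    if len_ratio < weights.partial_min_len_ratio:
        # Strings of similar length: the full ratio is used on its own.
        return round(base)
    return round(max(base, weights.partial_scale * partial_ratio(a, b)))
```

A caller who disagrees with the defaults can then pass, e.g., `weights=WeightConfig(partial_scale=0.5)` without touching the scorer itself.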
The scorer should make its own internal decision about whether to use multiprocessing, but `suggest` should have an option to coerce single-process or a specific `nproc` if the user prefers, or if the multi-detector is making a bad determination.
The scorer should have a fully inspectable/introspectable API, with a defined interface for how `suggest` will call it. The inputs should be the most general and information-rich possible, which is probably the search term and the inventory itself!
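One possible shape for that decision logic, as a sketch only: the scorer picks a worker count from the inventory size, and an explicit user value overrides the heuristic. `resolve_nproc` and the `min_objects_for_multi` cutoff are hypothetical; the threshold value is a placeholder, not something measured.

```python
import multiprocessing


def resolve_nproc(n_objects, nproc=None, min_objects_for_multi=5000):
    """Return the number of worker processes to use for one suggest() call."""
    cpu = multiprocessing.cpu_count()
    if nproc is not None:
        # Explicit user choice wins over the heuristic, clamped to [1, cpu].
        return max(1, min(nproc, cpu))
    if n_objects < min_objects_for_multi:
        # Small inventory: process start-up cost likely outweighs any gain.
        return 1
    return cpu
```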
Provide index and score flags, too? Threshold? I could see a scorer knowing enough about its own properties to be able to make a quick first pass and discard sufficiently poor matches. Might as well provide all the information possible. Have to set this up so that it's easy to add more information if something new comes up.
`SuggestPayload` object with `Inventory` and suggest parameters (index, score, thresh... others?) passed to scorer, along with an `extra_kwargs` dict of additional arguments that can adjust the scorer behavior.
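A minimal sketch of what such a payload and the scorer call could look like. Everything here is an assumption for discussion: the field names, the use of a plain list of names as a stand-in for a real `Inventory`, and the helper names `substring_scorer` and `run_suggest`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Sequence


@dataclass(frozen=True)
class SuggestPayload:
    """Hypothetical bundle of everything a scorer might want to know."""

    search: str
    inventory: Sequence[str]  # stand-in: a real payload would carry the Inventory
    thresh: int = 50
    with_index: bool = False
    with_score: bool = False
    extra_kwargs: Dict[Any, Any] = field(default_factory=dict)


def substring_scorer(payload):
    """Toy scorer: 100 if the search term occurs in the name, else 0."""
    return [100 if payload.search in name else 0 for name in payload.inventory]


def run_suggest(payload, scorer):
    """Call the scorer, apply the threshold, and shape results per the flags."""
    scores = scorer(payload)
    hits = [
        (name, score, idx)
        for idx, (name, score) in enumerate(zip(payload.inventory, scores))
        if score >= payload.thresh
    ]
    hits.sort(key=lambda hit: hit[1], reverse=True)  # best score first
    results = []
    for name, score, idx in hits:
        item = (name,)
        if payload.with_score:
            item += (score,)
        if payload.with_index:
            item += (idx,)
        results.append(item if len(item) > 1 else name)
    return results
```

Adding a new piece of information later then means adding a field with a default, which leaves existing scorers working unchanged.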
Best practice... recommend that any scorer, builtin or plugged, define `Enum`s for use as the keys in `extra_kwargs`?
As part of the new implementation of the multiprocessing-enabled `WRatio` scorer, perhaps add an option to 'devalue' object match scores if there's no substring match?
E.g., if `search not in rst`, then adjust score as
Either way, keep the legacy scorer, available under the `legacy` id... and call the new one `default`, probably.
Can consider using https://pypi.org/project/editdistance/ as a new speedup extra, integrating into the new `default` scorer.
Or --- if it's fast enough with the `editdistance` speedup, may be able to avoid writing the multiprocessing-accelerated scorer entirely.
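A sketch of how an optional `editdistance` extra might be wired in: use the compiled `editdistance.eval()` when the package is installed, and fall back to a pure-Python Levenshtein otherwise. The `distance` and `similarity` helpers are hypothetical, and the 0-100 normalization is just one simple choice, not what fuzzywuzzy computes.

```python
try:
    import editdistance  # optional third-party speedup

    def distance(a, b):
        """Levenshtein distance via the compiled extension."""
        return editdistance.eval(a, b)

except ImportError:

    def distance(a, b):
        """Pure-Python Levenshtein distance (row-by-row dynamic programming)."""
        prev = list(range(len(b) + 1))
        for i, char_a in enumerate(a, 1):
            curr = [i]
            for j, char_b in enumerate(b, 1):
                curr.append(
                    min(
                        prev[j] + 1,  # deletion
                        curr[j - 1] + 1,  # insertion
                        prev[j - 1] + (char_a != char_b),  # substitution
                    )
                )
            prev = curr
        return prev[-1]


def similarity(a, b):
    """Map edit distance onto a 0-100 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - distance(a, b) / longest))
```

Whether this is fast enough to drop the multiprocessing scorer would have to be timed against the same yt inventory used above.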