seatgeek/fuzzywuzzy

Performance Optimization - Fail Fast

ifitzsimmons opened this issue · 1 comments

I apologize if this feature is already available, but I was wondering if there is currently a way to pass an argument that represents some threshold into one of the ratio methods.

I have an extremely large dataset in which I need to calculate the distance between the entries of two very long lists, and the strings themselves are relatively long.

To increase performance, I'd like to be able to short-circuit the distance calculation once there are enough differences to drop the ratio below, say, 60%, and just return 0 or None.

It's not quite clear whether that's what the cutoff keyword argument does, or whether that parameter is more of a filter applied after the fact.

Again, apologies if this is already a thing.

You're right: in FuzzyWuzzy, score_cutoff is just a filter, so it calculates the results and filters them afterwards.
RapidFuzz uses exactly the behaviour you describe to improve performance. (Besides this, it has some other improvements to the algorithm and is implemented fully in C++, so it is a lot faster.)
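
To illustrate the fail-fast idea, here is a minimal pure-Python sketch. It is not the code of either library: the function name `ratio_with_cutoff` is hypothetical, and it assumes a simple normalized Levenshtein score (0 to 100), which is not necessarily the exact ratio formula FuzzyWuzzy or RapidFuzz use. The point is only to show how a cutoff can stop the computation early instead of filtering afterwards.

```python
def ratio_with_cutoff(a: str, b: str, score_cutoff: float = 0.0) -> float:
    """Normalized Levenshtein similarity in [0, 100].

    Returns 0.0 as soon as it is certain the score cannot reach score_cutoff.
    Assumption: score = 100 * (max_len - distance) / max_len.
    """
    if not a and not b:
        return 100.0

    max_len = max(len(a), len(b))
    # Largest edit distance that can still reach the cutoff.
    max_dist = int(max_len * (100 - score_cutoff) / 100)

    # The length difference is a lower bound on the edit distance,
    # so some pairs can be rejected without any DP work at all.
    if abs(len(a) - len(b)) > max_dist:
        return 0.0

    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        # The row minimum never decreases in later rows, so once it
        # exceeds max_dist the final distance must exceed it as well.
        if min(cur) > max_dist:
            return 0.0
        prev = cur

    score = 100.0 * (max_len - prev[-1]) / max_len
    return score if score >= score_cutoff else 0.0


if __name__ == "__main__":
    print(ratio_with_cutoff("kitten", "sitting"))                   # no cutoff
    print(ratio_with_cutoff("kitten", "sitting", score_cutoff=60))  # rejected
```

With long strings and a high cutoff, most non-matching pairs are rejected after the length check or the first few rows, which is where the time saving comes from.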