qurator-spk/dinglehopper

Improve performance when calculating sequence alignment

Closed this issue · 14 comments

b2m commented

Dinglehopper is using a custom Python implementation of the Levenshtein distance to calculate, score and show an alignment of two given texts.

According to my performance analysis done for #47, the distance and editops functions of this custom implementation are the main bottleneck when comparing especially bad or big OCR results.

In #48 I proposed using the C-based python-Levenshtein as a replacement, which we discarded for the following reasons:

  1. No support for aligning sequences of words (see comment by @mikegerber).
  2. Currently no active maintenance.
  3. Viral license (GPL 2).

One alternative and fast implementation for distance calculation is RapidFuzz, where @maxbachmann already started to address the issue of distance calculation for arbitrary sequences in rapidfuzz/RapidFuzz#100.

At the moment RapidFuzz does not support the calculation of edit operations (see comment by @maxbachmann).

Just to be thorough:

> 1. No support for aligning sequences of words (see [comment by @mikegerber](https://github.com/qurator-spk/dinglehopper/pull/48#issuecomment-818021820)).

No support for arbitrary data, I would say. Technically, we align¹/measure distance between a. sequences of grapheme clusters (which is not always the same as strings) and b. sequences of words.

¹ I would consider aligning as equivalent to "editops"

Technically I believe any sequence alignment implementation would do if it a. supports substitutions and b. is optimal in the sense that it gives a shortest possible alignment. If we have that, I believe we could simply count the operations of the alignment for the edit distance calculation.
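
For illustration, a minimal sketch of that counting idea (my own sketch with made-up helper names, not dinglehopper's actual code): given an optimal alignment as pairs of elements, the edit distance is just the number of non-matching pairs, and the CER follows by dividing by the ground-truth length.

```python
# Hypothetical helpers, not dinglehopper's API. An alignment is given as pairs
# of (gt, ocr) elements, where None marks a gap (insertion or deletion).
def distance_from_alignment(aligned_pairs):
    # Every non-matching pair is one edit operation of the optimal alignment.
    return sum(1 for gt, ocr in aligned_pairs if gt != ocr)

def character_error_rate(aligned_pairs, gt_length):
    return distance_from_alignment(aligned_pairs) / gt_length

# "kitten" vs "sittin", aligned character by character: two substitutions.
pairs = list(zip("kitten", "sittin"))
print(distance_from_alignment(pairs))   # 2
print(character_error_rate(pairs, 6))   # 0.333...
```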

However, I can imagine dropping the alignment requirement - if the user chooses to do so - to improve the speed of calculating the metrics, as there are implementations that can do that calculation more efficiently (in speed and memory) without providing the alignment. In my experience it is however incredibly important to take a manual look at the alignment (= the differences report) to get a sense of the problems, and this is even more important with larger texts. (As I understand it, this could be done with RapidFuzz already because of its optimizations.)

The second thing I can imagine is considering more efficient sequence alignment algorithms that can handle larger texts more easily. This is more problematic, as this is an evaluation tool. I deliberately chose an easily auditable textbook implementation of an optimal algorithm (optimal in the sense of shortest alignment = shortest edit distance) and I would hate to introduce vagueness in the sense of "This could be the CER, but we don't know it for sure because of worst-case behaviour of the algorithm". But maybe there is no way around this.

For now, the best short term outcome would be if we get a "reasonable" performance for real world data I've seen. @cneud reported hours of runtime and 40 GB of memory usage for a newspaper page (a lot of text!) - That is certainly not "reasonable". I'll see if the license of the data allows uploading it. (The memory usage comes from 68188 characters of GT text, assuming roughly the same size for the OCR text and 64 bits to store an element of the Levenshtein matrix gives 68188x68188x8/(1024x1024x1024) =~ 34 GB of memory usage...)

Another possibility: efficiently calculate an optimal(!) CER, but have the alignment calculated by some more efficient algorithm (not optimal, but fast) - with a warning for the user that the alignment does not necessarily correspond exactly to the CER.

> (The memory usage comes from 68188 characters of GT text, assuming roughly the same size for the OCR text and 64 bits to store an element of the Levenshtein matrix gives 68188x68188x8/(1024x1024x1024) =~ 34 GB of memory usage...)

No editops required

For any texts where the editops are not required, the Levenshtein distance can be computed using only linear memory.
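
For illustration, a minimal sketch of the textbook two-row variant (my own sketch, not RapidFuzz's implementation), which keeps only the previous and the current row instead of the full matrix:

```python
def levenshtein_distance(s1, s2):
    # O(len(s1) * len(s2)) time, but only two rows of memory.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("kitten", "sitting"))  # 3
```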

Performance more important than memory usage

For cases where the Levenshtein matrix is actually required, the element type could be chosen based upon max(len1, len2):

  • <= 255 -> uint8_t
  • <= 65535 -> uint16_t
  • <= 4294967295 -> uint32_t
  • else uint64_t

Most PCs will not even have enough memory to ever require uint64_t. E.g. in your example this would reduce the memory requirement by 50%.
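
A rough sketch of that selection logic in Python/NumPy terms (just to illustrate the idea, not how RapidFuzz actually implements it):

```python
import numpy as np

def matrix_dtype(len1, len2):
    # The largest value that can occur in the Levenshtein matrix is
    # max(len1, len2), so it determines the smallest sufficient integer type.
    m = max(len1, len2)
    if m <= 255:
        return np.uint8
    if m <= 65535:
        return np.uint16
    if m <= 4294967295:
        return np.uint32
    return np.uint64

# For two ~68188-character pages, uint32 suffices and halves the 34 GB estimate.
print(matrix_dtype(68188, 68188))  # <class 'numpy.uint32'>
```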

Memory usage more important than performance

It would be possible to be even more aggressive by creating a specialized container that stores submatrices using the smaller data types wherever they suffice. This could reduce the requirement for a 68188x68188 matrix to:

  • (68188*68188 - 65535*65535) * 4 + (65535*65535 - 255*255) * 2 + 255*255 ≈ 10 GB

This third approach could be selected either by an argument, or when the malloc for the faster implementation fails.
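
For reference, a small helper reproducing the arithmetic above (my own estimation sketch, not RapidFuzz code):

```python
def full_matrix_bytes(n, element_size):
    return n * n * element_size

def mixed_matrix_bytes(n):
    # Top-left 255x255 region: values fit in 1 byte; the 65535x65535 region
    # (minus the 255 part): 2 bytes; everything else: 4 bytes per cell.
    n8, n16 = min(n, 255), min(n, 65535)
    return (n * n - n16 * n16) * 4 + (n16 * n16 - n8 * n8) * 2 + n8 * n8

n = 68188
print(full_matrix_bytes(n, 8) / 1024**3)  # ~34.6 (uint64_t everywhere)
print(full_matrix_bytes(n, 4) / 1024**3)  # ~17.3 (uint32_t everywhere)
print(mixed_matrix_bytes(n) / 1024**3)    # ~9.3  (mixed submatrices, ~10 GB)
```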

Known upper threshold

When an upper threshold k for the Levenshtein distance is defined, this can be used to reduce the memory usage in a similar way to the max(len1, len2) approach above. It can also be used to shrink the matrix to min(len1, len2) * (k * 2 + 1) entries in case that is smaller than len1 * len2 (see https://www.sciencedirect.com/science/article/pii/S0019995885800462).
So e.g. for a threshold k=255 and len1=len2=68188 this would require around 68188 * (255 * 2 + 1) ≈ 35 million entries, i.e. roughly 35 MB at one byte per entry.
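
A minimal sketch of the thresholded idea, for the distance only (my own illustration, not RapidFuzz's implementation): cells further than k off the diagonal can never contribute to a result <= k, so only a band of 2k + 1 cells per row needs to be computed.

```python
def bounded_levenshtein(s1, s2, k):
    # Returns the Levenshtein distance if it is <= k, otherwise k + 1.
    if abs(len(s1) - len(s2)) > k:
        return k + 1
    previous = {j: j for j in range(min(len(s2), k) + 1)}
    for i in range(1, len(s1) + 1):
        lo, hi = max(1, i - k), min(len(s2), i + k)
        current = {0: i} if i <= k else {}
        for j in range(lo, hi + 1):
            current[j] = min(
                previous.get(j - 1, k + 1) + (s1[i - 1] != s2[j - 1]),  # sub
                current.get(j - 1, k + 1) + 1,                          # ins
                previous.get(j, k + 1) + 1,                             # del
            )
        previous = current
    return min(previous.get(len(s2), k + 1), k + 1)

print(bounded_levenshtein("kitten", "sitting", 3))  # 3
print(bounded_levenshtein("kitten", "sitting", 1))  # 2, i.e. "more than 1"
```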

Editops but not whole Levenshtein matrix required

It is possible to remove the common prefix and suffix of two strings before calculating the Levenshtein distance. As far as I could see you are often matching big documents which only have small differences, so this could significantly reduce the memory usage (and the runtime, since removing the common parts is only an O(N) operation, while it is pretty much free when there is no common affix). As far as I know this is done by python-Levenshtein as well.
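
A sketch of that trimming step (my own illustration; the returned prefix length is what has to be added back to the editops positions afterwards):

```python
def trim_common_affix(s1, s2):
    # Strip the common prefix and suffix; return the differing middle parts
    # plus the prefix length, which offsets any editops positions later on.
    prefix = 0
    while prefix < len(s1) and prefix < len(s2) and s1[prefix] == s2[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < len(s1) - prefix and suffix < len(s2) - prefix
           and s1[len(s1) - 1 - suffix] == s2[len(s2) - 1 - suffix]):
        suffix += 1
    return s1[prefix:len(s1) - suffix], s2[prefix:len(s2) - suffix], prefix

print(trim_common_affix("Lorem ipsum dolor", "Lorem ipsvm dolor"))
# ('u', 'v', 9) -- only one character pair is left to align
```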

> At the moment RapidFuzz does not support the calculation of edit operations (see comment by @maxbachmann).

@mikegerber @b2m In the Levenshtein matrix there are often multiple optimal paths, while editops will just return one. Is there any reason to prefer one path over another? The simplest way would be:

if (diagonal == current)      -> no operation
if (diagonal == current - 1)  -> Substitution
if (left == current - 1)      -> Insertion
if (above == current - 1)     -> Deletion

python-Levenshtein extends this slightly by preferring multiple similar operations in a row:

if (last_operation == Insertion and left == current - 1)  -> Insertion
if (last_operation == Deletion and above == current - 1)  -> Deletion

if (diagonal == current)      -> no operation
if (diagonal == current - 1)  -> Substitution
if (left == current - 1)      -> Insertion
if (above == current - 1)     -> Deletion

I guess the reason for this preference is that python-Levenshtein allows the conversion into opcodes, which always specify a range for which an operation is used: multiple similar operations in a row can be combined into a single range, which results in a shorter list of opcodes.

Is there any reason to prefer any of the paths?
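
For illustration, a sketch of such a backtrace in Python (my own sketch, not how RapidFuzz or python-Levenshtein implement it; the only addition to the rules above is a character-equality check before taking a diagonal step as "no operation"):

```python
def editops(s1, s2):
    # Build the full Levenshtein matrix (fine for short strings only).
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),
                          d[i][j - 1] + 1,
                          d[i - 1][j] + 1)
    # Backtrace one optimal path, preferring: match, substitution, insertion,
    # deletion. Checking the branches in another order yields another path.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] and d[i - 1][j - 1] == d[i][j]:
            i, j = i - 1, j - 1                      # no operation
        elif i > 0 and j > 0 and d[i - 1][j - 1] == d[i][j] - 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j - 1] == d[i][j] - 1:
            ops.append(("insert", i, j - 1))
            j -= 1
        else:
            ops.append(("delete", i - 1, j))
            i -= 1
    return list(reversed(ops))

print(editops("qabxcd", "abycdf"))
# [('delete', 0, 0), ('replace', 3, 2), ('insert', 6, 5)]
```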

> @maxbachmann already started to address the issue of the distance calculation for arbitrary sequences in rapidfuzz/RapidFuzz#100.

This is now supported in the next release (implemented, just not released yet).

>>> from rapidfuzz import string_metric
>>> string_metric.levenshtein(["aaaa", "bbbb"], ["aaaa", "cccc"])
1

Note that this operates on the hashes of the strings, so this is the same as:

>>> from rapidfuzz import string_metric
>>> string_metric.levenshtein([hash("aaaa"), hash("bbbb")], [hash("aaaa"), hash("cccc")])
1

So when two strings produce the same hash value, they will be considered equivalent.

b2m commented

@maxbachmann I have not checked the python-Levenshtein original code, but the last_operation seems like a performance tweak.

Using hashing to be able to compare "arbitrary" sequences sounds sensible to me. Yes, there will be collisions, and there are detailed StackOverflow questions/answers on when this happens.

From a Python programmer's perspective I would worry about comparing sequences with objects that do not implement a proper __hash__ method.

> I have not checked the python-Levenshtein original code, but the last_operation seems like a performance tweak.

In which cases would this help performance? My guess was that it has something to do with these weird conversions in python-Levenshtein from editops to opcodes and from editops to matching_blocks to make them as similar as possible to the results from difflib (even though at least matching_blocks is completely broken and will not return the correct results in many cases).

I will probably add another function to get the Levenshtein matrix in a space optimized way as well using some of the methods described above.

> From a Python programmer's perspective I would worry about comparing sequences with objects that do not implement a proper __hash__ method.

The only alternative would be using __eq__. However:

  1. This would not work with all algorithms (many of the implementations rely on the elements being compared to be numbers)
  2. even algorithms which could use __eq__, like the normal O(N*M) Levenshtein implementation, would become painfully slow, since __eq__ is
  • a Python call which adds tons of overhead
  • probably O(N) which causes the algorithm to have cubic runtime

Note that, as a special case, I directly use the corresponding Unicode code point for strings with a length of 1, so the following works:

>>> from rapidfuzz import string_metric
>>> string_metric.levenshtein("aaaa", ["a", "a", "a", "a"])
0
>>> string_metric.levenshtein("aaaa", [ord("a"), ord("a"), ord("a"), ord("a")])
0

@b2m I released v1.5.0 of rapidfuzz yesterday, which does now support all sequences of hashable objects and editops:

>>> from rapidfuzz.string_metric import levenshtein_editops
>>> for tag, src_pos, dest_pos in levenshtein_editops("qabxcd", "abycdf"):
...    print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
 delete s1[1] s2[0]
replace s1[4] s2[3]
 insert s1[6] s2[6]
b2m commented

> I have not checked the python-Levenshtein original code, but the last_operation seems like a performance tweak.

Addition: a performance tweak in the sense of determining ranges of inserts or deletes while determining the type of edit operation. The alternative would be a second pass through all the edit operations.

> I released v1.5.0 of rapidfuzz yesterday

Thx for the heads up... should go to @mikegerber as he is the maintainer of this library.

> support all sequences of hashable objects and editops

I like the simplicity of the interface and the returned data!

Awesome, I have been on vacation and I will check this out soon!

I did some experiments with RapidFuzz and noticed a remaining problem with the positions returned by RapidFuzz's editops(). I had some similar problems with my own implementation - in fact they are still there - and so I investigated it a bit. I think python-Levenshtein's positions make a lot of sense and this should be fixed; I opened an issue at rapidfuzz/RapidFuzz#148.

Other than that: the new functions in RapidFuzz are exactly what we need in dinglehopper and we will use them 🥂

@maxbachmann was quick: PR rapidfuzz/rapidfuzz-cpp#53 fixes the problem.

I just switched dinglehopper over to using RapidFuzz. Wow, 50x faster!

I still want to do some profiling (duplicate conversions? duplicate/cacheable calls?) and more benchmarking but in essence this issue is fixed in the current version (= master).

This is done, and with #72 (also in milestone 1.0) the remaining issue (duplicate calls) should be done, too.