Horrible failure with large documents
mikegerber opened this issue · 23 comments
@stweil reported in Gitter:
Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but runs out of memory: it was killed by the Linux kernel after using more than 60 GB RAM. :-(
@cneud also submitted a large document (a newspaper page).
- Investigate why the new version uses even more memory
- Consider falling back to more efficient algorithms if necessary
- Consider a regression test for this
I've asked @stweil to submit the texts, as I am curious why the new version uses even more memory. I'll add them here soon.
I used this command (use links to get texts and result):
dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630
Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?
@mikegerber I am not familiar with dinglehopper, but I assume the editops calculation requires quite a lot of memory. For long texts I currently create a Levenshtein matrix of len(s1) * len(s2) * 32 bits. Since the text files both have around 110k characters, this alone should use around 45 GB of memory.
The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This had not been a big priority so far (mostly because nobody complained), but if that's the cause I could work on this.
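For a rough sense of scale, the figure follows directly from the matrix dimensions. A back-of-the-envelope check, using the approximate character counts mentioned above:

```python
# Memory of a full Levenshtein matrix with one 32-bit integer per cell,
# for two texts of roughly 110k characters each (figures taken from the
# comment above, so only an estimate).
m = n = 110_000
cell_bytes = 4                                     # 32 bits per matrix cell
print(f"{m * n * cell_bytes / 1024**3:.1f} GiB")   # -> ~45.1 GiB
```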
An improved version could make use of bit-parallelism to improve the performance and only store bit vectors of the deltas between the matrix cells, based on https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf.
Performance-wise this would allow the implementation to calculate the matrix cells for 64 characters at once.
Memory-wise this should only require storing the vertical positive delta vector, the horizontal positive delta vector and the diagonal zero delta vector, and therefore only around len(s1) * len(s2) * 3 bits to store the matrix. So in the example above only around 4.2 GB instead of 45 GB.
Additionally, it is possible to ignore parts of the matrix that are not relevant, since they cannot be on the optimal path. In the best case (two strings of mostly similar length, which is the case here) this should allow the implementation to skip 25% of the matrix and further reduce the memory usage to around 3.15 GB.
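The same back-of-the-envelope calculation with three delta bits per cell, plus the optional ~25% pruning mentioned above, reproduces these figures:

```python
# Memory for the three delta bit-vectors (3 bits per matrix cell) instead of
# 32-bit integers, same ~110k-character texts as above (estimate only).
m = n = 110_000
full = m * n * 3 / 8 / 1024**3     # 3 bits per cell, in GiB
pruned = full * 0.75               # best case: ~25% of the cells can be skipped
print(f"{full:.2f} GiB full, {pruned:.2f} GiB pruned")  # -> ~4.2 GiB / ~3.2 GiB
```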
> The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This had not been a big priority so far (mostly because nobody complained), but if that's the cause I could work on this.
I will have a look into this in the next days; I have a hunch that it takes twice as much memory because of a memory leak. Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.
Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance I don't see why not to use it.
¹ or one of the alignments with the shortest possible distance, to be more precise
> Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?
I've opened #63 for this!
> I used this command (use links to get texts and result):
> dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630
Mirrored here:
> Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.
Yes, rapidfuzz should free the matrix (and as far as I am aware it does). Note that the CER calculation does not require much memory, since it only has to store the last matrix row. Only the character alignment requires the whole matrix.
> Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance I don't see why not to use it.
I already use this implementation to calculate the edit distance, just not when retrieving the editops (to keep the initial implementation simple).
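For context, the two operations discussed here are separate calls in rapidfuzz's Python API (module layout of current rapidfuzz releases; dinglehopper's own wrapper code may look different, and the example strings are made up):

```python
from rapidfuzz.distance import Levenshtein

gt = "Dieſes iſt ein Beiſpiel"     # made-up ground truth line
ocr = "Dieses ist ein Beispiel"    # made-up OCR output

# Distance (and thus CER) only needs one matrix row's worth of memory ...
dist = Levenshtein.distance(gt, ocr)
cer = dist / len(gt)               # a common CER definition: edit distance / |GT|

# ... while the character alignment (editops) is the memory-hungry part.
ops = Levenshtein.editops(gt, ocr)
print(cer, len(ops))
```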
@stweil Do you know why there are duplicate lines in the texts? It could be another bug in dinglehopper's text extraction. Would it be possible to get the PAGE versions of GT and OCR to check? (If necessary, DM me on Gitter and send it privately if that's a concern.)
The duplicate lines are fine. The texts were produced from single-line text files (not from PAGE / ALTO / hOCR) which we use for fine-tuning of existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.
While it is a problem that dinglehopper has issues with larger texts (i.e. newspaper pages), it is also best to feed it smaller texts when possible, as seems to be the case here. Alignment is O(length(GT) × length(OCR)).
Is this your use case?
- Compute CER and give a visualization for all text lines (line GT vs line OCR)
- Aggregate the CER over all text lines
If so, there are better ways to do it than concatenating the input. I can imagine implementing a special mode to read directories of line texts and summarize the result into one page and one global CER. Could you describe your input before the concatenation? (i.e. "one directory with *.gt.txt line texts and one directory with *.ocr.txt line texts with the same prefix")
Here I have one directory with *.gt.txt lines and several directories with OCR results (*.txt) which were produced with different software / models / process parameters. Each dinglehopper run should compare the GT directory with one of the OCR result directories. And yes, the file prefixes are the same.
But we also have the newspaper page use case.
Yes two problems with two distinct solutions :)
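A minimal sketch of what such a line-directory mode could compute for the use case described above; the helper name and the exact directory layout are hypothetical, and only rapidfuzz's Levenshtein distance is assumed:

```python
from pathlib import Path
from rapidfuzz.distance import Levenshtein

def aggregate_cer(gt_dir: str, ocr_dir: str) -> float:
    """Micro-averaged CER over matching *.gt.txt / *.txt line files
    (hypothetical helper, not part of dinglehopper)."""
    total_dist = total_chars = 0
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        ocr_file = Path(ocr_dir) / gt_file.name.replace(".gt.txt", ".txt")
        gt = gt_file.read_text(encoding="utf-8").strip()
        ocr = ocr_file.read_text(encoding="utf-8").strip()
        total_dist += Levenshtein.distance(gt, ocr)  # per-line edit distance
        total_chars += len(gt)
    return total_dist / total_chars
```

Per-line alignments stay cheap this way, since each comparison is only one line against one line rather than two whole concatenated documents.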
@mikegerber I implemented the concept I described above (appears to work, but still needs more testing and some cleanup): rapidfuzz/rapidfuzz-cpp#58
- It reduces the memory usage from 32 bits to 3 bits per cell (around a 10x improvement)
- It significantly improves the performance for all strings, especially long ones, since it reduces the time complexity from O(N*M) to O(⌈N/64⌉*M). E.g. when using two strings with 20k characters I achieved around a 20x improvement in runtime.
Edit: I successfully fuzz tested the new implementation.
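The complexity reduction described above comes from updating a whole 64-character column of the matrix with a handful of word-sized bit operations. A minimal single-word sketch of that core (Myers'/Hyyrö's bit-vector recurrence, restricted to patterns of at most 64 characters and computing only the distance; the actual rapidfuzz-cpp code blocks longer strings into 64-bit words and additionally records the delta vectors needed for editops):

```python
def bitparallel_distance(a: str, b: str) -> int:
    """Levenshtein distance via Myers' bit-vector recurrence, for len(a) <= 64.
    Illustrative sketch only, not the rapidfuzz-cpp implementation."""
    m = len(a)
    if m == 0:
        return len(b)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    # peq[c] has bit i set iff a[i] == c (pattern match bitmasks).
    peq: dict[str, int] = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)
    vp, vn, dist = mask, 0, m        # vertical positive / negative delta vectors
    for c in b:                      # one cheap bit-vector update per text char
        eq = peq.get(c, 0)
        d0 = ((((eq & vp) + vp) & mask) ^ vp) | eq | vn   # diagonal zero deltas
        hp = vn | (~(d0 | vp) & mask)                     # horizontal +1 deltas
        hn = d0 & vp                                      # horizontal -1 deltas
        if hp & last:
            dist += 1
        if hn & last:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = hn | (~(d0 | hp) & mask)
        vn = d0 & hp
    return dist

assert bitparallel_distance("kitten", "sitting") == 3
```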
The improved version is available in v1.9.0: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.0
@mikegerber I tested @stweil's original files with dinglehopper on my laptop, which has only 8 GB of RAM:
(base) [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:22.97 real, 16.55 user, 7.33 sys, 5013484 mmem
So it is down to around 5 GB of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.
This is great news 😍 I think it was a great decision to use rapidfuzz as the backend library for dinglehopper - all the features I had wished for and with great support and improvements from you!
I'll be on vacation starting Thursday and I'll keep this issue open until I have tested this thoroughly (after the vacation). But for now I've bumped the dependency to rapidfuzz >= 1.9.1.
@mikegerber I finally came around to implementing editops for long sequences using a combination of Hirschberg's algorithm and the current algorithm. It splits the problem into subproblems until they are relatively small (around 2k characters) and then solves them using the existing bit-parallel algorithm.
This reduces memory usage to O(N), and since it jumps around less in memory it improves performance for long sequences as well.
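A minimal Python sketch of the Hirschberg idea (not the rapidfuzz implementation, which combines it with the bit-parallel algorithm and a small-subproblem cutoff): the alignment is found in linear memory by locating an optimal split point with two last-row-only distance passes, then recursing on the two halves.

```python
def last_row(a: str, b: str) -> list[int]:
    """Last row of the Levenshtein matrix of a vs b, in O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute / match
        prev = curr
    return prev

def hirschberg(a: str, b: str) -> list[tuple[str, str]]:
    """Optimal character alignment as (a_char, b_char) pairs, '' marking a gap."""
    if len(a) == 0:
        return [("", cb) for cb in b]
    if len(b) == 0:
        return [(ca, "") for ca in a]
    if len(a) == 1:
        k = b.find(a[0])
        k = 0 if k < 0 else k        # best match, or substitute the first char
        return ([("", cb) for cb in b[:k]] + [(a[0], b[k])]
                + [("", cb) for cb in b[k + 1:]])
    mid = len(a) // 2
    left = last_row(a[:mid], b)
    right = last_row(a[mid:][::-1], b[::-1])
    # Split b where the combined cost of the two halves is minimal.
    split = min(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:split]) + hirschberg(a[mid:], b[split:])
```

For example, hirschberg("kitten", "sitting") returns an alignment with exactly three non-matching pairs, i.e. edit cost 3. With that in place, the measurements for @stweil's files improve as follows: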
/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:06.40 real, 3.65 user, 2.73 sys, 3371976 mmem
improves to
/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:05.78 real, 3.86 user, 1.91 sys, 92228 mmem
Absolutely fantastic! I'll depend on the new version when it gets a release version!
Now that the memory consumption has been reduced from around 45 GB to less than 100 MB and is no longer quadratic in the text length, I think this issue has been resolved 😉
Yeah I think so too, just need to test it again!
Using @stweil's example:
% /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:04.15 real, 5.00 user, 1.35 sys, 76916 mmem
I've also tested the files that @cneud gave, which previously exploded to 40 GB of memory usage and made the system swap. Now they run in less than a second! I'll add them to the test suite.