Horrible failure with large documents
mikegerber opened this issue · 23 comments
@stweil reported in Gitter:
Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but runs out of memory: it was killed by the Linux kernel after using more than 60 GB RAM. :-(
@cneud also submitted a large document (a newspaper page).
- Investigate why the new version uses even more memory
- Consider falling back to more efficient algorithms if necessary
- Consider a regression test for this
I've asked @stweil to submit the texts, as I am curious why the new version uses even more memory. I'll add them here soon.
I used this command (use links to get texts and result):
dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630
Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?
@mikegerber I am not familiar with dinglehopper, but I assume the editops calculation requires quite a lot of memory. For long texts I currently create a Levenshtein matrix of len(s1) * len(s2) * 32 bits. Since the text files both have around 110k characters, this alone should use around 45 GB of memory.
The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This had not been a big priority so far (mostly because nobody complained), but if that's the cause I could work on this.
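For a rough sense of scale, the figure follows directly from the matrix dimensions. A back-of-the-envelope check, using the approximate character counts mentioned above:

```python
# Memory of a full Levenshtein matrix with one 32-bit integer per cell,
# for two texts of roughly 110k characters each (figures taken from the
# comment above, so only an estimate).
m = n = 110_000
cell_bytes = 4                                     # 32 bits per matrix cell
print(f"{m * n * cell_bytes / 1024**3:.1f} GiB")   # -> ~45.1 GiB
```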
An improved version could make use of bit-parallelism to improve the performance and only store bit vectors of the deltas between the matrix cells, based on https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf.
Performance-wise this would allow the implementation to calculate the matrix cells for 64 characters at once.
Memory-wise this should only require storing the vertical positive delta vector, the horizontal positive delta vector and the diagonal zero delta vector, and therefore only around len(s1) * len(s2) * 3 bits to store the matrix. So in the example above only around 4.2 GB instead of 45 GB.
Additionally, it is possible to ignore parts of the matrix that are not relevant, since they cannot be on the optimal path. In the best case (two strings of mostly similar length, which is the case here) this should allow the implementation to skip 25% of the matrix and further reduce the memory usage to around 3.15 GB.
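The same back-of-the-envelope calculation with three delta bits per cell, plus the optional ~25% pruning mentioned above, reproduces these figures:

```python
# Memory for the three delta bit-vectors (3 bits per matrix cell) instead of
# 32-bit integers, same ~110k-character texts as above (estimate only).
m = n = 110_000
full = m * n * 3 / 8 / 1024**3     # 3 bits per cell, in GiB
pruned = full * 0.75               # best case: ~25% of the cells can be skipped
print(f"{full:.2f} GiB full, {pruned:.2f} GiB pruned")  # -> ~4.2 GiB / ~3.2 GiB
```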
> The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This had not been a big priority so far (mostly because nobody complained), but if that's the cause I could work on this.
I will have a look into this in the next days; I have a hunch that it takes twice as much memory because of a memory leak. Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.
Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance I don't see why not to use it.
¹ or one of the alignments with the shortest possible distance, to be more precise
> Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?
I've opened #63 for this!
> I used this command (use links to get texts and result):
> dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630
Mirrored here:
> Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.
Yes, rapidfuzz should free the matrix (and as far as I am aware it does). Note that the CER calculation does not require much memory, since it only has to store the last matrix row. Only the character alignment requires the whole matrix.
> Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance I don't see why not to use it.
I already use this implementation to calculate the edit distance, just not when retrieving the editops (to keep the initial implementation simple).
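For context, the two operations discussed here are separate calls in rapidfuzz's Python API (module layout of current rapidfuzz releases; dinglehopper's own wrapper code may look different, and the example strings are made up):

```python
from rapidfuzz.distance import Levenshtein

gt = "Dieſes iſt ein Beiſpiel"     # made-up ground truth line
ocr = "Dieses ist ein Beispiel"    # made-up OCR output

# Distance (and thus CER) only needs one matrix row's worth of memory ...
dist = Levenshtein.distance(gt, ocr)
cer = dist / len(gt)               # a common CER definition: edit distance / |GT|

# ... while the character alignment (editops) is the memory-hungry part.
ops = Levenshtein.editops(gt, ocr)
print(cer, len(ops))
```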
@stweil Do you know why there are duplicate lines in the texts? It could be another bug in dinglehopper's text extraction. Would it be possible to get the PAGE versions of GT and OCR to check? (If necessary, DM me on Gitter and send it privately if that's a concern.)
The duplicate lines are fine. The texts were produced from single-line text files (not from PAGE / ALTO / hOCR) which we use for fine-tuning of existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.
While it is a problem that dinglehopper has issues with larger texts (i.e. newspaper pages), it is also best to feed it smaller texts when possible, as seems to be the case here. Alignment is O(length(GT) × length(OCR)).
Is this your use case?
- Compute CER and give a visualization for all text lines (line GT vs line OCR)
- Aggregate the CER over all text lines
If so, there are better ways to do it than concatenating the input. I can imagine implementing a special mode to read directories of line texts and summarize the result into one page and one global CER. Could you describe your input before the concatenation? (i.e. "one directory with *.gt.txt line texts and one directory with *.ocr.txt line texts with the same prefix")
Here I have one directory with *.gt.txt lines and several directories with OCR results (*.txt) which were produced with different software / models / process parameters. Each dinglehopper run should compare the GT directory with one of the OCR result directories. And yes, the file prefixes are the same.
But we also have the newspaper page use case.
Yes two problems with two distinct solutions :)
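A minimal sketch of what such a line-directory mode could compute for the use case described above; the helper name and the exact directory layout are hypothetical, and only rapidfuzz's Levenshtein distance is assumed:

```python
from pathlib import Path
from rapidfuzz.distance import Levenshtein

def aggregate_cer(gt_dir: str, ocr_dir: str) -> float:
    """Micro-averaged CER over matching *.gt.txt / *.txt line files
    (hypothetical helper, not part of dinglehopper)."""
    total_dist = total_chars = 0
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        ocr_file = Path(ocr_dir) / gt_file.name.replace(".gt.txt", ".txt")
        gt = gt_file.read_text(encoding="utf-8").strip()
        ocr = ocr_file.read_text(encoding="utf-8").strip()
        total_dist += Levenshtein.distance(gt, ocr)  # per-line edit distance
        total_chars += len(gt)
    return total_dist / total_chars
```

Per-line alignments stay cheap this way, since each comparison is only one line against one line rather than two whole concatenated documents.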
@mikegerber I implemented the concept I described above (appears to work, but still needs more testing and some cleanup): rapidfuzz/rapidfuzz-cpp#58
- It reduces the memory usage from 32 bits to 3 bits per cell (around a 10x improvement)
- It significantly improves the performance for all strings, especially long ones, since it reduces the time complexity from O(N*M) to O(⌈N/64⌉*M). E.g. when using two strings with 20k characters I achieved around a 20x improvement in runtime.
Edit: I successfully fuzz tested the new implementation.
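The complexity reduction described above comes from updating a whole 64-character column of the matrix with a handful of word-sized bit operations. A minimal single-word sketch of that core (Myers'/Hyyrö's bit-vector recurrence, restricted to patterns of at most 64 characters and computing only the distance; the actual rapidfuzz-cpp code blocks longer strings into 64-bit words and additionally records the delta vectors needed for editops):

```python
def bitparallel_distance(a: str, b: str) -> int:
    """Levenshtein distance via Myers' bit-vector recurrence, for len(a) <= 64.
    Illustrative sketch only, not the rapidfuzz-cpp implementation."""
    m = len(a)
    if m == 0:
        return len(b)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    # peq[c] has bit i set iff a[i] == c (pattern match bitmasks).
    peq: dict[str, int] = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)
    vp, vn, dist = mask, 0, m        # vertical positive / negative delta vectors
    for c in b:                      # one cheap bit-vector update per text char
        eq = peq.get(c, 0)
        d0 = ((((eq & vp) + vp) & mask) ^ vp) | eq | vn   # diagonal zero deltas
        hp = vn | (~(d0 | vp) & mask)                     # horizontal +1 deltas
        hn = d0 & vp                                      # horizontal -1 deltas
        if hp & last:
            dist += 1
        if hn & last:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = hn | (~(d0 | hp) & mask)
        vn = d0 & hp
    return dist

assert bitparallel_distance("kitten", "sitting") == 3
```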
The improved version is available in v1.9.0: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.0
@mikegerber I tested @stweil's original files with dinglehopper on my laptop, which has only 8 GB of RAM:
(base) [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:22.97 real, 16.55 user, 7.33 sys, 5013484 mmem
So it is down to around 5 GB of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.
This is great news 😍 I think it was a great decision to use rapidfuzz as the backend library for dinglehopper - all the features I had wished for and with great support and improvements from you!
I'll be on vacation starting Thursday and I'll keep this issue open until I have tested this thoroughly (after the vacation). But for now I've bumped the dependency to rapidfuzz >= 1.9.1.
@mikegerber I finally came around to implementing editops for long sequences using a combination of Hirschberg's algorithm and the current algorithm. It splits the problem into subproblems until they are relatively small (around 2k characters) and then solves them using the existing bit-parallel algorithm.
This reduces memory usage to O(N), and since it jumps around less in memory it improves performance for long sequences as well.
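A minimal Python sketch of the Hirschberg idea (not the rapidfuzz implementation, which combines it with the bit-parallel algorithm and a small-subproblem cutoff): the alignment is found in linear memory by locating an optimal split point with two last-row-only distance passes, then recursing on the two halves.

```python
def last_row(a: str, b: str) -> list[int]:
    """Last row of the Levenshtein matrix of a vs b, in O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute / match
        prev = curr
    return prev

def hirschberg(a: str, b: str) -> list[tuple[str, str]]:
    """Optimal character alignment as (a_char, b_char) pairs, '' marking a gap."""
    if len(a) == 0:
        return [("", cb) for cb in b]
    if len(b) == 0:
        return [(ca, "") for ca in a]
    if len(a) == 1:
        k = b.find(a[0])
        k = 0 if k < 0 else k        # best match, or substitute the first char
        return ([("", cb) for cb in b[:k]] + [(a[0], b[k])]
                + [("", cb) for cb in b[k + 1:]])
    mid = len(a) // 2
    left = last_row(a[:mid], b)
    right = last_row(a[mid:][::-1], b[::-1])
    # Split b where the combined cost of the two halves is minimal.
    split = min(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:split]) + hirschberg(a[mid:], b[split:])
```

For example, hirschberg("kitten", "sitting") returns an alignment with exactly three non-matching pairs, i.e. edit cost 3. With that in place, the measurements for @stweil's files improve as follows: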
/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:06.40 real, 3.65 user, 2.73 sys, 3371976 mmem
improves to
/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:05.78 real, 3.86 user, 1.91 sys, 92228 mmem
Absolutely fantastic! I'll depend on the new version when it gets a release version!
Now that the memory consumption has been reduced from around 45 GB to less than 100 MB and is no longer quadratic in the text length, I think this issue has been resolved 😉
Yeah I think so too, just need to test it again!
Using @stweil's example:
% /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
0:04.15 real, 5.00 user, 1.35 sys, 76916 mmem
I've also tested the files that @cneud gave, which previously exploded to 40 GB of memory usage and made the system swap. Now they run in less than a second! I'll add them to the test suite.