documentcloud/docsplit

Disk space leak when extracting text from PDF


I am trying to extract the text of this PDF: https://gofile.io/?c=6U8qE8. I have a Rack application running inside a Docker container on Ubuntu 18.04.

After calling Docsplit.extract_text('spec/test.pdf', ocr: true, language: 'eng', output: 'spec/output.txt'), I see that the gs process uses the most CPU, and I lose about 1 GB of disk space every 5 seconds until there is no space left.
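To narrow down where the space goes, a small sampler like the one below might help. It is only a diagnostic sketch, not part of Docsplit: the helper names dir_size_bytes and watch_dir_growth are made up for illustration, and watching Dir.tmpdir is an assumption, since the thread does not say where the files are actually being written.

```ruby
require 'find'
require 'tmpdir'

# Hypothetical helper: sum the sizes (in bytes) of all regular files under dir.
def dir_size_bytes(dir)
  total = 0
  Find.find(dir) do |path|
    begin
      total += File.size(path) if File.file?(path)
    rescue Errno::ENOENT
      # The file disappeared between listing and stat; skip it.
    end
  end
  total
end

# Hypothetical helper: run the block while a background thread records the
# size of dir every `interval` seconds. Returns the list of sampled sizes,
# with one final sample taken after the block finishes.
def watch_dir_growth(dir, interval: 1.0)
  samples = []
  watcher = Thread.new do
    loop do
      samples << dir_size_bytes(dir)
      sleep interval
    end
  end
  begin
    yield
  ensure
    watcher.kill
    watcher.join
  end
  samples << dir_size_bytes(dir)
  samples
end

# Possible usage (assumes the docsplit gem is loaded and that the growth
# happens under the system temp directory, which is a guess):
#
#   samples = watch_dir_growth(Dir.tmpdir, interval: 5) do
#     Docsplit.extract_text('spec/test.pdf', ocr: true, language: 'eng',
#                           output: 'spec/output.txt')
#   end
#   puts samples.inspect
```

If the samples stay flat for Dir.tmpdir, pointing the same sampler at the output directory or the working directory would be the next thing to try.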

Does anyone have an idea what is going wrong here?

While investigating an issue with a long-running Docsplit job on a PDF that contained no text, I ran into this same issue on my local dev machine (a Rails app running on a Vagrant instance running Ubuntu). After the job had run for 10+ minutes, I ran out of disk space. I killed the job and restarted my host machine to get 40 GB back.