RuntimeError: can't start new thread
I did a quick test and got the error below.
System information
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
Linux dev4-1 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Client: Docker Engine - Community
Version: 27.0.3
API version: 1.46
Go version: go1.21.11
Git commit: 7d4bcd8
Built: Sat Jun 29 00:02:33 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 27.0.3
API version: 1.46 (minimum version 1.24)
Go version: go1.21.11
Git commit: 662f78c
Built: Sat Jun 29 00:02:33 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.18
GitCommit: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
nvidia:
Version: 1.7.18
GitCommit: v1.1.13-0-g58aa920
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Error log
❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./inby.pdf
Unable to find image 'leofcardoso/pdf2pdfocr:latest' locally
latest: Pulling from leofcardoso/pdf2pdfocr
37aaf24cf781: Pull complete
da892f4d0cb0: Pull complete
df89c9ce1e48: Pull complete
d2a3165daa7e: Pull complete
663286a455ab: Pull complete
4f4fb700ef54: Pull complete
35693ee7cdbf: Pull complete
4215239b5448: Pull complete
Digest: sha256:6f446c6fa612ffd304bede285556cc0190f53c6506f8a7200a69a603261643a6
Status: Downloaded newer image for leofcardoso/pdf2pdfocr:latest
-------------------------------------
File: ./inby.pdf
[2024-07-10 01:00:35.107971] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-10 01:00:35.117933] [DEBUG] Tesseract version: 4
[2024-07-10 01:00:35.144010] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-10 01:00:35.151576] [DEBUG] Qpdf version: 10.6.3
[2024-07-10 01:00:35.151798] [DEBUG] Temp dir is /tmp/pdf2pdfocr_F7DGC/
[2024-07-10 01:00:35.151836] [DEBUG] Prefix is F7DGC
[2024-07-10 01:00:35.151884] [DEBUG] Script dir is /usr/local/bin/
[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
self.main_pool = multiprocessing.Pool(self.cpu_to_use)
File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
self._worker_handler.start()
File "/usr/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Hi there, thank you for reporting.
[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs
Please try troubleshooting by reducing the number of cores available to pdf2pdfocr. Use the "-j" flag with a float number.
-j PARALLEL_PERCENT run this percentual jobs in parallel (0 - 1.0] - multiply with the number of CPU cores (default = 1 [all cores])
@nguyenvulong can you please test the "-j" flag?
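As a rough illustration of how a "-j" value relates to the CPU counts printed in the logs in this thread (40 with the default, 4 with -j0.1), here is a minimal Python sketch. The helper name `cpus_to_use` and the rounding rule are assumptions made for illustration; pdf2pdfocr may compute the value differently.

```python
import os


def cpus_to_use(parallel_percent, total_cpus=None):
    """Hypothetical helper: scale the CPU count by a fraction in (0, 1.0]."""
    if not 0 < parallel_percent <= 1.0:
        raise ValueError("parallel_percent must be in (0, 1.0]")
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1
    # Assumed rounding rule; always keep at least one worker.
    return max(1, round(total_cpus * parallel_percent))


print(cpus_to_use(1.0, total_cpus=40))  # 40
print(cpus_to_use(0.1, total_cpus=40))  # 4
```

Whatever the exact rule is, a smaller "-j" only shrinks the pool size passed to multiprocessing.Pool.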
Hello, I tested your suggestion, but the error still persists:
❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -j0.1 -v -i indy.pdf
-------------------------------------
File: indy.pdf
[2024-07-12 01:34:21.316715] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-12 01:34:21.326141] [DEBUG] Tesseract version: 4
[2024-07-12 01:34:21.350514] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-12 01:34:21.358522] [DEBUG] Qpdf version: 10.6.3
[2024-07-12 01:34:21.358743] [DEBUG] Temp dir is /tmp/pdf2pdfocr_2W40R/
[2024-07-12 01:34:21.358780] [DEBUG] Prefix is 2W40R
[2024-07-12 01:34:21.358826] [DEBUG] Script dir is /usr/local/bin/
[2024-07-12 01:34:21.358910] [DEBUG] Parallel operations will use 4 CPUs
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
self.main_pool = multiprocessing.Pool(self.cpu_to_use)
File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
self._worker_handler.start()
File "/usr/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Also, I found that if I use a relative path, the file is not found:
❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -j0.1 -v -i ../input_pdf/indy.pdf
-------------------------------------
File: ../input_pdf/indy.pdf
[2024-07-12 01:35:46.805611] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-12 01:35:46.814532] [DEBUG] Tesseract version: 4
[2024-07-12 01:35:46.834198] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-12 01:35:46.839840] [DEBUG] Qpdf version: 10.6.3
Error: ../input_pdf/indy.pdf not found. Exiting.
Thank you @nguyenvulong
I searched for the bug and found this: https://forums.docker.com/t/runtimeerror-cant-start-new-thread/138142/3
But the "--privileged" flag with "docker run" is not recommended due to security issues.
Please try this: https://stackoverflow.com/questions/344203/maximum-number-of-threads-per-process-in-linux
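To check whether a thread limit is really what the container is hitting, a small diagnostic like the one below could be run inside the container (and on the host for comparison). It is only a sketch: the file paths are the usual Linux and cgroup v2 locations and may not exist on every setup, the function names are made up for this example, and a failure to start a thread can also have causes that these limits do not show.

```python
import resource
import threading
from pathlib import Path


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        return None


def thread_limits():
    """Collect the limits that commonly cap the number of threads on Linux."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,  # -1 means unlimited
        "rlimit_nproc_hard": hard,
        "kernel_threads_max": read_first_line("/proc/sys/kernel/threads-max"),
        # Assumed cgroup v2 location; None if not present.
        "cgroup_pids_max": read_first_line("/sys/fs/cgroup/pids.max"),
    }


def can_start_thread():
    """Try to start one no-op thread, the call that fails in the traceback."""
    try:
        worker = threading.Thread(target=lambda: None)
        worker.start()
        worker.join()
        return True
    except RuntimeError:
        return False


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
    print("can start thread:", can_start_thread())
```

If `can_start_thread()` returns False even though the limits look generous, the restriction is probably coming from somewhere else in the container setup.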
Thank you for your time. The previous issue disappeared when using the "--privileged" flag, but now it fails when writing the output file:
[2024-07-17 05:15:33.797851] [LOG] Converting input file to images...
[2024-07-17 05:15:35.053642] [LOG] Checking blank pages
[2024-07-17 05:15:35.554620] [LOG] Starting OCR with tesseract...
[2024-07-17 05:15:40.063411] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2024-07-17 05:15:43.068226] [LOG] OCR completed
[2024-07-17 05:15:43.069049] [DEBUG] We have 1 ocr'ed files
[2024-07-17 05:15:43.076980] [DEBUG] Joined ocr'ed PDF files
[2024-07-17 05:15:43.077054] [DEBUG] Merging with OCR
[2024-07-17 05:15:43.134226] [DEBUG] Autorotate skipped
[2024-07-17 05:15:43.134368] [DEBUG] Editing producer
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1530, in <module>
pdf2ocr.ocr()
File "/usr/local/bin/pdf2pdfocr.py", line 733, in ocr
self.edit_producer()
File "/usr/local/bin/pdf2pdfocr.py", line 1370, in edit_producer
with open(self.output_file, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/home/docker/indy-OCR.pdf'
Actually, I am more curious whether the problem I had (when running the toy example) is specific to my case, that is, a limited number of allowed threads on my machine, or whether it is a common issue that others here have also encountered.
I also mentioned the relative path problem in the previous comment. Maybe you'd want to check it out just in case.
Looks like write permission is missing on your working directory on the host (please note the use of "pwd" on the command line). Please see https://docs.docker.com/storage/bind-mounts/#choose-the--v-or---mount-flag
In your test case, the working dir $(pwd) is mapped to "/home/docker". You must have write permission on it to generate the output file.
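A quick way to confirm this is to try creating a file in the target directory with the same user that runs the OCR. The sketch below does that with a temporary file; the function name `can_write_output` is made up for this example, and "/home/docker" is assumed to be the container side of the bind mount, as in the commands in this thread.

```python
import os
import tempfile


def can_write_output(directory):
    """Return True if a file can actually be created in `directory`."""
    try:
        # The temporary file is removed again when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers PermissionError as well as a missing directory.
        return False


if __name__ == "__main__":
    # Assumption: /home/docker is where the bind mount lands in the container.
    target = "/home/docker" if os.path.isdir("/home/docker") else os.getcwd()
    print(target, "writable:", can_write_output(target))
```

Creating a real file is used instead of `os.access` because it exercises the same operation that fails in the traceback.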
I don't know if the thread issue is a common problem. You are the first to report it. :(
About the relative paths: it looks like paths with ".." do not work with the Docker "-v" mount, because only the mounted folder is shared with the container, but relative paths starting in the current folder "." should work.
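The following sketch shows why a ".." path ends up outside the mounted folder. It assumes the container's working directory is "/home/docker" (the output path in the traceback suggests this, but it is not confirmed), and the helper names are made up for this example.

```python
import posixpath
from pathlib import PurePosixPath

# Container side of: -v "$(pwd):/home/docker" (assumed to be the working dir too)
MOUNT_POINT = "/home/docker"


def resolve_in_container(relative_path, workdir=MOUNT_POINT):
    """Resolve a path the way it would be seen from the container's workdir."""
    return posixpath.normpath(posixpath.join(workdir, relative_path))


def inside_mount(relative_path, mount_point=MOUNT_POINT):
    """Return True if the resolved path stays within the mounted folder."""
    resolved = PurePosixPath(resolve_in_container(relative_path, mount_point))
    mount = PurePosixPath(mount_point)
    return resolved == mount or mount in resolved.parents


print(resolve_in_container("./inby.pdf"))             # /home/docker/inby.pdf
print(resolve_in_container("../input_pdf/indy.pdf"))  # /home/input_pdf/indy.pdf
print(inside_mount("../input_pdf/indy.pdf"))          # False
```

Under that assumption, "../input_pdf/indy.pdf" points at "/home/input_pdf/indy.pdf", which was never mounted. Mounting the parent folder, or running the command from the folder that contains the file, avoids the problem.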
Thank you Leo, I will keep an eye on this. Will reopen the issue if needed. Good day!