PDFium is not thread-safe when batch-parsing PDFs with Celery's thread pool
Closed this issue · 5 comments
🔎 Search before asking
- I have searched the MinerU Readme and found no similar bug report.
- I have searched the MinerU Issues and found no similar bug report.
- I have searched the MinerU Discussions and found no similar bug report.
🤖 Consult the online AI assistant for assistance
- I have consulted the online AI assistant but was unable to obtain a solution to the issue.
Description of the bug
With a Celery async queue running in thread mode, batch-reading different files through PDFium fails with the following error:
[2025-09-17 15:31:27] [/bin/sh]: Traceback (most recent call last):
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 453, in trace_task
[2025-09-17 15:31:27] [/bin/sh]: R = retval = fun(*args, **kwargs)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 736, in protected_call
[2025-09-17 15:31:27] [/bin/sh]: return self.run(*args, **kwargs)
[2025-09-17 15:31:27] [/bin/sh]: File "/app/mineru-api/task.py", line 200, in magic_file
[2025-09-17 15:31:27] [/bin/sh]: infer_results, all_image_lists, all_pdf_docs, lang_list, ocr_enabled_list = pipeline_doc_analyze(
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/mineru/backend/pipeline/pipeline_analyze.py", line 93, in doc_analyze
[2025-09-17 15:31:27] [/bin/sh]: if classify(pdf_bytes) == 'ocr':
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/mineru/utils/pdf_classify.py", line 29, in classify
[2025-09-17 15:31:27] [/bin/sh]: sample_pdf_bytes = extract_pages(pdf_bytes)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/mineru/utils/pdf_classify.py", line 191, in extract_pages
[2025-09-17 15:31:27] [/bin/sh]: pdf = pdfium.PdfDocument(src_pdf_bytes)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/pypdfium2/_helpers/document.py", line 78, in __init__
[2025-09-17 15:31:27] [/bin/sh]: self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/pypdfium2/_helpers/document.py", line 678, in _open_pdf
[2025-09-17 15:31:27] [/bin/sh]: raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
[2025-09-17 15:31:27] [/bin/sh]: pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).
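After investigating, the cause is that PDFium's document loading (PdfDocument()) is not thread-safe.
Even with a lock at the Python level it can still fail, because PDFium relies on global state internally.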
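For readers who want to try the locking route anyway, here is a minimal sketch. It assumes that the lock has to cover every PDFium call (open, read, and close), not only the PdfDocument() constructor, and that the document is closed explicitly inside the lock rather than left to the garbage collector. The names PDFIUM_LOCK, count_pages_serialized, and the injectable worker parameter are hypothetical and are not part of MinerU or pypdfium2. Whether this is enough for the MinerU pipeline is not confirmed here; the reporter saw failures even with a lock.

```python
import threading
from typing import Callable

# A single process-wide lock. PDFium keeps global state, so calls that touch
# *different* documents must not overlap either.
PDFIUM_LOCK = threading.Lock()


def _count_pages_with_pdfium(pdf_bytes: bytes) -> int:
    """Open a PDF from memory, read its page count, and close it again."""
    # Imported lazily so the wrapper below can be exercised without pypdfium2.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_bytes)
    try:
        return len(pdf)
    finally:
        # Close explicitly while the lock is still held, instead of letting
        # a finalizer call into PDFium later from an arbitrary thread.
        pdf.close()


def count_pages_serialized(
    pdf_bytes: bytes,
    worker: Callable[[bytes], int] = _count_pages_with_pdfium,
) -> int:
    """Run the whole PDFium section under the lock, one thread at a time."""
    with PDFIUM_LOCK:
        return worker(pdf_bytes)
```

The design choice is to serialize the complete open/use/close sequence as one unit. Any other code path in the same process that calls PDFium without taking the same lock would defeat the protection.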
How to reproduce the bug
.
Operating System Mode
Linux
Operating System Version
Ubuntu 22.04 (Docker)
Python version
3.10
Software version (mineru --version)
2.0.x
Device mode
cuda
Hi @Isfate! I'm Dosu and I’m helping the MinerU team.
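PDFium (through pypdfium2) is not thread-safe in multi-threaded environments such as Celery's thread mode. Its underlying C/C++ implementation holds global state, so a lock at the Python level cannot fully prevent concurrent conflicts, which easily leads to errors such as PDFium: Data format error. This problem is documented in the MinerU community and in the official pypdfium2 documentation. The recommended solution is to avoid calling PDFium concurrently from multiple threads and to use a multi-process mode instead (for example Celery's process mode or Python's multiprocessing). Each process loads and handles its PDFs independently, so no global state is shared and the thread-safety problem is avoided. See the linked discussion and official documentation.
If you must process PDFs in batches, refer to the multi_process_mineru.py example script in the MinerU repository, which batches work across processes. For GPU scenarios, make sure each process has its own share of GPU memory to avoid resource contention. For complex PDFs or unusual formats, you can also try reducing the batch size, switching the pipeline backend, or forcing OCR mode (--force_OCR), and check that the PDF files are intact.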
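As an illustration of the "process mode" suggestion, the following is a hedged configuration sketch for a Celery worker that uses the prefork pool instead of threads. The app name mineru_api and the broker URL are placeholders, not values from this issue, and the concurrency numbers are arbitrary. Each prefork child is a separate process and would load its own copy of the models, so worker_concurrency has to be sized to the available GPU memory.

```python
from celery import Celery

# Hypothetical app name and broker URL; replace with your own deployment values.
app = Celery("mineru_api", broker="redis://localhost:6379/0")

app.conf.update(
    # Use separate worker processes rather than threads, so that PDFium's
    # global state is never shared between concurrently running tasks.
    worker_pool="prefork",
    # Every child process holds its own models; keep this small on limited VRAM.
    worker_concurrency=2,
    # Recycle a child after a number of tasks to limit memory growth (optional).
    worker_max_tasks_per_child=50,
)
```

The same pool choice can usually be made on the command line when starting the worker, for example `celery -A task worker --pool=prefork --concurrency=2`, where `task` stands in for your own module name.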
But my GPU memory is limited and I don't have enough resources to run in multi-process mode. I have already replaced pypdfium2 with pypdf2+pymupdf.
Hello, I recently ran into this problem as well. Could you share your solution? Thanks!