PDFium thread-safety problem when batch-parsing PDFs with Celery in thread mode

Question

PDFium thread-safety problem when batch-parsing PDFs with Celery in thread mode

Closed this issue 2 months ago · 5 comments

Isfate commented 2 months ago

🔎 Search before asking

I have searched the MinerU Readme and found no similar bug report.
I have searched the MinerU Issues and found no similar bug report.
I have searched the MinerU Discussions and found no similar bug report.

🤖 Consult the online AI assistant for assistance

I have consulted the online AI assistant but was unable to obtain a solution to the issue.

Description of the bug

With a Celery async queue running in thread mode, batch-reading different files with PDFium fails with the following error:
[2025-09-17 15:31:27] [/bin/sh]: Traceback (most recent call last):
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 453, in trace_task
[2025-09-17 15:31:27] [/bin/sh]: R = retval = fun(*args, **kwargs)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 736, in protected_call
[2025-09-17 15:31:27] [/bin/sh]: return self.run(*args, **kwargs)
[2025-09-17 15:31:27] [/bin/sh]: File "/app/mineru-api/task.py", line 200, in magic_file
[2025-09-17 15:31:27] [/bin/sh]: infer_results, all_image_lists, all_pdf_docs, lang_list, ocr_enabled_list = pipeline_doc_analyze(
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/mineru/backend/pipeline/pipeline_analyze.py", line 93, in doc_analyze
[2025-09-17 15:31:27] [/bin/sh]: if classify(pdf_bytes) == 'ocr':
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/mineru/utils/pdf_classify.py", line 29, in classify
[2025-09-17 15:31:27] [/bin/sh]: sample_pdf_bytes = extract_pages(pdf_bytes)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/mineru/utils/pdf_classify.py", line 191, in extract_pages
[2025-09-17 15:31:27] [/bin/sh]: pdf = pdfium.PdfDocument(src_pdf_bytes)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/pypdfium2/_helpers/document.py", line 78, in __init__
[2025-09-17 15:31:27] [/bin/sh]: self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
[2025-09-17 15:31:27] [/bin/sh]: File "/usr/local/lib/python3.10/dist-packages/pypdfium2/_helpers/document.py", line 678, in _open_pdf
[2025-09-17 15:31:27] [/bin/sh]: raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
[2025-09-17 15:31:27] [/bin/sh]: pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

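After investigating, the cause is that PDFium's document loading (PdfDocument()) is not thread-safe.
Even with a lock at the Python level it can still fail, because PDFium uses global state internally.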

How to reproduce the bug

.

Operating System Mode

Linux

Operating System Version

ubuntu 22.04 docker

Python version

3.10

Software version (mineru --version)

2.0.x

Device mode

cuda

Answer 1 · 2025-09-17T07:50:11.000Z

Hi @Isfate! I'm Dosu and I’m helping the MinerU team.

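PDFium (via pypdfium2) is not thread-safe in multi-threaded environments such as Celery's thread mode. Its underlying C/C++ implementation keeps global state, so locking at the Python level cannot fully prevent concurrent conflicts, and errors such as PDFium: Data format error are a common result. This is explicitly noted in the MinerU community and in the official pypdfium2 documentation. The recommended solution is to avoid calling PDFium concurrently from multiple threads and to switch to a multi-process model (for example Celery's process mode or Python's multiprocessing), where each process loads PDFs and handles its tasks independently. Because the processes do not share global state, the thread-safety problem is avoided. See the related discussion and the official notes for details.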

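If you must process PDFs in batches, you can refer to the multi_process_mineru.py example script in the MinerU repository, which handles batches with multiple processes. In GPU scenarios, make sure each process has exclusive use of its GPU memory to avoid resource contention. For complex PDFs or unusual formats, you can also try reducing the batch size, switching the pipeline backend, or forcing OCR mode (--force_OCR), and check that the PDF files are intact. See the detailed notes.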
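The multi-process suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions and is not taken from the script mentioned above: `parse_one` is a hypothetical placeholder for whatever per-file parsing call you actually use, and `parse_batch` and its `executor_factory` parameter are names invented here. The point is that each file is opened and parsed entirely inside a worker process, so no PDFium state is shared between workers.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_one(path):
    """Hypothetical placeholder for the real per-file work.

    In a real worker, the PDF would be opened, parsed and closed here,
    entirely inside the worker process.
    """
    return f"parsed:{path}"


def parse_batch(paths, worker=parse_one, max_workers=2,
                executor_factory=ProcessPoolExecutor):
    """Run `worker` once per path and return {path: result}.

    With the default ProcessPoolExecutor, `worker` must be a module-level
    function so that it can be pickled and sent to the worker processes.
    """
    paths = list(paths)
    with executor_factory(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(worker, paths)))
```

When using the default process pool, call `parse_batch([...])` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. `max_workers` is deliberately small here: on a GPU machine each worker that loads models needs its own share of GPU memory.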

Answer 2 · 2025-09-17T10:02:51.000Z

However, my GPU memory is limited and I don't have enough resources for the multi-process approach, so I have replaced pypdfium2 with pypdf2+pymupdf.
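For readers in the same situation who want to stay in one process, another thing to try is funnelling every PDF-library call through a single dedicated thread. The pypdfium2 README describes PDFium as not thread-safe but usable from threaded code as long as only one PDFium call runs at a time. The sketch below is not the fix used in this thread and is untested against MinerU: `run_on_pdf_thread` and `load_and_classify` are hypothetical names, and the stand-in function does no real PDF work. Given the report above that a Python-level lock was not reliable, treat it as an experiment rather than a solution.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A single worker thread that owns every call into the PDF library.
_pdf_thread = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pdf-owner")


def run_on_pdf_thread(func, *args, **kwargs):
    """Run `func` on the dedicated PDF thread and block until it returns.

    Do not call this from inside a function that is already running on
    the PDF thread: the single worker would wait on itself and deadlock.
    """
    return _pdf_thread.submit(func, *args, **kwargs).result()


def load_and_classify(pdf_bytes):
    """Hypothetical stand-in for 'open document, inspect it, close it'.

    Real code would create, use and explicitly close the PDF objects
    inside this function so that none of them outlive the call.
    """
    return {"size": len(pdf_bytes), "thread": threading.current_thread().name}
```

A Celery task running in a thread pool would then call `run_on_pdf_thread(load_and_classify, pdf_bytes)` instead of touching the PDF library directly, while the GPU models stay loaded once in the same process.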

Answer 3 · 2025-09-29T02:18:29.000Z

Hi, I recently ran into this problem as well. Could you share your solution? Thanks!

Answer 4 · 2025-09-29T02:49:54.000Z

build.zip
These are roughly the files I changed. I may have missed some places, but I haven't found any problems so far.

Answer 5 · 2025-09-29T07:42:29.000Z

Great, thank you very much!