PaddlePaddle/PaddleNLP

[Bug]: Semantic search system in pipelines fails to parse scanned PDF files uploaded after startup

Opened this issue · 1 comment

Software environment

paddle-pipelines               0.6.2
paddle2onnx                    1.2.1
paddlefsl                      1.1.0
paddlenlp                      2.8.0
paddleocr                      2.7.3
paddlepaddle-gpu               2.6.0.post117

Duplicate issues

  • I have searched the existing issues

Error description

INFO:     127.0.0.1:43132 - "POST /file-upload HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/pipelines/base.py", line 446, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/base.py", line 120, in _dispatch_run
    return self._dispatch_run_general(self.run, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/base.py", line 164, in _dispatch_run_general
    output, stream = run_method(**run_inputs, **run_params)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 144, in run
    output, stream = run_indexing(documents=documents, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 110, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 229, in run_indexing
    embeddings = self.embed_documents(document_objects, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/dense.py", line 367, in embed_documents
    embeddings = self._get_predictions(passages, **kwargs)["passages"]
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/dense.py", line 292, in _get_predictions
    if "passages" in dicts[0]:
IndexError: list index out of range
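
The last frame shows `dicts[0]` being evaluated on an empty list, which suggests that no passages reached the retriever (presumably because no text was extracted from the scanned PDF). The following is a minimal sketch of that failure mode and of a guard against it. `get_predictions_sketch` is a hypothetical stand-in written for illustration; it is not the code in `pipelines/nodes/retriever/dense.py`.

```python
from typing import Dict, List


def get_predictions_sketch(dicts: List[Dict[str, str]]) -> Dict[str, list]:
    """Hypothetical stand-in for the failing step, not the pipelines source."""
    # Guard: with an empty input (e.g. no text extracted from a scanned PDF),
    # evaluating dicts[0] would raise IndexError, so return early instead.
    if not dicts:
        return {"passages": []}
    if "passages" in dicts[0]:
        return {"passages": [d["passages"] for d in dicts]}
    return {"passages": []}


# Reproduce the error message from the traceback with a plain empty list.
empty: List[Dict[str, str]] = []
try:
    empty[0]
except IndexError as exc:
    unguarded_error = str(exc)

# The guarded version returns an empty result instead of raising.
guarded_result = get_predictions_sketch(empty)
```

With a guard like this, an upload that yields no text could be reported as "no extractable text" rather than surfacing as a 500 error.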

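Steps to reproduce & code

In the web UI, using the file upload module on the left, uploading a scanned PDF file fails to parse. Uploading a non-scanned PDF works normally.
For scanned PDF files, is this something the repo cannot parse in the first place, or am I missing a component that needs to be installed?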

Hello, scanned PDFs are not currently supported. Contributions from developers are welcome.
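
Until scanned PDFs are supported, one possible workaround is to check before upload whether a file has a usable text layer. The sketch below assumes the per-page text has already been extracted by some PDF library of your choice; `looks_scanned` and the `min_chars` threshold are hypothetical names and values chosen for illustration, not part of pipelines.

```python
from typing import List


def looks_scanned(page_texts: List[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a PDF as scanned if its pages carry almost no text.

    page_texts is assumed to hold the text extracted from each page.
    min_chars is an arbitrary threshold (an assumption, tune as needed).
    """
    # Count only non-whitespace-trimmed characters across all pages.
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


# Pages with only whitespace are what an image-only PDF typically yields.
scanned_like = looks_scanned(["", "  \n"])
text_like = looks_scanned(["This page has a real text layer with enough characters."])
```

Files flagged this way could be skipped or routed to a separate OCR step before indexing, rather than sent to the upload endpoint as-is.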