Issue on PDF Loader - Embeddings:

Question

Opened this issue a year ago · 2 comments

Hello, I finally found a PDF Q&A with a free alternative to OpenAI.
I'm testing the code, but I'm 200% illiterate and dumb in coding. I'm trying to build a Gradio/Streamlit app that answers questions on a specific topic, basically assembling it like Lego and using ChatGPT to help me out. Maybe this app will give me visibility on the market, since I got fired in January (my field isn't coding related).

Can you help me figure out this error?
Thanks!

WARNING:unstructured:detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with another strategy.
WARNING:unstructured:Falling back to partitioning with ocr_only.

FileNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout)
567 env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 568 proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
569

10 frames
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

PDFInfoNotInstalledError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout)
592
593 except OSError:
--> 594 raise PDFInfoNotInstalledError(
595 "Unable to get page count. Is poppler installed and in PATH?"
596 )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
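
The final message points at the cause: pdf2image shells out to poppler's command-line tools, and the `pdfinfo` executable was not found on PATH. A minimal sketch of how that can be checked from a notebook cell, assuming only the standard library (the helper name `find_poppler_tools` is made up for illustration and is not part of pdf2image):

```python
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Return a dict mapping each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Print where each poppler executable lives, or flag it as missing.
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If `pdfinfo` shows up as missing on a Debian/Ubuntu-based environment such as Colab, installing the system package `poppler-utils` (for example with `!apt-get install -y poppler-utils`) is the usual way to provide it. Poppler is a system dependency, so it is not installed by pip along with pdf2image.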

Answer 1 · 2023-05-09T18:34:19.000Z

Seems I was missing a !pip install pdf-info, but there's a new error:

ImportError Traceback (most recent call last)
in <cell line: 1>()
1 index = VectorstoreIndexCreator(
2 embedding=HuggingFaceEmbeddings(),
----> 3 text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)).from_loaders(loaders)

4 frames
/usr/local/lib/python3.10/dist-packages/pdfminer/high_level.py in
6 from typing import Any, BinaryIO, Container, Iterator, Optional, cast
7
----> 8 from .converter import (
9 XMLConverter,
10 HTMLConverter,

ImportError: cannot import name 'HOCRConverter' from 'pdfminer.converter' (/usr/local/lib/python3.10/dist-packages/pdfminer/converter.py)
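
This ImportError says that the `pdfminer/converter.py` on disk does not define `HOCRConverter`, while `pdfminer/high_level.py` expects it. That may mean files from different releases are mixed, or that an older copy of the module was already loaded before an upgrade. This is a guess from the traceback, not a confirmed diagnosis. A small sketch for listing which related distributions are installed, using only the standard library (the helper name `installed_versions` and the default package list are assumptions chosen for illustration):

```python
from importlib import metadata


def installed_versions(names=("pdfminer", "pdfminer.six", "unstructured", "pdf2image")):
    """Return a dict mapping each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Show what pip has installed for each package of interest.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If both `pdfminer` and `pdfminer.six` are reported as installed, or the version changed during the session, reinstalling `pdfminer.six` and restarting the runtime is a reasonable next thing to try.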

Answer 2 · 2023-11-27T14:08:23.000Z

Works only with Python 3.9, and better still on WSL/Linux (Ubuntu 22.04).

WARNING:unstructured:detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with another strategy.
WARNING:unstructured:Falling back to partitioning with ocr_only.