A FiftyOne plugin for loading PDFs as images.
If you haven't already, install FiftyOne:
pip install fiftyone
Then install the plugin and its dependencies:
fiftyone plugins download https://github.com/brimoor/pdf-loader
brew install poppler
pip install pdf2image
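If the loader fails on first use, a missing dependency is a likely cause. The following is a minimal sketch of a prerequisite check using only the Python standard library. The function name `check_pdf_loader_prereqs` is hypothetical and not part of the plugin. It assumes Poppler's `pdftoppm` and `pdfinfo` binaries must be on your `PATH`, since those are the tools pdf2image invokes.

```python
import importlib.util
import shutil


def check_pdf_loader_prereqs():
    """Return a dict mapping each prerequisite name to True if it was found."""
    return {
        # find_spec() returns None when a top-level package is not installed
        "fiftyone": importlib.util.find_spec("fiftyone") is not None,
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        # Poppler command-line tools that pdf2image shells out to
        "pdftoppm": shutil.which("pdftoppm") is not None,
        "pdfinfo": shutil.which("pdfinfo") is not None,
    }


if __name__ == "__main__":
    for name, found in check_pdf_loader_prereqs().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Note that `brew install poppler` applies to macOS; on other platforms Poppler is typically installed through the system package manager.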
- Launch the App:
import fiftyone as fo
dataset = fo.Dataset()
session = fo.launch_app(dataset)
- Press ` or click the Browse operations icon above the grid
- Run the pdf_loader operator
Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!
- Install the plugins and their dependencies:
fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract
fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers
- Launch a Qdrant server:
docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant
- Run the run_ocr_engine operator to detect text blocks
- Run the create_semantic_document_index operator to generate a semantic index for the text blocks
- Run the semantically_search_documents operator to perform arbitrary searches against the index!
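Before building the semantic index, it can help to confirm that the Qdrant container from the step above is accepting connections. The sketch below is one possible check, not part of either plugin. The helper name `port_is_open` is hypothetical, and it assumes the server is exposed on `localhost:6333` as in the `docker run` command above.

```python
import socket


def port_is_open(host="localhost", port=6333, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:6333")
    else:
        print("Nothing is listening on localhost:6333; is the container running?")
```

A successful connection only shows that some process is listening on the port; it does not verify that the process is Qdrant.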
This plugin is basically a wrapper around the following code:
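import os
import fiftyone as fo
from pdf2image import convert_from_path

INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"

# Render each page of the PDF as a JPEG in OUTPUT_DIR
os.makedirs(OUTPUT_DIR, exist_ok=True)
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")

# Add the page images to a FiftyOne dataset
dataset = fo.Dataset()
dataset.add_images_dir(OUTPUT_DIR)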
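The same pattern extends to several PDFs. The sketch below is an illustration under stated assumptions, not the plugin's implementation: `page_dir_for` and `load_pdfs` are hypothetical helpers, and the one-subdirectory-per-PDF layout is just one way to keep pages from different documents apart.

```python
import os


def page_dir_for(pdf_path, output_root):
    """Return a per-PDF output directory named after the PDF (hypothetical layout)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_root, stem)


def load_pdfs(pdf_paths, output_root, dataset):
    """Convert each PDF to page images and add them to a FiftyOne dataset."""
    # Imported lazily so page_dir_for() is usable without pdf2image installed
    from pdf2image import convert_from_path

    for pdf_path in pdf_paths:
        out_dir = page_dir_for(pdf_path, output_root)
        os.makedirs(out_dir, exist_ok=True)
        # Same call as in the snippet above, once per document
        convert_from_path(pdf_path, output_folder=out_dir, fmt="jpg")
        dataset.add_images_dir(out_dir)
```

With `dataset = fo.Dataset()`, calling `load_pdfs(["/path/to/a.pdf", "/path/to/b.pdf"], "/path/for/page/images", dataset)` would place the pages under `a/` and `b/` inside the output root. The paths here are placeholders.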