PDF Loader Plugin

A FiftyOne plugin for loading PDFs as images.

Installation

If you haven't already, install FiftyOne:

pip install fiftyone

Then install the plugin and its dependencies:

fiftyone plugins download https://github.com/brimoor/pdf-loader

brew install poppler
pip install pdf2image
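
Before loading any PDFs, it can be useful to confirm that both dependencies are visible from Python. The following is a minimal sketch, not part of the plugin: the helper name `missing_dependencies` is hypothetical, and it assumes that pdf2image relies on Poppler's `pdfinfo` and `pdftoppm` command-line tools being on the `PATH`.

```python
import importlib.util
import shutil


def missing_dependencies():
    """Return the names of PDF-loading dependencies that cannot be found.

    Hypothetical helper for illustration; not part of the plugin.
    """
    missing = []

    # pdf2image is the Python package installed with `pip install pdf2image`
    if importlib.util.find_spec("pdf2image") is None:
        missing.append("pdf2image")

    # Assumption: pdf2image shells out to these Poppler command-line tools,
    # so both need to be discoverable on the PATH
    poppler_tools = ("pdfinfo", "pdftoppm")
    if any(shutil.which(tool) is None for tool in poppler_tools):
        missing.append("poppler")

    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All dependencies found")
```

An empty list means both the Python package and the Poppler tools were located.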

Usage

Launch the App:

import fiftyone as fo

dataset = fo.Dataset()
session = fo.launch_app(dataset)

Press ` or click the Browse operations icon above the grid
Run the pdf_loader operator

What next?

Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!

Install the plugins and their dependencies:

fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract

fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers

Launch a Qdrant server:

docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant

Run the run_ocr_engine operator to detect text blocks
Run the create_semantic_document_index operator to generate a semantic index for the text blocks
Run the semantically_search_documents operator to perform arbitrary searches against the index!
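
If the indexing or search operators fail to connect, a quick check that the Qdrant container is accepting connections may help. This is a rough sketch using only the standard library: the function name `qdrant_port_is_open` is hypothetical, and the default port 6333 is taken from the docker command above. It only tests that the TCP port is open, not that the service behind it is actually Qdrant.

```python
import socket


def qdrant_port_is_open(host="localhost", port=6333, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration. It checks only that something
    is listening on the port, not that it is a healthy Qdrant server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts
        return False


if __name__ == "__main__":
    if qdrant_port_is_open():
        print("Port 6333 is open")
    else:
        print("Nothing is listening on port 6333")
```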

Implementation

This plugin is basically a wrapper around the following code:

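import os

import fiftyone as fo
from pdf2image import convert_from_path

INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"

# Render each page of the PDF to a JPEG in OUTPUT_DIR
os.makedirs(OUTPUT_DIR, exist_ok=True)
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")

# Add the page images to a dataset
dataset = fo.Dataset()
dataset.add_images_dir(OUTPUT_DIR)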
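
When several PDFs end up in one dataset, it may help to record which document and page each image came from. The sketch below is one possible variation on the code above, not what the plugin itself does. The names `page_records` and `load_pdf` and the fields `page_number` and `source_pdf` are hypothetical, and it assumes `convert_from_path(..., paths_only=True)` returns the image paths in page order.

```python
import os


def page_records(image_paths, source_pdf):
    """Pair each page image with its 1-based page number and source PDF."""
    return [
        {"filepath": path, "page_number": i, "source_pdf": source_pdf}
        for i, path in enumerate(image_paths, start=1)
    ]


def load_pdf(dataset, pdf_path, output_dir, dpi=200):
    """Convert a PDF to page images and add them to a FiftyOne dataset.

    Hypothetical helper for illustration; not part of the plugin.
    """
    # Imported here so page_records() stays usable without these packages
    import fiftyone as fo
    from pdf2image import convert_from_path

    os.makedirs(output_dir, exist_ok=True)

    # paths_only=True returns the written image paths instead of PIL images.
    # Assumption: the paths come back in page order.
    paths = convert_from_path(
        pdf_path,
        dpi=dpi,
        output_folder=output_dir,
        fmt="jpg",
        paths_only=True,
    )

    samples = []
    for record in page_records(paths, pdf_path):
        sample = fo.Sample(filepath=record["filepath"])
        sample["page_number"] = record["page_number"]
        sample["source_pdf"] = record["source_pdf"]
        samples.append(sample)

    dataset.add_samples(samples)
    return samples
```

With a dataset created as in the Usage section, a call such as `load_pdf(dataset, "/path/to/your.pdf", "/path/for/page/images")` would add one sample per page, which can then be filtered or sorted by `page_number` in the App.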