PDF Loader Plugin

A FiftyOne plugin for loading PDFs as images.

Installation

If you haven't already, install FiftyOne:

pip install fiftyone

Then install the plugin and its dependencies:

fiftyone plugins download https://github.com/brimoor/pdf-loader

brew install poppler
pip install pdf2image
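
Note that `brew` is the macOS package manager; on other platforms, install Poppler with your system's package manager. Since pdf2image calls Poppler's command-line tools, it can help to confirm they are on your PATH before running the plugin. The snippet below is a minimal sanity check, assuming `pdftoppm` and `pdfinfo` are the binaries pdf2image needs; the helper name `find_poppler_tools` is hypothetical and not part of the plugin.

```python
import shutil

# Poppler command-line tools that pdf2image is assumed to call
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If either tool is reported as not found, the Poppler install likely did not succeed or its `bin` directory is not on your PATH.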

Usage

  1. Launch the App:

import fiftyone as fo

dataset = fo.Dataset()
session = fo.launch_app(dataset)

  2. Press ` or click the Browse operations icon above the grid

  3. Run the pdf_loader operator

What next?

Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!

  1. Install the plugins and their dependencies:

fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract

fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers

  2. Launch a Qdrant server:

docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant

  3. Run the run_ocr_engine operator to detect text blocks

  4. Run the create_semantic_document_index operator to generate a semantic index for the text blocks

  5. Run the semantically_search_documents operator to perform arbitrary searches against the index!
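
The indexing and search operators need the Qdrant server to be running, so it may be worth checking that the container is reachable before you invoke them. The following is a rough sketch that only tests whether a TCP connection to port 6333 (the first port published by the `docker run` command in the steps above) succeeds; it does not verify that the service behind the port is actually Qdrant. The helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host="localhost", port=6333, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False


if __name__ == "__main__":
    status = "reachable" if is_port_open() else "not reachable"
    print(f"Qdrant on localhost:6333 is {status}")
```

If the port is not reachable, check that the container is still running with `docker ps`.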

Implementation

This plugin is basically a wrapper around the following code:

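import os

import fiftyone as fo
from pdf2image import convert_from_path

INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"

# Render each page of the PDF to a JPEG in OUTPUT_DIR
os.makedirs(OUTPUT_DIR, exist_ok=True)
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")

# Load the page images into a FiftyOne dataset
dataset = fo.Dataset()
dataset.add_images_dir(OUTPUT_DIR)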
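
If you want each page image to remember which PDF and page it came from, the same idea can be extended slightly. The sketch below is one possible variation, not the plugin's actual implementation: it assumes `convert_from_path(..., paths_only=True)` returns the written image paths in page order, and the helper names `page_records` and `load_pdf` and the field names `source_pdf` and `page_number` are all hypothetical.

```python
import os


def page_records(pdf_path, image_paths):
    """Pair each page image with its source PDF and 1-based page number."""
    source = os.path.abspath(pdf_path)
    return [
        {"filepath": path, "source_pdf": source, "page_number": i}
        for i, path in enumerate(image_paths, start=1)
    ]


def load_pdf(dataset, pdf_path, output_dir):
    """Render a PDF to JPEGs and add one sample per page to the dataset."""
    # Imported here so page_records() can be used without these packages
    import fiftyone as fo
    from pdf2image import convert_from_path

    os.makedirs(output_dir, exist_ok=True)
    # paths_only=True returns the written image paths instead of PIL images
    image_paths = convert_from_path(
        pdf_path, output_folder=output_dir, fmt="jpg", paths_only=True
    )

    samples = []
    for record in page_records(pdf_path, image_paths):
        sample = fo.Sample(filepath=record["filepath"])
        # Illustrative field names; choose whatever suits your dataset
        sample["source_pdf"] = record["source_pdf"]
        sample["page_number"] = record["page_number"]
        samples.append(sample)

    dataset.add_samples(samples)
    return samples
```

With these fields in place, you could call `load_pdf(dataset, "/path/to/your.pdf", "/path/for/page/images")` once per document and later filter or sort pages by `source_pdf` and `page_number` in the App.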