PDFz

Developed by codad5

PDFz streamlines the extraction and processing of text from PDF files so you can manage and analyze large volumes of documents with minimal effort. Built on a microservices architecture, it achieves high performance through:

  • Extractor Service (Rust): Processes PDF files and extracts text using configurable extraction engines. While Tesseract OCR is supported, PDFz is designed to work with multiple extraction methods.
  • API Service (Express & TypeScript): Provides endpoints for file uploads, processing, progress tracking, and interacting with advanced extraction and model-based processing.
  • Redis: Caches and tracks file and model processing progress.
  • RabbitMQ: Manages message queuing between services.
  • Model-Based Processing: Integrates with engines like Ollama for advanced text processing using locally hosted large language models (LLMs).

Features

  • File Upload: Send PDF files to the API.
  • Multi-Engine File Processing: Choose your extraction engine—whether Tesseract OCR, Ollama, or others—to process PDFs asynchronously.
  • OCR & Model-Based Extraction:
    • Use Tesseract OCR for traditional optical character recognition.
    • Leverage model-based extraction (e.g., using Ollama) for advanced processing such as summarization, question-answering, or generating insights.
  • Progress Tracking: Monitor file processing progress in real time.
  • Processed Content Retrieval: Get back JSON with extracted content.
  • Model Management:
    • Pull and download a specified model if it isn’t available locally.
    • Track model download progress.
    • List available models for advanced extraction needs.

Architecture

  • API Service (Express & TypeScript):
    Provides endpoints for:

    • Web Interface files (/web)
    • Uploading files (/upload)
    • Initiating file processing (/process/:id)
    • Checking file processing progress (/progress/:id)
    • Retrieving processed content (/content/:id)
    • Managing models (pulling via /model/pull, tracking progress with /model/progress/:name, and listing models with /models)
  • Extractor Service (Rust):
    Processes queued PDF files using the chosen extraction engine. It supports both traditional OCR (e.g., Tesseract) and model-based extraction (e.g., via Ollama) and interacts with Redis and RabbitMQ for job tracking.

  • Redis:
    Maintains state and progress information for file and model processing.

  • RabbitMQ:
    Facilitates job dispatching between the API and Extractor services.

  • Ollama & Other Engines:
    Provide advanced processing capabilities by serving locally hosted language models. The system is extensible to support additional extraction or processing engines in the future.


API Endpoints

Welcome

GET /

Returns a welcome message:

PDFz server is life 🔥🔥

Web Interface

GET /web
  • Serves the web interface

Upload a File

POST /upload

Request: multipart form-data containing a PDF file.

Response Example:

{
  "success": true,
  "message": "File uploaded successfully",
  "data": {
    "id": "file-id",
    "filename": "file.pdf",
    "path": "/shared_storage/upload/pdf/file.pdf",
    "size": 12345
  }
}
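As a rough client-side sketch of this call (the base URL, port, and the `pdf` field name are assumptions, not confirmed by this document):

```typescript
// Hypothetical client helper for the upload endpoint above.
// Assumes Node 18+ (global fetch/FormData) and that the multipart
// field is named "pdf" -- adjust to match the actual API.
async function uploadPdf(
  baseUrl: string,
  file: Blob,
  filename: string
): Promise<{ id: string; filename: string; path: string; size: number }> {
  const form = new FormData();
  form.append("pdf", file, filename);
  const res = await fetch(`${baseUrl}/upload`, { method: "POST", body: form });
  if (!res.ok) throw new Error(`Upload failed with status ${res.status}`);
  const body = await res.json();
  return body.data; // { id, filename, path, size } as in the response example
}
```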

Process a File

POST /process/:id

Request: JSON body with processing options:

  • startPage (default: 1)
  • pageCount (default: 0)
  • priority (default: 1)
  • engine — extraction engine (e.g., "tesseract" or "ollama")
  • model — required if the selected engine is model-based (e.g., "ollama")

Examples:

Using Tesseract:

{
  "startPage": 1,
  "pageCount": 10,
  "priority": 1,
  "engine": "tesseract"
}

Using Ollama:

{
  "startPage": 1,
  "pageCount": 10,
  "priority": 1,
  "engine": "ollama",
  "model": "llama3.2-vision"  // ":latest" will be appended if no tag is provided
}

Response Example:

{
  "success": true,
  "message": "File processing started",
  "data": {
    "id": "file-id",
    "file": "file.pdf",
    "options": {
      "startPage": 1,
      "pageCount": 10,
      "priority": 1
    },
    "status": "queued",
    "progress": 0,
    "queuedAt": "2023-10-01T12:00:00Z"
  }
}
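The tag defaulting noted in the Ollama example above (":latest" is appended when no tag is given) can be sketched as a small helper (illustrative only, not the API's actual code):

```typescript
// Append ":latest" when the model name carries no explicit tag,
// mirroring the behavior described for the "model" option.
function normalizeModelName(model: string): string {
  return model.includes(":") ? model : `${model}:latest`;
}

// normalizeModelName("llama3.2-vision")        -> "llama3.2-vision:latest"
// normalizeModelName("llama3.2-vision:latest") -> "llama3.2-vision:latest"
```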

Track File Processing Progress

GET /progress/:id

Response Example:

{
  "success": true,
  "message": "Progress retrieved successfully",
  "data": {
    "id": "file-id",
    "progress": 50,
    "status": "processing"
  }
}
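A client can poll this endpoint until processing finishes; a minimal sketch (the polling interval and the set of non-terminal statuses are assumptions based on the examples in this document):

```typescript
// Poll GET /progress/:id until the job leaves "queued"/"processing".
// Hypothetical helper; status names follow the response examples above.
async function waitForCompletion(
  baseUrl: string,
  id: string,
  intervalMs = 1000
): Promise<{ id: string; progress: number; status: string }> {
  for (;;) {
    const res = await fetch(`${baseUrl}/progress/${id}`);
    const { data } = await res.json();
    if (data.status !== "queued" && data.status !== "processing") return data;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```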

Retrieve Processed Content

GET /content/:id

Response Example:

{
  "success": true,
  "message": "Processed content retrieved successfully",
  "data": {
    "id": "file-id",
    "content": [
      {
        "page_num": 1,
        "text": "Text from page 1."
      },
      {
        "page_num": 2,
        "text": "Text from page 2."
      }
    ],
    "status": "completed"
  }
}
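Since the content comes back as one entry per page, a small helper can stitch it into a single string (illustrative, not part of the API):

```typescript
// Shape of each entry in data.content, per the response example above.
interface PageText {
  page_num: number;
  text: string;
}

// Sort by page number and join the page texts with blank lines.
function joinPages(pages: PageText[]): string {
  return pages
    .slice() // avoid mutating the caller's array
    .sort((a, b) => a.page_num - b.page_num)
    .map((p) => p.text)
    .join("\n\n");
}
```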

Pull a Model (for Model-Based Extraction)

POST /model/pull

Request: JSON body with the model name:

{
  "model": "model-name"
}

Response Examples:

  • If the model already exists:

    {
      "success": true,
      "message": "Model already exists locally",
      "model": "model-name",
      "status": "exists"
    }
  • If the model is queued for download:

    {
      "success": true,
      "message": "Model download queued successfully",
      "model": "model-name",
      "status": "queued",
      "progress": 0
    }

Track Model Download Progress

GET /model/progress/:name

Response Example:

{
  "success": true,
  "message": "Model progress retrieved successfully",
  "data": {
    "name": "model-name",
    "progress": 75,
    "status": "downloading"
  }
}

List Available Models

GET /models

Response Example:

{
  "success": true,
  "message": "Models retrieved successfully",
  "data": {
    "models": [
      {
        "name": "model1:latest",
        "size": "1.2GB",
        "modified_at": "2023-10-01T12:00:00Z"
      },
      {
        "name": "model2:latest",
        "size": "900MB",
        "modified_at": "2023-09-28T08:30:00Z"
      }
    ]
  }
}

Setup

Prerequisites

For Docker Deployment:

  • Docker & Docker Compose

For Local Development:

API Service (Node.js & Express):

  • Node.js & npm
  • Redis
  • RabbitMQ

Extractor Service (Rust):

  • Rust & Cargo
  • Redis
  • RabbitMQ
  • At least one extraction engine (e.g., Tesseract OCR or an alternative)

Ollama Service (for model-based extraction):

  • Docker container (or a local installation of Ollama)

Installation

  1. Clone the Repository:

    git clone https://github.com/codad5/pdfz.git
    cd pdfz
  2. Create an .env File:

    cp .env.example .env
  3. Update Environment Variables:
    Modify the .env file to set your ports, RabbitMQ and Redis credentials, and extraction/model settings.

  4. Build and Start the Services:

    docker-compose up --build

Services & Environment Variables

Extractor Service (Rust)

  • RUST_LOG=debug
  • REDIS_URL — Redis connection URL
  • RABBITMQ_URL — RabbitMQ connection URL (e.g., amqp://user:pass@rabbitmq:5672)
  • EXTRACTOR_PORT — Port for the Extractor Service
  • SHARED_STORAGE_PATH — Mount point for file storage
  • TRAINING_DATA_PATH — Path to training data for extraction engines
  • OLLAMA_BASE_URL — Base URL for Ollama (e.g., http://ollama:11434)
  • OLLAMA_BASE_PORT — Ollama port (e.g., 11434)
  • OLLAMA_BASE_HOST — Host for Ollama

API Service (Node.js)

  • NODE_ENV=development
  • REDIS_URL — Redis connection URL
  • RABBITMQ_URL — RabbitMQ connection URL
  • API_PORT — Port for the API service
  • SHARED_STORAGE_PATH — Mount point for file storage
  • RABBITMQ_EXTRACTOR_QUEUE — Queue name for file extraction requests
  • OLLAMA_BASE_URL — Base URL for Ollama
  • OLLAMA_BASE_PORT — Ollama port
  • OLLAMA_BASE_HOST — Host for Ollama
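As an example, a .env might combine the variables above like this (ports, credentials, and the queue name are placeholders, not defaults from the project):

```
# Shared infrastructure
REDIS_URL=redis://redis:6379
RABBITMQ_URL=amqp://user:pass@rabbitmq:5672
RABBITMQ_EXTRACTOR_QUEUE=extractor_queue
SHARED_STORAGE_PATH=/shared_storage

# API service
NODE_ENV=development
API_PORT=3000

# Extractor service
RUST_LOG=debug
EXTRACTOR_PORT=8080
TRAINING_DATA_PATH=/training_data

# Ollama (model-based extraction)
OLLAMA_BASE_HOST=ollama
OLLAMA_BASE_PORT=11434
OLLAMA_BASE_URL=http://ollama:11434
```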

Docker Compose Setup

Check the docker-compose.yml file in the repository to see how the services are defined and connected.
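As a rough outline only (service names and wiring here are assumptions; the actual file in the repository is authoritative):

```yaml
# Illustrative outline -- see docker-compose.yml in the repo for the real definitions.
services:
  api:        # Express/TypeScript API, talks to Redis and RabbitMQ
  extractor:  # Rust worker consuming jobs from RabbitMQ
  redis:      # progress/state cache
  rabbitmq:   # job queue between api and extractor
  ollama:     # optional, serves local models for model-based extraction
```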


Repository

For more details, visit the GitHub repository: https://github.com/codad5/pdfz


Contributing

  1. Fork the repository and create a new branch.
  2. Make changes and test locally.
  3. Submit a pull request.

License

MIT License