
Extract structured data about courses, accommodations, and pricing from school prospectuses


School Information Parser

A FastAPI application that processes PDF files containing language school information using OpenAI's GPT-4 Vision API. The application extracts structured data about courses, accommodations, and pricing.



Read Notion.md for more details.

Features

  • Asynchronous PDF processing with background jobs
  • Redis-based job queue system
  • Colored logging with file and console output
  • Docker containerization
  • Callback support for job completion notifications
  • Structured data extraction using Pydantic models
  • Automatic API documentation with Swagger UI
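The colored logging with file and console output listed above could be built on the standard `logging` module roughly as follows. This is a minimal sketch, not the contents of `src/logger.py`: the names `ColorFormatter` and `build_logger`, the color palette, and the log format are all assumptions made for illustration.

```python
import logging
import sys

# ANSI color codes per level; the palette is an arbitrary choice for this sketch.
COLORS = {
    logging.DEBUG: "\033[36m",     # cyan
    logging.INFO: "\033[32m",      # green
    logging.WARNING: "\033[33m",   # yellow
    logging.ERROR: "\033[31m",     # red
    logging.CRITICAL: "\033[35m",  # magenta
}
RESET = "\033[0m"


class ColorFormatter(logging.Formatter):
    """Wrap each formatted record in the color assigned to its level."""

    def format(self, record: logging.LogRecord) -> str:
        text = super().format(record)
        color = COLORS.get(record.levelno, "")
        return f"{color}{text}{RESET}" if color else text


def build_logger(name: str, log_path: str) -> logging.Logger:
    """Return a logger with a colored console handler and a plain file handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger

    fmt = "%(asctime)s %(levelname)s %(name)s: %(message)s"

    console = logging.StreamHandler(sys.stdout)
    console.setFormatter(ColorFormatter(fmt))
    logger.addHandler(console)

    # The file handler uses the plain formatter so log files stay free of escape codes.
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(file_handler)
    return logger
```

Calling `build_logger("parser", "app.log")` once at startup and reusing the returned logger keeps both outputs in sync; the file path here is a placeholder.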

Prerequisites

  • Python 3.9+
  • Docker and Docker Compose
  • OpenAI API key
  • Redis server

Installation

  1. Clone the repository:
git clone https://github.com/concaption/school-info-parser.git
cd school-info-parser
  2. Create and populate a .env file:
OPENAI_API_KEY=your_api_key_here
REDIS_HOST=redis
  3. Build and run with Docker Compose:
docker-compose up --build
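
However the two variables from the .env file reach the process, the application has to read and validate them at startup. The sketch below shows one way to do that with the standard library only. The `Settings` class, the `load_settings` function, and the `localhost` fallback for `REDIS_HOST` are assumptions for illustration, not the project's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two variables named in the .env file."""
    openai_api_key: str
    redis_host: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the given mapping, defaulting to the process environment."""
    env = os.environ if env is None else env

    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # Fail early rather than on the first OpenAI request.
        raise RuntimeError("OPENAI_API_KEY is not set")

    # Assumption: fall back to localhost when running outside Docker Compose.
    redis_host = env.get("REDIS_HOST", "localhost")
    return Settings(openai_api_key=api_key, redis_host=redis_host)
```

Passing an explicit mapping, as in `load_settings({"OPENAI_API_KEY": "test"})`, makes the function easy to exercise in tests without touching the real environment.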

API Endpoints

  • GET / - Redirects to API documentation
  • POST /submit-job/ - Submit PDFs for processing
  • GET /job/{job_id} - Check job status and results

Usage

  1. Access the API documentation:
http://localhost:8000/docs
  2. Submit a PDF file for processing:
curl -X POST "http://localhost:8000/submit-job/" \
     -H "accept: application/json" \
     -H "Content-Type: multipart/form-data" \
     -F "files=@your_pdf_file.pdf"
  3. Check job status:
curl -X GET "http://localhost:8000/job/{job_id}"
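
Because jobs run in the background, a client usually has to call GET /job/{job_id} repeatedly until the job finishes. The sketch below polls with the standard library. The response shape is an assumption: it presumes the JSON body has a `status` field whose final values are `completed` or `failed`, which this README does not confirm, so adjust `done_states` to match the real API.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Sequence

BASE_URL = "http://localhost:8000"


def fetch_job(job_id: str) -> Dict[str, Any]:
    """GET /job/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/job/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_job,
    interval: float = 2.0,
    timeout: float = 300.0,
    done_states: Sequence[str] = ("completed", "failed"),  # assumed status values
) -> Dict[str, Any]:
    """Poll until the job reaches a final state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in done_states:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
        time.sleep(interval)
```

The `fetch` parameter exists so the polling loop can be tested with a fake function instead of a live server.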

Development

  1. Create a virtual environment:
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate     # Windows
  2. Install dependencies:
pip install -r requirements.txt
  3. Run tests:
pytest

Project Structure

school-info-parser/
├── src/
│   ├── parser.py      # PDF processing logic
│   ├── schema.py      # Pydantic models
│   ├── logger.py      # Logging configuration
│   ├── prompts.py     # OpenAI system prompts
│   └── utils.py       # Utility functions
├── logs/              # Application logs
├── main.py            # FastAPI application
├── Dockerfile         # Container definition
└── docker-compose.yml # Container orchestration

Architecture

System Architecture

graph TB
    Client[Client] --> API[FastAPI Application]
    API --> Redis[(Redis Queue)]
    API --> Logger[Logger System]
    
    subgraph Worker Processing
        Redis --> Worker[Background Worker]
        Worker --> PDFProcessor[PDF Processor]
        PDFProcessor --> OpenAI[OpenAI GPT-4V API]
        PDFProcessor --> Storage[File Storage]
    end
    
    Logger --> FileSystem[File System Logs]
    Logger --> Console[Console Output]
    
    Worker --> Callback[Callback URL]
    Worker --> Results[(Results Storage)]
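
The queue, worker, results storage, and callback path in the diagram above can be illustrated with an in-memory stand-in for Redis. This is a sketch of the pattern only: `InMemoryJobStore`, `run_worker_once`, the status strings, and the callback payload shape are hypothetical, and the real application stores jobs in Redis rather than in a Python dict.

```python
import uuid
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Optional


class InMemoryJobStore:
    """Stand-in for the Redis queue and results storage (illustration only)."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()
        self.jobs: Dict[str, Dict[str, Any]] = {}

    def submit(self, files: List[str], callback_url: Optional[str] = None) -> str:
        """Record a new job, enqueue it, and return its generated id."""
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "status": "pending",
            "files": files,
            "callback_url": callback_url,
            "result": None,
        }
        self.queue.append(job_id)
        return job_id

    def get(self, job_id: str) -> Dict[str, Any]:
        return self.jobs[job_id]


def run_worker_once(
    store: InMemoryJobStore,
    process: Callable[[List[str]], Any],
    notify: Callable[[str, Dict[str, Any]], None],
) -> Optional[str]:
    """Take one job off the queue, process it, store the outcome, fire the callback."""
    if not store.queue:
        return None
    job_id = store.queue.popleft()
    job = store.jobs[job_id]
    job["status"] = "processing"
    try:
        job["result"] = process(job["files"])
        job["status"] = "completed"
    except Exception as exc:  # a failed job is recorded rather than crashing the worker
        job["status"] = "failed"
        job["error"] = str(exc)
    if job["callback_url"]:
        payload = {"job_id": job_id, "status": job["status"], "result": job["result"]}
        notify(job["callback_url"], payload)
    return job_id
```

Injecting `process` and `notify` keeps the worker loop independent of the PDF processor and of whatever HTTP client delivers the callback.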

Workflow Diagram

sequenceDiagram
    participant C as Client
    participant A as FastAPI
    participant R as Redis
    participant W as Worker
    participant P as PDF Processor
    participant O as OpenAI API
    participant CB as Callback URL

    C->>A: POST /submit-job/ (PDF files)
    A->>A: Generate job_id
    A->>R: Store initial job status
    A->>C: Return job_id
    
    activate W
    W->>R: Poll for new jobs
    R-->>W: Job details
    W->>P: Process PDF
    
    loop Each Page
        P->>O: Send image for analysis
        O-->>P: Return structured data
        P->>P: Merge results
    end
    
    W->>R: Update job status
    
    opt If callback_url provided
        W->>CB: Send results
    end
    deactivate W
    
    C->>A: GET /job/{job_id}
    A->>R: Get job status
    R-->>A: Return results
    A->>C: Return job status/results
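
The "Each Page" loop in the workflow above sends one page image at a time for analysis and merges what comes back. One plausible merge strategy is sketched below, assuming each page yields a dict of lists keyed by category such as `courses`, `accommodations`, and `pricing`. The data shape, the `merge_page` and `process_pages` names, and the de-duplication rule are assumptions; the actual logic in `src/parser.py` may differ.

```python
from typing import Any, Callable, Dict, Iterable, List

PageData = Dict[str, List[Any]]  # assumed shape: category name -> extracted items


def merge_page(merged: PageData, page: PageData) -> PageData:
    """Append items from one page into the running result, skipping exact duplicates."""
    for key, items in page.items():
        bucket = merged.setdefault(key, [])
        for item in items:
            if item not in bucket:
                bucket.append(item)
    return merged


def process_pages(
    pages: Iterable[bytes],
    analyze: Callable[[bytes], PageData],
) -> PageData:
    """Run the analyzer on each page image in order and merge the per-page results."""
    merged: PageData = {}
    for image in pages:
        merge_page(merged, analyze(image))
    return merged
```

Here `analyze` stands in for the call to the OpenAI API; passing it as a parameter lets the merge logic be tested with canned responses.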

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

MIT License - see LICENSE file for details

Acknowledgments

  • OpenAI for GPT-4 Vision API
  • FastAPI for the web framework
  • PyMuPDF for PDF processing