A FastAPI application that processes PDF files containing language school information using OpenAI's GPT-4 Vision API. The application extracts structured data about courses, accommodations, and pricing.
Read Notion.md for more details.
- Asynchronous PDF processing with background jobs
- Redis-based job queue system
- Colored logging with file and console output
- Docker containerization
- Callback support for job completion notifications
- Structured data extraction using Pydantic models
- Automatic API documentation with Swagger UI
- Python 3.9+
- Docker and Docker Compose
- OpenAI API key
- Redis server
- Clone the repository:
git clone https://github.com/concaption/school-info-parser.git
cd school-info-parser
- Create and populate .env file:
OPENAI_API_KEY=your_api_key_here
REDIS_HOST=redis
- Build and run with Docker Compose:
docker-compose up --build
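As a rough illustration of how the two `.env` values above might be read at startup, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical names, not taken from the project, and the fallback to `redis` for `REDIS_HOST` is an assumption based on the example value.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two variables from the .env example."""
    openai_api_key: str
    redis_host: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the settings from a mapping (os.environ by default) and fail fast."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumption: default to "redis", the host value shown in the .env example.
    redis_host = env.get("REDIS_HOST", "redis")
    return Settings(openai_api_key=api_key, redis_host=redis_host)
```

Failing at startup when the key is missing tends to be easier to debug than a failed OpenAI call in the middle of a job.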
GET /
- Redirects to API documentation
POST /submit-job/
- Submit PDFs for processing
GET /job/{job_id}
- Check job status and results
- Access the API documentation:
http://localhost:8000/docs
- Submit a PDF file for processing:
curl -X POST "http://localhost:8000/submit-job/" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "files=@your_pdf_file.pdf"
- Check job status:
curl -X GET "http://localhost:8000/job/{job_id}"
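Because jobs run in the background, a client usually has to repeat the status check until the job finishes. The sketch below polls `GET /job/{job_id}` using only the standard library. It assumes the JSON response carries a `status` field whose terminal values are `completed` and `failed`; the actual response schema is not documented here, so adjust both to match. `fetch_job` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def fetch_job(job_id: str) -> dict:
    """GET /job/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/job/{job_id}") as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, max_attempts=30,
                 terminal=("completed", "failed")):
    """Poll until the job reaches a terminal status or the attempts run out.

    Assumption: the response has a "status" field and the terminal values
    are "completed" and "failed". Neither is confirmed by this README.
    """
    for attempt in range(max_attempts):
        job = fetch(job_id)
        if job.get("status") in terminal:
            return job
        if attempt < max_attempts - 1:
            time.sleep(interval)  # wait before the next poll
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Passing `fetch` as a parameter keeps the loop testable without a running server. If a callback URL is supplied at submission time, polling may not be needed at all.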
- Create a virtual environment:
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
- Install dependencies:
pip install -r requirements.txt
- Run tests:
pytest
school-info-parser/
├── src/
│ ├── parser.py # PDF processing logic
│ ├── schema.py # Pydantic models
│ ├── logger.py # Logging configuration
│ ├── prompts.py # OpenAI system prompts
│ └── utils.py # Utility functions
├── logs/ # Application logs
├── main.py # FastAPI application
├── Dockerfile # Container definition
└── docker-compose.yml # Container orchestration
graph TB
Client[Client] --> API[FastAPI Application]
API --> Redis[(Redis Queue)]
API --> Logger[Logger System]
subgraph Worker Processing
Redis --> Worker[Background Worker]
Worker --> PDFProcessor[PDF Processor]
PDFProcessor --> OpenAI[OpenAI GPT-4V API]
PDFProcessor --> Storage[File Storage]
end
Logger --> FileSystem[File System Logs]
Logger --> Console[Console Output]
Worker --> Callback[Callback URL]
Worker --> Results[(Results Storage)]
sequenceDiagram
participant C as Client
participant A as FastAPI
participant R as Redis
participant W as Worker
participant P as PDF Processor
participant O as OpenAI API
participant CB as Callback URL
C->>A: POST /submit-job/ (PDF files)
A->>A: Generate job_id
A->>R: Store initial job status
A->>C: Return job_id
activate W
W->>R: Poll for new jobs
R-->>W: Job details
W->>P: Process PDF
loop Each Page
P->>O: Send image for analysis
O-->>P: Return structured data
P->>P: Merge results
end
W->>R: Update job status
opt If callback_url provided
W->>CB: Send results
end
deactivate W
C->>A: GET /job/{job_id}
A->>R: Get job status
R-->>A: Return results
A->>C: Return job status/results
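The sequence above can be sketched in plain Python to make the job lifecycle concrete. This is an illustration under stated assumptions, not the project's implementation: a dict stands in for Redis, the status names (`pending`, `processing`, `completed`) are invented, `analyze_page` stands in for the OpenAI call, and the merge step simply concatenates lists per key. All function names are hypothetical.

```python
import uuid
from typing import Callable, Dict, List, Optional

# Assumption: a plain dict stands in for Redis; status names are illustrative.
JOBS: Dict[str, dict] = {}


def submit_job(pages: List[bytes], callback_url: Optional[str] = None) -> str:
    """API side: generate a job_id, store the initial status, return the id."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending", "pages": pages,
                    "callback_url": callback_url, "result": None}
    return job_id


def merge_results(merged: dict, page_data: dict) -> dict:
    """Combine per-page data by extending the list stored under each key."""
    for key, items in page_data.items():
        merged.setdefault(key, []).extend(items)
    return merged


def run_worker_once(analyze_page: Callable[[bytes], dict],
                    send_callback: Optional[Callable[[str, dict], None]] = None
                    ) -> Optional[str]:
    """Worker side: pick one pending job, process each page, store the result."""
    job_id = next((jid for jid, job in JOBS.items()
                   if job["status"] == "pending"), None)
    if job_id is None:
        return None  # nothing to do
    job = JOBS[job_id]
    job["status"] = "processing"
    merged: dict = {}
    for page in job["pages"]:
        # analyze_page stands in for sending the page image to the OpenAI API
        merged = merge_results(merged, analyze_page(page))
    job["status"], job["result"] = "completed", merged
    if job["callback_url"] and send_callback is not None:
        send_callback(job["callback_url"], {"job_id": job_id, "result": merged})
    return job_id


def get_job(job_id: str) -> dict:
    """API side: roughly what GET /job/{job_id} might return."""
    job = JOBS[job_id]
    return {"job_id": job_id, "status": job["status"], "result": job["result"]}
```

The client gets a `job_id` back immediately from `submit_job`, while the slow per-page work happens in `run_worker_once`, which mirrors the split between the FastAPI process and the background worker in the diagram.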
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
MIT License - see LICENSE file for details
- OpenAI for GPT-4 Vision API
- FastAPI for the web framework
- PyMuPDF for PDF processing