PDF to Questions Generator

A simple Mastra template that processes PDF files and generates comprehensive questions from their content using OpenAI GPT-4o.

Overview

This template demonstrates a straightforward workflow:

Input: PDF URL
Download: Fetch the PDF file
Extract Text: Parse PDF using pure JavaScript (no system dependencies!)
Generate Questions: Create questions using OpenAI GPT-4o

Prerequisites

Node.js 20.9.0 or higher
OpenAI API key (that's it!)

Setup

Clone and install dependencies:

git clone <repository-url>
cd template-pdf-questions
pnpm install

Set up environment variables:

cp env.example .env
# Edit .env and add your OpenAI API key

Run the example:

export OPENAI_API_KEY="your-real-api-key-here"
npx tsx example.ts

Usage

Using the Workflow

import { mastra } from './src/mastra/index';

const run = await mastra.getWorkflow('pdfToQuestionsWorkflow').createRunAsync();

// Using a PDF URL
const result = await run.start({
  inputData: {
    pdfUrl: 'https://example.com/document.pdf',
  },
});

console.log(result.result.questions);

Using the PDF Questions Agent

import { mastra } from './src/mastra/index';

const agent = mastra.getAgent('pdfQuestionsAgent');

// The agent can handle the full process with natural language
const response = await agent.stream([
  {
    role: 'user',
    content: 'Please download this PDF and generate questions from it: https://example.com/document.pdf',
  },
]);

for await (const chunk of response.textStream) {
  console.log(chunk);
}

Using Individual Tools

import { mastra } from './src/mastra/index';
import { pdfFetcherTool } from './src/mastra/tools/pdf-fetcher-tool';
import { textExtractorTool } from './src/mastra/tools/text-extractor-tool';
import { questionGeneratorTool } from './src/mastra/tools/question-generator-tool';

// Step 1: Download PDF
const pdfResult = await pdfFetcherTool.execute({
  context: { pdfUrl: 'https://example.com/document.pdf' },
  runtimeContext: new RuntimeContext(),
});

// Step 2: Extract text
const textResult = await textExtractorTool.execute({
  context: { pdfBuffer: pdfResult.pdfBuffer },
  runtimeContext: new RuntimeContext(),
});

// Step 3: Generate questions
const questionsResult = await questionGeneratorTool.execute({
  context: {
    extractedText: textResult.extractedText,
    maxQuestions: 10
  },
  mastra,
  runtimeContext: new RuntimeContext(),
});

console.log(questionsResult.questions);

Expected Output

{
  status: 'success',
  result: {
    questions: [
      "What is the main objective of the research presented in this paper?",
      "Which methodology was used to collect the data?",
      "What are the key findings of the study?",
      // ... more questions
    ],
    success: true
  }
}

Architecture

Components

pdfToQuestionsWorkflow: Main workflow orchestrating the process
questionGeneratorAgent: Mastra agent specialized in generating educational questions
pdfQuestionsAgent: Complete agent that can handle the full PDF to questions pipeline
simpleOCR: Pure JavaScript PDF text extraction (no system dependencies)

Tools

pdfFetcherTool: Downloads PDF files from URLs and returns buffers
textExtractorTool: Extracts text from PDF buffers using OCR
questionGeneratorTool: Generates comprehensive questions from extracted text

Workflow Steps

download-pdf: Downloads PDF from provided URL
extract-text: Extracts text using JavaScript PDF parser (pdf2json)
generate-questions: Creates comprehensive questions using the question generator agent

Features

✅ Zero System Dependencies: Pure JavaScript solution
✅ Simple Setup: Only requires OpenAI API key
✅ Fast Text Extraction: Direct PDF parsing (no OCR needed for text-based PDFs)
✅ Educational Focus: Generates comprehensive learning questions
✅ Multiple Interfaces: Workflow, Agent, and individual tools available

How It Works

Text Extraction Strategy

This template uses a pure JavaScript approach that works for most PDFs:

Text-based PDFs (90% of cases): Direct text extraction using pdf2json
- ⚡ Fast and reliable
- 🔧 No system dependencies
- ✅ Works out of the box
Scanned PDFs: Would require OCR, but most PDFs today contain embedded text

Why This Approach?

Simplicity: No GraphicsMagick, ImageMagick, or other system tools needed
Speed: Direct text extraction is much faster than OCR
Reliability: Works consistently across different environments
Educational: Easy for developers to understand and modify
Single Path: One clear workflow with no complex branching

Configuration

Environment Variables

OPENAI_API_KEY=your_openai_api_key_here

Customization

You can customize the question generation by modifying the questionGeneratorAgent:

export const questionGeneratorAgent = new Agent({
  name: 'Question Generator Pro',
  instructions: `
    You are an expert educational content creator...
    // Customize instructions here
  `,
  model: openai('gpt-4o'),
});

Development

Project Structure

src/mastra/
├── agents/
│   └── questionGeneratorAgent.ts    # Question generation agent
├── tools/
│   └── simpleOCR.ts                 # Pure JavaScript PDF parser
├── workflows/
│   └── pdfToQuestionsWorkflow.ts    # Main workflow
└── index.ts                         # Mastra configuration

Testing

# Run with a test PDF
export OPENAI_API_KEY="your-api-key"
npx tsx example.ts

Common Issues

"OPENAI_API_KEY is not set"

Make sure you've set the environment variable
Check that your API key is valid and has sufficient credits

"Failed to download PDF"

Verify the PDF URL is accessible and publicly available
Check network connectivity
Ensure the URL points to a valid PDF file
Some servers may require authentication or have restrictions

"No text could be extracted"

The PDF might be password-protected
Very large PDFs might take longer to process
Scanned PDFs without embedded text won't work (rare with modern PDFs)

"Context length exceeded" or Token Limit Errors

Solution: Use a smaller PDF file (under ~5-10 pages)
Automatic Truncation: The tool automatically uses only the first 4000 characters for very large documents
Helpful Errors: Clear messages guide you to use smaller PDFs when needed

What Makes This Template Special

🎯 True Simplicity

Single dependency for PDF processing (pdf2json)
No system tools or complex setup required
Works immediately after pnpm install
Multiple usage patterns (workflow, agent, tools)

⚡ Performance

Direct text extraction (no image conversion)
Much faster than OCR-based approaches
Handles reasonably-sized documents efficiently

🔧 Developer-Friendly

Pure JavaScript/TypeScript
Easy to understand and modify
Clear separation of concerns
Simple error handling with helpful messages

📚 Educational Value

Generates multiple question types
Covers different comprehension levels
Perfect for creating study materials

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

bookercodes/template-pdf-questions