⚠️ WARNING: This server is currently in alpha stage and is intended for local development and testing only. It should not be deployed in production environments or exposed to untrusted networks. Please review SECURITY.md for important security considerations before running this server.
MCP server providing tools to extract text, metadata, and layout from documents (primarily PDFs) and to search them. This server implements the Model Context Protocol (MCP) specification, allowing AI models to interact with PDF documents through a standardized interface.
The Document Understanding MCP Server provides a set of tools for extracting information from PDF documents, including:
- Text content extraction (with OCR fallback for scanned documents)
- Metadata extraction (author, title, creation date, etc.)
- Layout information extraction (text blocks, images, drawings)
- Table extraction
- Image extraction
- Document outline/bookmarks extraction
- Text search functionality
- Language detection
These tools can be used by AI models to analyze and understand PDF documents, enabling more sophisticated document processing workflows.
Future Plans: We're working on expanding support to multiple document types beyond PDF. See docs/plans/README.md for details on upcoming architectural changes.
- Clone the repository:
git clone <repo-url>
cd document-understanding-mcp-server
- Install Python: Ensure you have Python 3.11+ installed.
- Install `uv`: It's recommended to use `uv` for environment management.
pip install uv
- Create and Sync Virtual Environment:
# Create .venv
uv venv
# Install core dependencies
uv pip sync pyproject.toml
# For development/testing, install optional extras:
uv pip install -e '.[dev,test]'
- (Required for Table Extraction) Install Java: The `extract_tables` tool relies on `tabula-py`, which requires a Java Runtime Environment (JRE). Install a JRE (e.g., OpenJDK) suitable for your system and ensure `java` is accessible in your system's PATH.
- (Required for OCR Fallback) Install Tesseract: The OCR fallback in `extract_pdf_contents` relies on `pytesseract`, which requires the Tesseract OCR engine itself. Install Tesseract for your OS (see Tesseract Installation) and ensure the `tesseract` command is in your system's PATH.
- Configure Base Path (Mandatory unless overridden): Define the `DOCUMENT_UNDERSTANDING_BASE_PATH` environment variable. This is the only directory the server will read PDFs from by default. Example:
export DOCUMENT_UNDERSTANDING_BASE_PATH=/path/to/your/pdf_working_directory
- `DOCUMENT_UNDERSTANDING_BASE_PATH` (Environment Variable, Required): Specifies the absolute path to the directory the server is allowed to access for reading PDFs. The server will refuse to start if this is not set, unless `--allow-any-path` is used.
- `DOCUMENT_UNDERSTANDING_LOG_LEVEL` (Environment Variable): Sets the minimum log level for console output (e.g., `DEBUG`, `INFO`, `WARNING`). Default=`INFO`.
- `DOCUMENT_UNDERSTANDING_LOG_FORMAT` (Environment Variable): Sets console log format (`plain` or `json`). Default=`plain`.
- `DOCUMENT_UNDERSTANDING_LOG_FILE` (Environment Variable): If set, specifies a path to write structured JSON logs (e.g., `.tmp/logs/server.log`). Directory will be created. Default=Disabled.
- `DOCUMENT_UNDERSTANDING_LOG_FILE_LEVEL` (Environment Variable): Minimum level for file logging (e.g., `DEBUG`, `INFO`). Default=`INFO`.
- `DOCUMENT_UNDERSTANDING_DEFAULT_LANG` (Environment Variable): Sets the default OCR language (e.g., `eng`, `fra`). Defaults to `eng` if not set.
- `ENABLE_SAVE_IMAGES_TO_FILES` (Environment Variable): When set to `true`, enables saving extracted images to files when using the `extract_images` tool with the `output_directory` parameter. Default=`false`.
- `SAFE_OUTPUT_DIRECTORIES` (Environment Variable): Colon-separated list of directories where images can be saved when `ENABLE_SAVE_IMAGES_TO_FILES=true` but `ALLOW_ANY_PATH=false`. For example: `/tmp:/var/output:/data/images`.
- Capability Override / Enable Flags (Command-line):
  - `--allow-no-java`: If the Java runtime (required for `extract_tables`) is not found, the server will normally exit. Use this flag to allow startup, but the `extract_tables` tool will be disabled.
  - `--allow-no-tesseract`: If the Tesseract executable (required for OCR fallback) is not found, the server will normally exit. Use this flag to allow startup, but OCR fallback in `extract_pdf_contents` will be disabled.
  - `--enable-experimental`: Enables experimental tools (like `find_nearby_content`) which are disabled by default.
  - `--allow-any-path`: (SECURITY RISK) If set, disables the `DOCUMENT_UNDERSTANDING_BASE_PATH` restriction and allows the server to attempt reading files from any path provided in tool arguments. Use with extreme caution and only in trusted environments.
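The environment variables above could be gathered into a single settings object at startup. The following is a minimal sketch, not the server's actual code: the `ServerSettings` class and `load_settings` function are hypothetical names, and treating `true` case-insensitively is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical container for the documented environment variables."""
    base_path: Path | None
    log_level: str
    default_lang: str
    save_images: bool
    safe_output_dirs: tuple[Path, ...]


def load_settings(env: Mapping[str, str], allow_any_path: bool = False) -> ServerSettings:
    """Build settings from an environment mapping (e.g. os.environ)."""
    raw_base = env.get("DOCUMENT_UNDERSTANDING_BASE_PATH")
    # Documented rule: refuse to start without a base path unless
    # --allow-any-path was given.
    if not raw_base and not allow_any_path:
        raise RuntimeError(
            "DOCUMENT_UNDERSTANDING_BASE_PATH must be set unless --allow-any-path is used"
        )
    # SAFE_OUTPUT_DIRECTORIES is a colon-separated list; skip empty entries.
    safe_dirs = tuple(
        Path(p) for p in env.get("SAFE_OUTPUT_DIRECTORIES", "").split(":") if p
    )
    return ServerSettings(
        base_path=Path(raw_base) if raw_base else None,
        log_level=env.get("DOCUMENT_UNDERSTANDING_LOG_LEVEL", "INFO").upper(),
        default_lang=env.get("DOCUMENT_UNDERSTANDING_DEFAULT_LANG", "eng"),
        # Assumption: the value is compared case-insensitively against "true".
        save_images=env.get("ENABLE_SAVE_IMAGES_TO_FILES", "false").lower() == "true",
        safe_output_dirs=safe_dirs,
    )
```

Passing `os.environ` as `env` would read the real process environment; taking a mapping as a parameter keeps the function easy to exercise in tests.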
The availability or full functionality of certain tools depends on external dependencies or flags.
| Tool Name | Capability | Dependency / Trigger | Check Performed At | Control Flag |
|---|---|---|---|---|
| `extract_tables` | `java_runtime` | Java Runtime (JRE) | Server Startup | `--allow-no-java` |
| `extract_pdf_contents` | `tesseract_ocr` (Opt.) | Tesseract OCR | Server Startup | `--allow-no-tesseract` |
| (Other Tools) | (None) | Python Packages | Installation | (N/A) |
Note: If Tesseract OCR is unavailable (and `--allow-no-tesseract` is used), OCR fallback is disabled.
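The startup checks in the table can be pictured as a lookup of each executable on the PATH followed by a decision per flag. This is a rough sketch under stated assumptions: `detect_capabilities` and `resolve_startup` are hypothetical helpers, and the server's real detection logic may differ (for example, it may run `java -version` instead of only checking the PATH).

```python
import shutil
from typing import Callable, Dict, List, Optional


def detect_capabilities(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report whether the external executables are found on the PATH."""
    return {
        "java_runtime": which("java") is not None,
        "tesseract_ocr": which("tesseract") is not None,
    }


def resolve_startup(
    caps: Dict[str, bool],
    allow_no_java: bool = False,
    allow_no_tesseract: bool = False,
) -> List[str]:
    """Return the features to disable, or exit if a dependency is missing
    and the matching override flag was not passed."""
    disabled: List[str] = []
    if not caps["java_runtime"]:
        if not allow_no_java:
            raise SystemExit("Java not found; pass --allow-no-java to start without extract_tables")
        disabled.append("extract_tables")
    if not caps["tesseract_ocr"]:
        if not allow_no_tesseract:
            raise SystemExit("Tesseract not found; pass --allow-no-tesseract to start without OCR fallback")
        # The tool itself stays available; only its OCR fallback is turned off.
        disabled.append("ocr_fallback")
    return disabled
```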
The server implements the following tools. The list presented to the LLM via the list_tools call is dynamic and depends on detected capabilities (e.g., Java availability) and startup flags.
See src/document_understanding/models.py for response schemas (e.g., TextContentResponse, MetadataResponse, etc.). Error responses follow the ErrorResponse schema.
Guidance: The pdf_path argument in all tools must refer to a file path accessible within the server's environment.
- Description: Extracts text content from specified pages of a local PDF file. Uses direct text extraction with OCR fallback for scanned or image-heavy pages.
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `pages` (string, optional): Page spec (1-based, ranges, negative indices). E.g., `"1, 3-5, -1"`. Default=all.
  - `ocr_language` (string, optional): OCR language(s) for Tesseract (e.g., `"eng"`, `"fra+eng"`). Default='eng'.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `TextContentResponse` JSON.
- Guidance: Best for getting the raw text of pages. Handles OCR automatically if needed. Use `pages` for specific sections. If dealing with non-English scanned text, set `ocr_language`.
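To illustrate how a page spec such as `"1, 3-5, -1"` might be interpreted, here is a small sketch. `parse_page_spec` is a hypothetical helper, not the server's parser; it assumes that `-1` means the last page, that duplicates are dropped, and that out-of-range pages raise `ValueError`.

```python
from typing import List, Optional


def parse_page_spec(spec: Optional[str], page_count: int) -> List[int]:
    """Expand a page spec into 1-based page numbers, in the order given."""
    if spec is None or spec.strip().lower() in ("", "all"):
        return list(range(1, page_count + 1))
    pages: List[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("-"):
            # Negative index: -1 is the last page, -2 the one before it.
            start = end = page_count + 1 + int(token)
        elif "-" in token:
            # Inclusive range such as "3-5".
            first, last = token.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(token)
        if not (1 <= start <= end <= page_count):
            raise ValueError(f"page spec {token!r} is out of range for {page_count} pages")
        for page in range(start, end + 1):
            if page not in pages:
                pages.append(page)
    return pages


print(parse_page_spec("1, 3-5, -1", page_count=10))  # [1, 3, 4, 5, 10]
```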
- Description: Extracts metadata (author, title, dates) and checks for images/drawings in a PDF.
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `MetadataResponse` JSON.
- Guidance: Use this first to get an overview of the PDF structure and properties (page count, title, author, etc.) and a hint about whether images/drawings are present.
- Description: Searches for exact text occurrences within specified pages and returns bounding boxes.
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `query` (string, required): The text to search for (case-sensitive).
  - `pages` (string, optional): Page spec. Default=all.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `SearchResponse` JSON containing a list of matches with page number and rectangle coordinates.
- Guidance: Useful for finding specific terms or phrases and their location.
- Description: Extracts detailed layout info: text blocks, drawings, image placements (with coordinates).
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `pages` (string, optional): Page spec. Default=all.
  - `include_images` (boolean, optional): Include image info (xref, bbox, size). Default=False.
  - `include_drawings` (boolean, optional): Include vector drawing info. Default=False.
  - `detail_level` (string, optional): Text detail level ('blocks', 'lines', 'words'). Default='blocks'.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `LayoutResponse` JSON containing layout details per page.
- Guidance: Use when spatial relationships are important or to understand page structure. Provides coordinates for text, images, and vector drawings. Can generate large responses for complex pages.
- Description: Extracts info about images (raster images, some other objects like Form XObjects).
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `pages` (string, optional): Page spec. Default=all.
  - `include_data` (boolean, optional): If true, include base64 image data. Default=false.
  - `min_width` (integer, optional): Minimum image width to include in results.
  - `min_height` (integer, optional): Minimum image height to include in results.
  - `filter_bbox` (array, optional): Bounding box to filter images by [x0, y0, x1, y1].
  - `password` (string, optional): Password for encrypted PDFs.
  - `output_directory` (string, optional): Directory to save extracted images to. Requires the `ENABLE_SAVE_IMAGES_TO_FILES=true` environment variable.
  - `save_without_returning_data` (boolean, optional): If true, save images to files without returning base64 data in response. Default=false.
- Returns: `ImageExtractionResponse` JSON.
- Guidance: Returns image dimensions, page number, and internal reference (`xref`). Bounding box (`bbox`) is optional and may be missing if detection fails (e.g., for Form XObjects). Use `include_data=True` cautiously as it can return very large responses. When `output_directory` is specified and `ENABLE_SAVE_IMAGES_TO_FILES=true`, images will be saved to disk and the response will include file paths. Use `save_without_returning_data=True` to save images to files without including the potentially large base64 data in the response.
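The size and bounding-box filters can be thought of as simple predicates over the extracted image records. The sketch below is illustrative only: `ImageInfo` and `filter_images` are hypothetical, and it assumes that `filter_bbox` keeps images whose box overlaps the filter box and drops images with no `bbox`. The server may instead require full containment or treat missing boxes differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class ImageInfo:
    """Hypothetical record mirroring the fields named in the guidance."""
    xref: int
    page: int
    width: int
    height: int
    bbox: Optional[BBox] = None


def _overlaps(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filter_images(
    images: Sequence[ImageInfo],
    min_width: int = 0,
    min_height: int = 0,
    filter_bbox: Optional[BBox] = None,
) -> List[ImageInfo]:
    kept: List[ImageInfo] = []
    for img in images:
        if img.width < min_width or img.height < min_height:
            continue
        if filter_bbox is not None:
            # Assumption: an image without a detected bbox cannot match.
            if img.bbox is None or not _overlaps(img.bbox, filter_bbox):
                continue
        kept.append(img)
    return kept
```

Filtering small images with `min_width` and `min_height` is a cheap way to skip icons and decorative elements before requesting base64 data.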
- Description: Extracts tables from specified PDF pages into lists of lists.
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `pages` (string, optional): `'all'` or comma-separated page numbers (1-based). Default=all.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `TableExtractionResponse` JSON. (Note: Current implementation returns list of dicts, see docs/KNOWN_ISSUES.md).
- Guidance: Best effort table detection using `tabula-py`. Results depend heavily on table formatting (lines, spacing). May be slow. Note: This tool requires a Java runtime. If Java is not detected at server startup and the `--allow-no-java` flag was used, this tool will be unavailable.
- Description: Detects the language(s) of text content sampled from specified pages (defaults to page 1).
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `pages` (string, optional): Page spec for sampling. Default='1'.
  - `sample_size` (integer, optional): Max characters to sample. Default=2000.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `LanguageDetectionResponse` JSON.
- Guidance: Useful for determining the primary language before processing or translation.
- Description: Extracts the document outline (bookmarks/table of contents) from a PDF.
- Arguments:
  - `pdf_path` (string, required): Path to the local PDF file.
  - `password` (string, optional): Password for encrypted PDFs.
- Returns: `OutlineExtractionResponse` JSON containing the hierarchical outline structure.
- Guidance: Use to extract the document's table of contents or bookmark structure. Returns an empty list if the document has no outline.
- Description: Returns the designated directory path where PDF files should be placed for processing.
- Arguments: None.
- Returns: JSON with `status`, `working_directory` (string or null), and optional `message`.
- Guidance: Call this tool to find out where to upload/place PDF files before calling other tools that require a `pdf_path`. If the server was started with `--allow-any-path`, `working_directory` will be null.
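A client can check on its side that a `pdf_path` stays inside the reported working directory before calling a tool. The following is a minimal sketch of such a confinement check using `pathlib`; `resolve_pdf_path` is a hypothetical helper and is not claimed to match the server's own validation.

```python
from pathlib import Path


def resolve_pdf_path(pdf_path: str, working_directory: str) -> Path:
    """Resolve pdf_path against the working directory and reject escapes."""
    base = Path(working_directory).resolve()
    # Joining with an absolute pdf_path discards base, so absolute paths
    # outside the working directory are caught by the check below too.
    candidate = (base / pdf_path).resolve()
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise PermissionError(f"{pdf_path!r} is outside the working directory")
    return candidate
```

Resolving before comparing means `sub/../mydoc.pdf` is accepted while `../outside.pdf` is rejected.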
The test suite uses dynamically generated PDFs for all tests. No static PDF files are stored in the repository. The test fixtures in tests/conftest.py generate all necessary test PDFs at runtime, including:
- Simple text documents
- Documents with images
- Documents with drawings
- Documents with tables
- Documents with outlines/bookmarks
- Empty documents
- Encrypted documents
This approach ensures that all tests can run without requiring external files and makes the test suite more portable and self-contained.
# Run all tests
python -m pytest
# Run specific test categories
python -m pytest tests/unit
python -m pytest tests/integration
# Run tests with coverage
python -m pytest --cov=src
The test suite includes several types of tests:
- Unit Tests: Test individual components in isolation
- Integration Tests: Test the interaction between components
- End-to-End Tests: Test the entire system from the user's perspective
The end-to-end tests use the MCP command-line interface to interact with the server, simulating real-world usage. This approach ensures that the server works correctly in various scenarios and with different types of PDFs.
The server communicates over the standard MCP stdio transport. There are multiple ways to run it:
# Set required base path (if not using sandbox mode)
export DOCUMENT_UNDERSTANDING_BASE_PATH=/path/to/pdf/storage
# Run with default settings (sandbox mode)
bin/document-understanding-mcp-server
# Run with sandbox mode disabled (less restricted)
DOCUMENT_UNDERSTANDING_SANDBOX=false bin/document-understanding-mcp-server
# Run with additional arguments
bin/document-understanding-mcp-server --enable-experimental --port 8000
The document-understanding-mcp-server executable automatically sets up a secure environment with:
- User-specific isolated directories in sandbox mode (default)
- Appropriate permissions based on security context
- Intelligent argument handling based on environment
When the package is installed via pip, you can use the entry point directly:
# Run with default settings
document-understanding-mcp-server
# Run with custom arguments
document-understanding-mcp-server --allow-any-path --enable-experimental
# Activate environment
source .venv/bin/activate
# Set required base path
export DOCUMENT_UNDERSTANDING_BASE_PATH=/path/to/pdf/storage
# Run directly (will fail if Java/Tesseract missing and flags not used)
python standalone_server.py
# Run allowing missing Java/Tesseract but restricting path
python standalone_server.py --allow-no-java --allow-no-tesseract
# Run allowing ANY path access (UNSAFE)
# DOCUMENT_UNDERSTANDING_BASE_PATH is ignored here
python standalone_server.py --allow-any-path
# Enable experimental features
python standalone_server.py --enable-experimental
The server supports a secure sandbox mode (enabled by default) that:
- Creates isolated user-specific directories for file output
- Sets stricter file permissions (700 - user access only)
- Disables the `--allow-any-path` flag for better security
To disable sandbox mode (not recommended for production):
export DOCUMENT_UNDERSTANDING_SANDBOX=false
{
"mcpServers": {
"document-understanding": {
"command": "/path/to/document-understanding-mcp-server",
"args": [
"--enable-experimental"
]
}
}
}
{
"mcpServers": {
"document-understanding": {
"command": "/path/to/your/.venv/bin/python",
"args": [
"-m", "document_understanding.cli"
],
"env": {
"DOCUMENT_UNDERSTANDING_BASE_PATH": "/path/to/pdf/storage"
}
}
}
}
- File Paths: Before processing a PDF, use `get_pdf_working_directory` to find the allowed directory. Ensure any `pdf_path` arguments you provide refer to files within that directory (or subdirectories). Path traversal attempts (`../`) or absolute paths outside the working directory will be rejected unless the server was unsafely started with `--allow-any-path`.
- Passwords: Currently, the server cannot handle password-protected PDFs for most tools (except table extraction). Processing encrypted files will likely fail. See docs/KNOWN_ISSUES.md.
- Tool Availability: The list of available tools might change depending on the server's environment (e.g., if Java is missing and `--allow-no-java` was used, `extract_tables` will not be listed). Rely on the response from `list_tools` to know what is currently usable.
- Use specific tools: Prefer `search_pdf_text` over extracting all content if you only need specific occurrences. Use `extract_pdf_metadata` if only metadata is needed.
- Handle JSON: Expect JSON responses. Parse the `text` field of the `ToolResponseContent` as JSON.
- Check `status`: Always check the `status` field in the JSON response. If it's `"error"`, use the `message` and `error_code` fields for diagnostics (Note: errors are typically raised as exceptions by the server, the MCP client translates these).
- Layout Information: The `extract_pdf_layout` tool provides rich structural information (text coordinates, drawings, image locations). This can be used for complex analysis (e.g., with `find_nearby_content`) but generates large responses.
- Page Specification: Use the `pages` parameter effectively (e.g., "1-3, 5, -1") to limit processing when needed.
- OCR Language: If processing non-English documents, specify the language(s) using `ocr_language` (e.g., `"fra+eng"` for French and English) for better accuracy.
- Error Handling: Be prepared for `FileNotFoundError` if the `pdf_path` is incorrect, or `ValueError` for invalid page specifications or processing issues. Encrypted files may cause various errors. The server raises exceptions which the MCP client should surface.
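The "Handle JSON" and "Check status" guidance can be combined into one small client-side helper. This is a sketch under assumptions: `ToolError` and `parse_tool_text` are hypothetical names, and only the `status`, `message`, and `error_code` fields mentioned above are relied on.

```python
import json
from typing import Any, Dict, Optional


class ToolError(Exception):
    """Hypothetical exception carrying the documented error fields."""

    def __init__(self, message: str, error_code: Optional[str] = None):
        super().__init__(message)
        self.error_code = error_code


def parse_tool_text(text: str) -> Dict[str, Any]:
    """Decode the text field of a tool response and surface error statuses."""
    payload = json.loads(text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object in the tool response")
    if payload.get("status") == "error":
        raise ToolError(payload.get("message", "unknown error"), payload.get("error_code"))
    return payload
```

Raising on `"error"` keeps the calling code on the success path and leaves diagnostics in one place.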
- Extracting French Text from Scanned Pages 2 & 3:
  - `get_pdf_working_directory` -> Check response for `working_directory`.
  - (LLM/User ensures `mydoc.pdf` is in the working directory)
  - `extract_pdf_contents` with `pdf_path="mydoc.pdf"`, `pages="2-3"`, `ocr_language="fra"`.
  - Parse response, process text from the `pages` list.
- Finding Text near an Image:
  - `get_pdf_working_directory` -> Get `working_directory`.
  - (LLM/User places `report.pdf`)
  - `extract_pdf_layout` with `pdf_path="report.pdf"`, `pages="5"` (assuming image is on page 5).
  - Parse layout response. Identify the desired image based on `xref`, size, or rough `bbox`.
  - `find_nearby_content` (if enabled) with `pdf_path="report.pdf"`, `pages="5"`, `target_bbox` set to the image's bbox, `content_type="text_block"`, `direction="below"` (e.g., to find caption).
  - Parse nearby response for the caption text.
- Extracting Specific Tables after Metadata Check:
  - `get_pdf_working_directory` -> Get `working_directory`.
  - (LLM/User places `data.pdf`)
  - `extract_pdf_metadata` with `pdf_path="data.pdf"`.
  - Check response: confirm `page_count`, look at metadata fields.
  - `extract_tables` with `pdf_path="data.pdf"`, `pages="2, 4"` (e.g., only pages 2 and 4).
  - Parse response.
- Extract Specific Pages:
  - Call `extract_pdf_contents` with `pdf_path` and `pages="2-3"`.
  - Parse JSON response, check `status`, process the extracted content.
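At the protocol level, each step in the workflows above is an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below builds the two requests for the French OCR workflow; `tool_call` is a hypothetical helper, the `initialize` handshake that an MCP session needs first is omitted, and in practice an MCP client library would normally construct these messages.

```python
import json
from typing import Any, Dict, Optional


def tool_call(request_id: int, name: str, arguments: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request that invokes one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# Step 1: ask where PDFs must be placed.
# Step 2: run OCR-backed extraction on pages 2 and 3 in French.
requests = [
    tool_call(1, "get_pdf_working_directory"),
    tool_call(2, "extract_pdf_contents",
              {"pdf_path": "mydoc.pdf", "pages": "2-3", "ocr_language": "fra"}),
]

# Over the stdio transport each message is sent as a single line of JSON.
for request in requests:
    print(json.dumps(request))
```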
For a list of current known issues and limitations, see docs/KNOWN_ISSUES.md.
For a list of previously resolved issues, see docs/SOLVED_ISSUES.md.
For information about planned enhancements and future development, see the plans directory.
For information on how to contribute to this project, please see the CONTRIBUTING.md file.
This project is licensed under the terms specified in the LICENSE file.