Your go-to tool for converting and parsing documents into clean, well-structured Markdown!
Fast, intuitive, and entirely local 💻🚀.
- ✨ All-in-One Parsing: Supports TXT, DOCX, PDF, PPT, XLSX, and more, and even processes images inside documents.
- 🖼️ Visual Content Extraction: Utilizes Llama 3.2 Vision to detect images, tables, charts, and diagrams, converting them into rich Markdown.
- 🏗️ Built with Marker: Extends the open-source Marker parser to handle complex content types locally.
- 🛡️ Local-First Privacy: No cloud, no external servers; all processing happens on your machine.
- Parsing & Conversion
  - Parses and converts multiple file types (.txt, .docx, .pdf, .ppt, .xlsx, etc.) into Markdown.
  - Leverages Marker for accurate and efficient parsing of both text and visual elements.
  - Extracts images, charts, and tables, embedding them in Markdown.
  - (Optional) Converts documents into PDFs using LibreOffice for easy viewing.
- Visual Analysis
  - Distinguishes logos from content-rich images.
  - Extracts and preserves the original language from images.
  - Uses multiple agents to extract useful information from the images.
- Fast & Efficient
  - Supports parallel processing for faster handling of large folders.
- Streamlit GUI
  - A user-friendly interface to upload and parse single files, multiple files at once, or entire directories.
  - Download results directly from the GUI.
- Features
- Prerequisites
- Installation Options
- Basic Usage
- Advanced Usage
- Output Structure
- Code Example
- Contributing
- License
- Acknowledgments
- 📄 Document Conversion: Converts .txt, .docx, and other supported file types into .pdf using LibreOffice (optional if you only need to parse PDFs).
- 📊 Page Counting: Automatically counts pages in PDFs using PyPDF2.
- 🖼️ Image Processing: Analyzes images to differentiate logos from content-rich images. Extracts relevant data and updates the corresponding Markdown file.
- ✍️ Markdown Parsing: Uses Marker to generate clean, structured Markdown files from parsed PDFs.
- 🌐 Multilingual Support: Maintains the original language of the content during extraction.
- 📈 Data Visualization: Generates analysis plots based on the page counts of processed documents.
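
The optional document conversion step can be pictured as a headless LibreOffice call. The sketch below is illustrative only and is not LlaMarker's own code: the helper names `build_convert_command` and `convert_to_pdf` are hypothetical, and it assumes the LibreOffice launcher is available as `soffice` on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, soffice="soffice"):
    """Return the argument list for a headless LibreOffice PDF conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_to_pdf(src, out_dir):
    """Convert one document to PDF; files that are already PDFs are returned as-is."""
    src, out_dir = Path(src), Path(out_dir)
    if src.suffix.lower() == ".pdf":
        return src  # already a PDF, nothing to convert
    if shutil.which("soffice") is None:
        # Assumption: the launcher is called "soffice" on this system.
        raise RuntimeError("soffice not found on PATH; install LibreOffice first")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, out_dir), check=True)
    # LibreOffice writes <stem>.pdf into the output directory.
    return out_dir / (src.stem + ".pdf")
```

Keeping the command construction separate from the subprocess call makes it easy to inspect or log the exact command before running it.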
Before installing or running LlaMarker, please ensure you meet the following requirements:
- Python 3.10+
  - Core language for running LlaMarker.
  - Verify your Python version:
    python --version
- Marker
  - Marker is an open-source parser that LlaMarker extends.
  - To install Marker, follow these steps:
    - Clone the repository:
      git clone https://github.com/VikParuchuri/marker.git
      cd marker
    - Install Marker in editable mode:
      pip install -e .
    - Verify the installation:
      marker --help
  - GPU Support: If you plan to leverage GPUs, ensure PyTorch is installed with CUDA support (e.g., via pytorch-cuda or the official PyTorch distribution).
  - Path Configuration: If Marker is not in your PATH, specify its location with the --marker_path argument.
- LibreOffice
  - Required for converting .docx, .ppt, .xlsx, etc., into .pdf before parsing.
  - Linux (Ubuntu/Debian example):
    sudo apt update
    sudo apt install libreoffice
  - Windows: Download the installer and consider adding LibreOffice to your system PATH.
  - macOS:
    - Download from LibreOffice’s website, or
    - Use Homebrew:
      brew install --cask libreoffice
- Ollama & Vision Models
  - Install Ollama for your OS.
  - Pull the required model:
    ollama pull llama3.2-vision
  - Run a quick test to confirm the model is set up correctly.
- Poetry (for local development only)
  - Recommended dependency manager if you’re cloning the repository to develop or modify LlaMarker.
  - Linux/Mac:
    curl -sSL https://install.python-poetry.org | python3 -
    # (If not added to PATH automatically)
    export PATH="$HOME/.local/bin:$PATH"
  - macOS (Homebrew):
    brew install poetry
  - Windows: Follow the instructions on Poetry’s official site.
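
To check the prerequisites above in one go, a small script along these lines may help. It is a sketch, not part of LlaMarker: the function `check_prerequisites` is hypothetical, and the executable names are assumptions (`marker` and `ollama` follow the commands shown above; LibreOffice is commonly exposed as `soffice` or `libreoffice`).

```python
import shutil
import sys

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = {
    "Marker": ("marker",),
    "LibreOffice": ("soffice", "libreoffice"),
    "Ollama": ("ollama",),
}


def check_prerequisites(which=shutil.which, version=None):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    version = sys.version_info[:2] if version is None else version
    report = {"Python 3.10+": tuple(version) >= (3, 10)}
    for name, candidates in REQUIRED_TOOLS.items():
        # A tool counts as present if any of its candidate names is on PATH.
        report[name] = any(which(exe) is not None for exe in candidates)
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

The script only looks for executables on PATH. It does not verify that the llama3.2-vision model has been pulled, and a Marker install outside PATH still needs --marker_path.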
The simplest approach—ideal if you just want to use LlaMarker rather than develop it:
pip install llamarker
- Requires: Python 3.10+
- After installing, you have access to two main commands:
  - llamarker: CLI tool.
  - llamarker_gui: Streamlit-based GUI for interactive use.
Note: LibreOffice, Marker, and any optional OCR components must be installed separately if you plan to use their respective features.
If you plan to contribute or dive into the source code:
- Clone the repository:
  git clone https://github.com/RevanKumarD/LlaMarker.git
  cd LlaMarker
- Install dependencies using Poetry:
  poetry install
- Run LlaMarker locally:
  - CLI:
    poetry run python llamarker/llamarker.py --directory <directory_path>
  - GUI:
    poetry run streamlit run llamarker/llamarker_gui.py
No requirements.txt is provided; Poetry is the recommended (and supported) method for local development.
- Process a folder:
  llamarker --directory <directory_path>
- Process a single file:
  llamarker --file <file_path>
- Local development (CLI via Poetry, from a cloned repository):
  poetry run python llamarker/llamarker.py --directory <directory_path>
A user-friendly interface to upload files/directories, parse them, and download results.
- Installed via PyPI:
  llamarker_gui
- Local Development:
  poetry run streamlit run llamarker/llamarker_gui.py
Open the link (e.g., http://localhost:8501) in your browser to start using LlaMarker via the GUI.
| Argument | Description |
|---|---|
| --directory | Root directory containing documents to process. |
| --file | Path to a single file to process (optional). |
| --temp_dir | Temporary directory for intermediate files (optional). |
| --save_pdfs | Flag to save PDFs in a separate directory (PDFs) under the root directory. |
| --output | Directory to save output files (optional). By default, parsed Markdown files are stored in ParsedFiles and images go under ParsedFiles/pics. |
| --marker_path | Path to the Marker executable (optional). Auto-detects if Marker is in your PATH. |
| --force_ocr | Force OCR on all pages, even if text is extractable. Useful for poorly formatted PDFs or PPTs. |
| --languages | Comma-separated list of languages for OCR (default: "en"). |
| --qa_evaluator | Enable QA Evaluator for selecting the best response during image processing. |
| --verbose | Set verbosity level: 0 = WARNING, 1 = INFO, 2 = DEBUG (default: 0). |
| --model | Ollama model for image analysis (default: llama3.2-vision). A local vision model is required for this to work. |
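
To make the defaults and flag types in the table concrete, here is a rough argparse sketch that mirrors it. This is not LlaMarker's actual parser: the program name `llamarker-sketch`, the helper `parse_languages`, and the choice of `store_true` and `int` types are assumptions read off the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the argument table above."""
    p = argparse.ArgumentParser(prog="llamarker-sketch")
    p.add_argument("--directory", help="Root directory containing documents")
    p.add_argument("--file", help="Path to a single file to process")
    p.add_argument("--temp_dir", help="Directory for intermediate files")
    p.add_argument("--save_pdfs", action="store_true", help="Keep converted PDFs")
    p.add_argument("--output", help="Directory for output files")
    p.add_argument("--marker_path", help="Path to the Marker executable")
    p.add_argument("--force_ocr", action="store_true", help="OCR every page")
    p.add_argument("--languages", default="en", help="Comma-separated OCR languages")
    p.add_argument("--qa_evaluator", action="store_true", help="Enable QA Evaluator")
    p.add_argument("--verbose", type=int, choices=[0, 1, 2], default=0)
    p.add_argument("--model", default="llama3.2-vision", help="Ollama vision model")
    return p


def parse_languages(value):
    """Split a comma-separated language string into a clean list."""
    return [lang.strip() for lang in value.split(",") if lang.strip()]
```

For example, `build_parser().parse_args(["--directory", "docs", "--languages", "en,de"])` leaves `--verbose` at 0 and `--model` at llama3.2-vision, matching the defaults listed in the table.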
- Directory processing:
  llamarker --directory /path/to/documents
- Single file with verbose output:
  llamarker --file /path/to/document.docx --verbose 2
- Parsing with OCR in multiple languages:
  llamarker --directory /path/to/docs --force_ocr --languages "en,de,fr"
- Save converted PDFs and write output to a custom folder:
  llamarker --directory /path/to/docs --save_pdfs --output /path/to/output
After processing, LlaMarker organizes files as follows:
- ParsedFiles
  - Contains the generated Markdown files.
  - pics: subfolder for extracted images.
- PDFs
  - Stores converted PDF files (if --save_pdfs is used).
- OutDir
  - Contains processed PDF files (used by the GUI).
- logs
  - Holds log files for each run (processing status, errors, etc.).
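
If you want to pick up the results from a script, the layout above can be walked with pathlib. This is a minimal sketch under stated assumptions: `collect_outputs` is a hypothetical helper, `root` is whichever directory ends up containing ParsedFiles, and the Markdown files are assumed to use the .md extension and sit directly inside ParsedFiles.

```python
from pathlib import Path


def collect_outputs(root):
    """Return sorted Markdown files and extracted images under ParsedFiles."""
    parsed = Path(root) / "ParsedFiles"
    # Assumption: Markdown files live directly in ParsedFiles with a .md suffix.
    markdown = sorted(p for p in parsed.glob("*.md") if p.is_file())
    # Everything under ParsedFiles/pics is treated as an extracted image.
    images = sorted(p for p in (parsed / "pics").rglob("*") if p.is_file())
    return markdown, images
```

Both lists are empty when the folders do not exist yet, so the helper is safe to call before the first run.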
For local development, you can programmatically use LlaMarker:
from llamarker import LlaMarker

llamarker = LlaMarker(
    input_dir="/path/to/documents",
    save_pdfs=True,
    output_dir="/path/to/output",
    verbose=1
)

# Process all documents in the specified directory
llamarker.process_documents()

# Generate summary info
results = llamarker.generate_summary()
for file, page_count in results:
    print(f"{file}: {page_count} pages")

# Generate analysis plots
llamarker.plot_analysis(llamarker.parent_dir)
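
As a follow-up to the summary loop above, the page counts can be condensed into a few headline numbers. This sketch assumes that `generate_summary()` yields (file, page_count) pairs, as the loop suggests; `summarize_page_counts` is a hypothetical helper and not part of the LlaMarker API.

```python
def summarize_page_counts(results):
    """Aggregate (file, page_count) pairs into a few headline numbers."""
    pairs = [(str(name), int(pages)) for name, pages in results]
    if not pairs:
        return {"documents": 0, "total_pages": 0, "largest": None}
    # The document with the highest page count; ties keep the first one seen.
    largest = max(pairs, key=lambda item: item[1])
    return {
        "documents": len(pairs),
        "total_pages": sum(pages for _, pages in pairs),
        "largest": largest,
    }
```

Passing in the `results` value from the example gives a quick sanity check on a batch before looking at the generated plots.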
- Limited OCR Accuracy for Complex Documents
  - While OCR works well for most cases, it may struggle with highly complex layouts or poorly scanned documents.
- No Direct Cloud Integration
  - Currently, LlaMarker only supports local processing. There’s no option to process files directly from cloud storage services like Google Drive or Dropbox.
- Basic Support for PPT and XLSX Parsing
  - Parsing of PPT and XLSX files is available but lacks advanced formatting support (e.g., slide layouts, complex charts).
- Poor XLSX to PDF Conversion
  - The current conversion of XLSX files to PDF results in poorly formatted output. Improvements are needed to handle large spreadsheets and complex tables.
- Manual Setup for Marker and LibreOffice
  - Users must manually install Marker and LibreOffice, which can be cumbersome for those unfamiliar with the setup process.
- Enhanced OCR Capabilities
  - Improve OCR performance by integrating additional vision models for better handling of complex document layouts and multi-column formats.
- Cloud Storage Integration
  - Add support for uploading documents directly from cloud services (Google Drive, Dropbox, OneDrive).
- Improved PPT & XLSX Handling
  - Enhance parsing accuracy for PPT and XLSX files by adding better support for slides, tables, and embedded charts.
- Better XLSX to PDF Conversion
  - Improve the XLSX to PDF conversion process to handle large sheets and complex tables while maintaining proper formatting.
- Cross-Platform Installation Script
  - Provide an easy-to-use installation script for all platforms (Linux, Windows, macOS) to automate the setup of dependencies like Marker and LibreOffice.
Contributions are welcome! Feel free to open an issue or submit a pull request. Let’s make LlaMarker even more powerful—together. 🤝
This project references the Marker repository, which comes with its own license. Please review the Marker repo for licensing restrictions and guidelines.
© 2025 Revan Kumar Dhanasekaran. Released under the GPLv3 License.
- Huge thanks to the Marker project for providing an excellent foundation for parsing.
- Special thanks to the open-source community for continuous support and contributions.
Happy Parsing! 🌟