🖍️ LlaMarker

Your go-to tool for converting and parsing documents into clean, well-structured Markdown!
Fast, intuitive, and entirely local 💻🚀.

✨ Key Features

✨ All-in-One Parsing
Supports TXT, DOCX, PDF, PPT, XLSX, and more—even processes images inside documents.
🖼️ Visual Content Extraction
Utilizes Llama 3.2 Vision to detect images, tables, charts, and diagrams, converting them into rich Markdown.
🏗️ Built with Marker
Extends the open-source Marker parser to handle complex content types locally.
🛡️ Local-First Privacy
No cloud, no external servers—all processing happens on your machine.

🚀 How It Works

Parsing & Conversion
- Parses and converts multiple file types (.txt, .docx, .pdf, .ppt, .xlsx, etc.) into Markdown.
- Leverages Marker for accurate and efficient parsing of both text and visual elements.
- Extracts images, charts, and tables, embedding them in Markdown.
- (Optional) Converts documents into PDFs using LibreOffice for easy viewing.
Visual Analysis
- Distinguishes logos from content-rich images.
- Extracts and preserves the original language from images.
- Uses multiple agents to extract useful information from the images.
Fast & Efficient
- Supports parallel processing for faster handling of large folders.
Streamlit GUI
- A user-friendly interface for uploading and parsing a single file, several files at once, or an entire directory.
- Download results directly from the GUI.

📑 Table of Contents

Features
Prerequisites
Installation Options
- Install via PyPI
- Local Development Setup
Basic Usage
- CLI Usage
- Streamlit GUI
Advanced Usage
- Command-Line Arguments
- Example Commands
Output Structure
Code Example
Shortcomings & Future Updates
Contributing
License
Acknowledgments

✨ Features

📄 Document Conversion
Converts .txt, .docx, and other supported file types into .pdf using LibreOffice (optional if you only need to parse PDFs).
📊 Page Counting
Automatically counts pages in PDFs using PyPDF2.
🖼️ Image Processing
Analyzes images to differentiate logos from content-rich images. Extracts relevant data and updates the corresponding Markdown file.
✍️ Markdown Parsing
Uses Marker to generate clean, structured Markdown files from parsed PDFs.
🌐 Multilingual Support
Maintains the original language of the content during extraction.
📈 Data Visualization
Generates analysis plots based on the page counts of processed documents.
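
The Document Conversion feature above relies on LibreOffice. The following is a minimal sketch of that step, assuming LibreOffice's standard headless `--convert-to` command line. The function names `build_convert_command` and `convert_to_pdf` are hypothetical, and this is not LlaMarker's internal code:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: Path, out_dir: Path, soffice: str = "soffice") -> list[str]:
    """Build a headless LibreOffice command that converts `source` to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path) -> Path:
    """Convert one document to PDF and return the expected output path."""
    # The binary is usually `soffice`; some Linux packages also expose `libreoffice`.
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        raise FileNotFoundError("LibreOffice was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(source, out_dir, soffice), check=True)
    # LibreOffice names the output after the input file's stem.
    return out_dir / (source.stem + ".pdf")
```

Keeping the command builder separate from the subprocess call makes it easy to inspect the exact command before running it.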

⚙️ Prerequisites

Before installing or running LlaMarker, please ensure you meet the following requirements:

Python 3.10+
- Core language for running LlaMarker.
- Verify your Python version:
```
python --version
```
Marker
- Marker is an open-source parser that LlaMarker extends.
- To install Marker, follow these steps:
  1. Clone the repository:
```
git clone https://github.com/VikParuchuri/marker.git
cd marker
```
  2. Install Marker in editable mode:
```
pip install -e .
```
  3. Verify the installation:
```
marker --help
```
- GPU Support: If you plan to leverage GPUs, ensure PyTorch is installed with CUDA support (e.g., via pytorch-cuda or the official PyTorch distribution).
- Path Configuration: If Marker is not in your PATH, ensure you specify its location with the --marker_path argument.
LibreOffice
- Required for converting .docx, .ppt, .xlsx, etc., into .pdf before parsing.
- Linux (Ubuntu/Debian example):
```
sudo apt update
sudo apt install libreoffice
```
- Windows:
  Download the installer and consider adding LibreOffice to your system PATH.
- macOS:
  - Download from LibreOffice’s website or
  - Use Homebrew:
```
brew install --cask libreoffice
```
Ollama & Vision Models
- Install Ollama for your OS.
- Pull the required model:
```
ollama pull llama3.2-vision
```
- Run the model once (for example, `ollama run llama3.2-vision`) to confirm it is set up correctly.
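
To confirm from a script that the vision model has been pulled, one option is to query the local Ollama server. This is a sketch, assuming Ollama is running on its default address (`http://localhost:11434`) and exposes the `/api/tags` model listing; the helper names `has_model` and `model_is_pulled` are hypothetical:

```python
import json
import urllib.request


def has_model(tags_payload: dict, wanted: str) -> bool:
    """Return True if `wanted` appears among the models in an /api/tags payload."""
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        # Names carry a tag suffix such as ":latest"; compare on the base name too.
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def model_is_pulled(wanted: str = "llama3.2-vision",
                    host: str = "http://localhost:11434") -> bool:
    """Ask a locally running Ollama server which models are available."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as response:
        return has_model(json.load(response), wanted)
```

`model_is_pulled()` raises a `urllib.error.URLError` if the server is not running, which is itself a useful signal during setup.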
Poetry (for local development only)
- Recommended dependency manager if you’re cloning the repository to develop or modify LlaMarker.
- Linux/Mac:
```
curl -sSL https://install.python-poetry.org | python3 -
# (If not added to PATH automatically)
export PATH="$HOME/.local/bin:$PATH"
```
- macOS (Homebrew):
```
brew install poetry
```
- Windows:
  Follow instructions on Poetry’s official site.
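
The prerequisites above can be checked in one pass before installing LlaMarker. This is a minimal sketch, not part of LlaMarker itself; the executable names in `REQUIRED_TOOLS` are assumptions and may differ on your system (LibreOffice, for instance, is often installed as `soffice`):

```python
import shutil
import sys

# Executable names are assumptions: adjust them to match your installation.
REQUIRED_TOOLS = {
    "Marker": ["marker"],
    "LibreOffice": ["soffice", "libreoffice"],
    "Ollama": ["ollama"],
}


def python_ok(version=sys.version_info, minimum=(3, 10)) -> bool:
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version[:2]) >= minimum


def find_tools(which=shutil.which) -> dict:
    """Map each tool to the first matching executable path, or None."""
    found = {}
    for tool, candidates in REQUIRED_TOOLS.items():
        found[tool] = next((path for path in map(which, candidates) if path), None)
    return found


if __name__ == "__main__":
    print(f"Python 3.10+: {'ok' if python_ok() else 'too old'}")
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If Marker is reported as missing but is installed elsewhere, pass its location with `--marker_path` instead of changing your PATH.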

🚀 Installation Options

1. Install via PyPI

The simplest approach—ideal if you just want to use LlaMarker rather than develop it:

```
pip install llamarker
```

Requires: Python 3.10+
After installing, you have access to two main commands:
1. llamarker — CLI tool.
2. llamarker_gui — Streamlit-based GUI for interactive use.

Note: LibreOffice, Marker, and any optional OCR components must be installed separately if you plan to use their respective features.

2. Local Development Setup

If you plan to contribute or dive into the source code:

Clone the repository:

```
git clone https://github.com/RevanKumarD/LlaMarker.git
cd LlaMarker
```

Install dependencies using Poetry:
```
poetry install
```

Run LlaMarker locally:

CLI:
```
poetry run python llamarker/llamarker.py --directory <directory_path>
```
GUI:
```
poetry run streamlit run llamarker/llamarker_gui.py
```

No requirements.txt is provided; Poetry is the recommended (and supported) method for local development.

📌 Basic Usage

CLI Usage

Installed via PyPI

Process a folder:
```
llamarker --directory <directory_path>
```
Process a single file:
```
llamarker --file <file_path>
```

Local Development

CLI:
```
poetry run python llamarker/llamarker.py --directory <directory_path>
```

Streamlit GUI

A user-friendly interface to upload files/directories, parse them, and download results.

Installed via PyPI:
```
llamarker_gui
```

Local Development:
```
poetry run streamlit run llamarker/llamarker_gui.py
```

Open the link (e.g., http://localhost:8501) in your browser to start using LlaMarker via GUI.

🔧 Advanced Usage

Command-Line Arguments

| Argument | Description |
|---|---|
| `--directory` | Root directory containing documents to process. |
| `--file` | Path to a single file to process (optional). |
| `--temp_dir` | Temporary directory for intermediate files (optional). |
| `--save_pdfs` | Flag to save PDFs in a separate directory (`PDFs`) under the root directory. |
| `--output` | Directory to save output files (optional). By default, parsed Markdown files are stored in `ParsedFiles` and images go under `ParsedFiles/pics`. |
| `--marker_path` | Path to the Marker executable (optional). Auto-detected if Marker is in your `PATH`. |
| `--force_ocr` | Force OCR on all pages, even if text is extractable. Useful for poorly formatted PDFs or PPTs. |
| `--languages` | Comma-separated list of languages for OCR (default: `"en"`). |
| `--qa_evaluator` | Enable QA Evaluator for selecting the best response during image processing. |
| `--verbose` | Set verbosity level: 0 = WARNING, 1 = INFO, 2 = DEBUG (default: 0). |
| `--model` | Ollama model for image analysis (default: `llama3.2-vision`). A local vision model is required for this to work. |
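
When LlaMarker is driven from another script, the documented flags can be assembled programmatically. The helper below is a hypothetical sketch (the name `build_command` and its keyword arguments are not part of LlaMarker); it only emits the flags listed above and assumes the `llamarker` command is on your PATH:

```python
def build_command(directory=None, file=None, output=None, save_pdfs=False,
                  force_ocr=False, languages=None, verbose=0, model=None):
    """Assemble an argv list for the `llamarker` CLI from the documented flags."""
    if directory is None and file is None:
        raise ValueError("pass a directory or a file to process")
    cmd = ["llamarker"]
    if directory is not None:
        cmd += ["--directory", str(directory)]
    if file is not None:
        cmd += ["--file", str(file)]
    if output is not None:
        cmd += ["--output", str(output)]
    # Boolean switches are added only when enabled.
    if save_pdfs:
        cmd.append("--save_pdfs")
    if force_ocr:
        cmd.append("--force_ocr")
    if languages:
        # The CLI expects one comma-separated value, e.g. "en,de,fr".
        cmd += ["--languages", ",".join(languages)]
    if verbose:
        cmd += ["--verbose", str(verbose)]
    if model is not None:
        cmd += ["--model", model]
    return cmd
```

Pass the returned list to `subprocess.run(cmd, check=True)` to execute it. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.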

Example Commands

Directory processing:
```
llamarker --directory /path/to/documents
```
Single file with verbose output:
```
llamarker --file /path/to/document.docx --verbose 2
```
Parsing with OCR in multiple languages:
```
llamarker --directory /path/to/docs --force_ocr --languages "en,de,fr"
```
Save converted PDFs and write output to a custom folder:
```
llamarker --directory /path/to/docs --save_pdfs --output /path/to/output
```

Output Structure

After processing, LlaMarker organizes files as follows:

ParsedFiles
- Contains the generated Markdown files.
- pics — subfolder for extracted images.
PDFs
- Stores converted PDF files (if --save_pdfs is used).
OutDir
- Contains processed PDF files (used by the GUI).
logs
- Holds log files for each run (processing status, errors, etc.).
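
To pick up the results from a script, the layout above can be walked with `pathlib`. This is a sketch, assuming the default folder names (`ParsedFiles` and `ParsedFiles/pics`) and Markdown files ending in `.md`; `collect_outputs` is a hypothetical helper, not part of LlaMarker:

```python
from pathlib import Path


def collect_outputs(output_root):
    """Return (markdown_files, image_files) found under a ParsedFiles folder."""
    parsed = Path(output_root) / "ParsedFiles"
    if not parsed.is_dir():
        return [], []
    # Search recursively in case Markdown files are grouped into subfolders.
    markdown_files = sorted(parsed.rglob("*.md"))
    pics = parsed / "pics"
    image_files = []
    if pics.is_dir():
        image_files = sorted(p for p in pics.rglob("*") if p.is_file())
    return markdown_files, image_files
```

If you pass `--output`, point `output_root` at whichever directory ends up containing `ParsedFiles`.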

Code Example

For local development, you can programmatically use LlaMarker:

```
from llamarker import LlaMarker

llamarker = LlaMarker(
    input_dir="/path/to/documents",
    save_pdfs=True,
    output_dir="/path/to/output",
    verbose=1
)

# Process all documents in the specified directory
llamarker.process_documents()

# Generate summary info
results = llamarker.generate_summary()
for file, page_count in results:
    print(f"{file}: {page_count} pages")

# Generate analysis plots
llamarker.plot_analysis(llamarker.parent_dir)
```
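
The summary loop above iterates over `(file, page_count)` pairs. Assuming that shape holds, a small helper can condense the results before plotting; `summarize` is a hypothetical function and not part of LlaMarker's API:

```python
def summarize(results):
    """Summarize (file, page_count) pairs into a few headline numbers."""
    pairs = list(results)
    if not pairs:
        return {"documents": 0, "total_pages": 0, "average_pages": 0.0, "largest": None}
    counts = [count for _, count in pairs]
    # The document with the highest page count.
    largest = max(pairs, key=lambda pair: pair[1])[0]
    return {
        "documents": len(pairs),
        "total_pages": sum(counts),
        "average_pages": sum(counts) / len(counts),
        "largest": largest,
    }
```

For example, `summarize(llamarker.generate_summary())` would give a quick overview of a processed folder.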

🚧 Shortcomings & Future Updates

Current Shortcomings:

Limited OCR Accuracy for Complex Documents
- While OCR works well for most cases, it may struggle with highly complex layouts or poorly scanned documents.
No Direct Cloud Integration
- Currently, LlaMarker only supports local processing. There’s no option to process files directly from cloud storage services like Google Drive or Dropbox.
Basic Support for PPT and XLSX Parsing
- Parsing of PPT and XLSX files is available but lacks advanced formatting support (e.g., slide layouts, complex charts).
Poor XLSX to PDF Conversion
- The current conversion of XLSX files to PDF results in poorly formatted output. Improvements are needed to handle large spreadsheets and complex tables.
Manual Setup for Marker and LibreOffice
- Users must manually install Marker and LibreOffice, which can be cumbersome for those unfamiliar with the setup process.

Planned Future Updates:

Enhanced OCR Capabilities
- Improve OCR performance by integrating additional vision models for better handling of complex document layouts and multi-column formats.
Cloud Storage Integration
- Add support for uploading documents directly from cloud services (Google Drive, Dropbox, OneDrive).
Improved PPT & XLSX Handling
- Enhance parsing accuracy for PPT and XLSX files by adding better support for slides, tables, and embedded charts.
Better XLSX to PDF Conversion
- Improve the XLSX to PDF conversion process to handle large sheets, complex tables, and maintain proper formatting.
Cross-Platform Installation Script
- Provide an easy-to-use installation script for all platforms (Linux, Windows, macOS) to automate the setup of dependencies like Marker and LibreOffice.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request. Let’s make LlaMarker even more powerful—together. 🤝

License

This project references the Marker repository, which comes with its own license. Please review the Marker repo for licensing restrictions and guidelines.

Acknowledgments

Huge thanks to the Marker project for providing an excellent foundation for parsing.
Special thanks to the open-source community for continuous support and contributions.

Happy Parsing! 🌟