GitHub · Hugging Face · Paper · Data · Demo
Arabic-Nougat is a suite of Optical Character Recognition (OCR) models designed to extract structured text in Markdown format from Arabic book pages. This repository provides tools for fine-tuning, evaluation, dataset preparation, and tokenizer analysis for the Arabic-Nougat models, which build on Meta's Nougat architecture.
Arabic-Nougat is tailored to process Arabic text, handling the unique challenges of the script, such as its cursive nature and contextual letter forms. It extends Meta's Nougat OCR model with custom enhancements:
- Advanced Tokenization: Includes the `Aranizer-PBE-86k` tokenizer, optimized for Arabic text (see the loading sketch after this list).
- Extended Context Length: Supports up to 8192 tokens, suitable for processing lengthy documents.
- Dataset: Uses the synthetic `arabic-img2md` dataset, designed to train models for Markdown extraction.
- Fine-tune the model on custom datasets.
- Evaluate models using standard metrics like BLEU, CER, and WER.
- Generate synthetic PDFs and Markdown from HTML for dataset creation.
- Analyze tokenizer performance and efficiency.
- Pretrained models: `arabic-small-nougat`, `arabic-base-nougat`, and `arabic-large-nougat`.
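As a quick orientation, the sketch below loads the Aranizer tokenizer and tokenizes an Arabic sentence. The Hub repo id `riotu-lab/Aranizer-PBE-86k` is an assumption and may need adjusting:

```python
# Minimal sketch: inspect how the Arabic-aware tokenizer segments a sentence.
# The repo id "riotu-lab/Aranizer-PBE-86k" is an assumption -- adjust if needed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-86k")

text = "مرحبا بكم في عالم التعرف الضوئي على الحروف العربية"
tokens = tokenizer.tokenize(text)
print(f"{len(tokens)} tokens: {tokens}")
```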
Repository structure:

```
├── eval_model.py        # Evaluate Arabic-Nougat models
├── finetune_nougat.py   # Fine-tune the Nougat model
├── pdf_generation.py    # Generate synthetic PDFs and Markdown
├── tokenizer_ratios.py  # Analyze tokenizer performance
└── try_nougat.py        # Test the model on a sample image
```
- Python 3.8 or higher.
- A machine with multiple GPUs for fine-tuning and evaluation.
- Install the required Python libraries:

  ```bash
  pip install transformers datasets evaluate weasyprint html2text pdf2image pillow tabulate bidi arabic-reshaper colorama filelock tqdm
  ```

- Install system dependencies for `weasyprint`:

  ```bash
  sudo apt install libpango1.0-dev libpangocairo-1.0-0
  ```

- Ensure that you have `huggingface-cli` installed and that you are logged in:

  ```bash
  pip install huggingface-hub
  huggingface-cli login
  ```
Evaluate the performance of Arabic-Nougat models on a dataset:

```bash
python eval_model.py
```
- Metrics calculated:
- BLEU
- Character Error Rate (CER)
- Word Error Rate (WER)
  - Markdown Structure Accuracy (a custom metric; see the paper for details)
Note: This will not reproduce the exact numbers reported in the paper, since the evaluation dataset is not public.
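For reference, here is a minimal sketch of how the standard metrics can be computed with the `evaluate` library (the CER/WER metrics additionally require the `jiwer` package; the custom Markdown Structure Accuracy metric is not shown):

```python
# Sketch: scoring predictions against references with the evaluate library.
# CER/WER require jiwer (pip install jiwer). Example strings are illustrative.
import evaluate

predictions = ["# الفصل الأول\nالنص المستخرج من الصفحة"]
references = ["# الفصل الأول\nالنص الأصلي من الصفحة"]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
wer = evaluate.load("wer").compute(predictions=predictions, references=references)
print(f"BLEU: {bleu['bleu']:.3f}  CER: {cer:.3f}  WER: {wer:.3f}")
```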
Fine-tune the Nougat model on the `arabic-img2md` dataset:

```bash
accelerate launch --multi_gpu --num_processes 4 finetune_nougat.py
```
- Configurations:
- Context length: 8192 tokens.
- Data collators and efficient gradient accumulation.
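For orientation, a rough sketch of the kind of `Seq2SeqTrainingArguments` such a run might use follows; the hyperparameters and output path are illustrative assumptions, not the exact configuration in `finetune_nougat.py`:

```python
# Illustrative sketch only -- not the exact configuration of finetune_nougat.py.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arabic-nougat-finetune",   # assumed output path
    per_device_train_batch_size=1,         # vision-encoder-decoder samples are large
    gradient_accumulation_steps=8,         # "efficient gradient accumulation"
    bf16=True,                             # assumes bf16-capable GPUs
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="epoch",
)
```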
Create synthetic datasets from HTML content:
```bash
python pdf_generation.py
```
- Converts HTML to PDFs and Markdown for training datasets.
- Configurable fonts, sizes, and page layouts for diversity.
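The core of this pipeline can be sketched in a few lines: render one HTML snippet to a PDF (the training input) and to Markdown (the training target). The HTML, font, and file names below are illustrative:

```python
# Sketch: one HTML snippet -> (PDF page, Markdown target) pair.
import html2text
from weasyprint import HTML

html = """
<html dir="rtl"><body style="font-family: 'Amiri'; font-size: 18px;">
<h1>الفصل الأول</h1><p>هذا نص تجريبي لصفحة كتاب عربي.</p>
</body></html>
"""

HTML(string=html).write_pdf("sample_page.pdf")  # rendered page, later rasterized as input
markdown = html2text.html2text(html)            # corresponding Markdown target
print(markdown)
```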
Compare tokenization efficiency between different tokenizers:
```bash
python tokenizer_ratios.py
```
- Outputs the average tokenization ratio between models like `arabic-large-nougat` and `arabic-base-nougat`.
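The comparison reduces to counting tokens per text under each tokenizer, roughly as sketched below. The Hub repo ids are assumptions; substitute the actual model repositories:

```python
# Sketch: average token-count ratio between two tokenizers on sample texts.
# Repo ids are assumptions -- substitute the published Hugging Face repos.
from transformers import AutoTokenizer

tok_a = AutoTokenizer.from_pretrained("MohamedRashad/arabic-large-nougat")
tok_b = AutoTokenizer.from_pretrained("MohamedRashad/arabic-base-nougat")

texts = ["النص الأول للمقارنة", "نص آخر أطول قليلا لقياس كفاءة الترميز"]
ratios = [len(tok_a(t).input_ids) / len(tok_b(t).input_ids) for t in texts]
print(f"average ratio: {sum(ratios) / len(ratios):.3f}")
```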
Test the model on a sample image:
```bash
python try_nougat.py
```
- Input: Path to an image of a book page (the script provides a default).
- Output: Extracted Markdown text.
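Inference follows the standard `transformers` Nougat pipeline, sketched below under the assumption that the checkpoints are published on the Hub (the repo id shown is an assumption):

```python
# Sketch: Markdown extraction from a single page image.
# The repo id "MohamedRashad/arabic-base-nougat" is an assumption.
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

model_id = "MohamedRashad/arabic-base-nougat"
processor = NougatProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

image = Image.open("page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
outputs = model.generate(pixel_values, max_new_tokens=2048)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```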
The following pretrained models are available:
- arabic-small-nougat
  - Fine-tuned from `facebook/nougat-small`.
  - Smaller context length (2048 tokens).
- arabic-base-nougat
  - Fine-tuned from `facebook/nougat-base`.
  - Context length: 4096 tokens.
- arabic-large-nougat
  - Built with the `Aranizer-PBE-86k` tokenizer.
  - Extended context length: 8192 tokens.
Access the models on Hugging Face.
- A synthetic dataset of 13.7k samples containing:
- Input: Images of Arabic book pages.
- Output: Corresponding Markdown text.
Dataset available on Hugging Face.
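A minimal loading sketch with the `datasets` library is shown below; the repo id and column names are assumptions to adjust against the actual dataset card:

```python
# Sketch: loading the arabic-img2md dataset from the Hugging Face Hub.
# Repo id and column names ("image", "markdown") are assumptions.
from datasets import load_dataset

dataset = load_dataset("MohamedRashad/arabic-img2md", split="train")
sample = dataset[0]
sample["image"].save("example_page.png")  # page image (PIL)
print(sample["markdown"][:200])           # paired Markdown text
```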
This project is released under the Creative Commons Attribution-ShareAlike (CC BY-SA) license. Feel free to use, share, and adapt the work, provided proper attribution is given and adaptations are shared under the same terms.