An open-source tool that converts handwritten math to LaTeX code!
From top to bottom: Ground truth, predicted LaTeX.
On test samples from the CROHME 2013 handwritten mathematical expression competition:
On a validation image handwritten on a tablet:
- Handwriting Recognition: Uses a deep learning model trained on a large dataset of handwritten mathematical symbols and equations.
- LaTeX Conversion: Converts recognized handwriting into LaTeX code, ready to be used in your documents.
- Open-Source: All the details of our implementation can be found in this repository.
To get started with Math2LaTeX, you need Python 3.6+ and PyTorch 1.0+ installed. To fine-tune your own model, follow the instructions below to set up the dataset.
1. Go to Kaggle's Handwritten Mathematical Expressions and download the dataset. Move `archive.zip` into the `Math2LaTeX` directory.
2. Run the following:

       conda create -n latexocr python==3.11
       conda activate latexocr
       pip install -r requirements.txt
       bash ./setup.sh

3. You should see `all checks passed` after running `setup.sh`.
4. Images can be found in `img_data`, and image name / label pairs are in `img_data/labels.csv`.
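
Once the dataset is set up, the image name / label pairs can be inspected with a few lines of Python. The following is a minimal sketch, not code from this repository. It assumes that the first column of `labels.csv` holds the image file name and the second holds the LaTeX label, and that the file starts with a header row. The helper names `load_label_pairs` and `missing_images` are hypothetical. Check your own `labels.csv` before relying on these assumptions.

```python
import csv
from pathlib import Path


def load_label_pairs(labels_csv, has_header=True):
    """Return a list of (image_name, label) tuples read from a labels CSV.

    Assumption: column 0 is the image file name and column 1 is the LaTeX
    label. The real column layout of labels.csv may differ.
    """
    pairs = []
    with open(labels_csv, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row, if the file has one
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed rows
            pairs.append((row[0], row[1]))
    return pairs


def missing_images(pairs, image_dir):
    """Return the image names listed in the CSV that are absent from image_dir."""
    image_dir = Path(image_dir)
    return [name for name, _ in pairs if not (image_dir / name).is_file()]
```

For example, `pairs = load_label_pairs("img_data/labels.csv")` followed by `missing_images(pairs, "img_data")` lists any labelled images that did not make it onto disk, which is a quick sanity check before fine-tuning.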
- BLIP baseline.
- TrOCR experiments.
- Handwritten text data for evaluation.
- Pretrain on additional rendered LaTeX data found at https://zenodo.org/api/records/56198/files-archive.
- RCNN + TrOCR segmentation-OCR pipeline.
- Model distillation and quantization.
- Restructure the code into a Python package.
To begin, run `train_TrOCR.ipynb` in `scripts`. Scroll down to the "Validation on REAL Handwritten Digits" header to run the model on your own validation images.
Call `python scripts/train_TrOCR.py` with the `--gpu` flag to indicate which GPU to use. Default is 0.
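
To illustrate how a flag like `--gpu` is commonly wired up, here is a minimal sketch using `argparse`. It is not the actual code in `scripts/train_TrOCR.py`, whose option handling may differ. The helper names `parse_args` and `device_name` are hypothetical, and the CPU fallback is an assumption added for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --gpu defaults to 0, matching the text above."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument(
        "--gpu",
        type=int,
        default=0,
        help="index of the GPU to train on (default: 0)",
    )
    return parser.parse_args(argv)


def device_name(gpu_index, cuda_available):
    """Build a PyTorch-style device string, falling back to the CPU (assumed behavior)."""
    return f"cuda:{gpu_index}" if cuda_available else "cpu"
```

In a training script, the result would typically be passed to PyTorch, for example `torch.device(device_name(args.gpu, torch.cuda.is_available()))`, so that `--gpu 1` selects `cuda:1`.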
Contributions to LaTeX-OCR are welcome! Whether it's bug reports, feature requests, or new code, we appreciate all help.