Typst Mathematical Expression OCR based on TrOCR.
Clone this repo and enter it:
git clone https://github.com/ParaN3xus/typress
cd typress
Install dependencies:
We use Poetry to manage project dependencies. If you don't have Poetry installed, please follow the instructions on the Poetry installation page.
poetry install
poetry shell
Ensure you are in the repo root directory and execute:
python -m typress
Create a .env file in the repo root directory with the following content:
MODEL_PATH=path/to/your/model
API_ROOT_URL=https://api.example.com/typress
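For illustration, the two variables above could be read at startup along the following lines. This is a minimal sketch using only the standard library, not the loader typress actually uses; the helper names `load_env_file` and `get_settings` are hypothetical, and the precedence rule (real environment variables override the file) is an assumption.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        # No .env file: return an empty mapping instead of failing.
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_settings(path=".env"):
    """Return MODEL_PATH and API_ROOT_URL (hypothetical helper).

    Assumption: variables already set in the process environment take
    precedence over the values in the file. Missing keys map to None.
    """
    file_values = load_env_file(path)
    return {
        name: os.environ.get(name, file_values.get(name))
        for name in ("MODEL_PATH", "API_ROOT_URL")
    }
```

Calling `get_settings()` from the repo root returns a dict with both keys, which makes a missing or misspelled variable easy to spot before the server starts.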
To run the application in production mode, we recommend using a production-grade WSGI server such as gunicorn:
gunicorn --bind 0.0.0.0:8000 wsgi:app
- Improve the tex2typ reconstruction strategy for spacing.
- Fix memory leaks in normalized formulas
- Fix memory leaks in formula detection
- Add formula detection
- Explore using LoRA to fine-tune the OCR model for TeX
- Publish to PyPI
- Document the complete dataset construction process
- Train using Seq2SeqTrainer
If you have a collection of Typst mathematical formula text (which can be included in Typst documents), you can create a dataset by running the following command in the Typst workspace root:
python -m typress.dataset extract
Then, submit the generated out.json file to us via email at paran3xus007@gmail.com. By submitting your data to us, you agree to make your dataset publicly available.
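Before sending the file, it can be worth confirming that it parses as JSON and is not empty. The sketch below is a generic check under stated assumptions: the schema of out.json is not documented in this README, so the code only reports the top-level type and entry count, and the helper name `summarize_json_file` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json_file(path="out.json"):
    """Return (top-level type name, number of entries) for a JSON file.

    Raises json.JSONDecodeError if the file is not valid JSON.
    The structure of out.json is an unknown here, so no schema is checked.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Lists and dicts have a natural entry count; a scalar counts as one.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size
```

Running `summarize_json_file("out.json")` in the directory where the extract command was run gives a quick indication of whether any formulas were collected.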
We welcome any code contributions, including bug fixes, feature additions, etc. If you're unsure where to start, you can refer to our Todo list.
This repository is published under the MIT License. See the LICENSE file for details.
This project makes use of the following open-source projects or datasets:
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models.
- transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
- evaluate: A library for easily evaluating machine learning models and datasets.
- eq_query_rec: Query equations from Typst source file and reconstruct normalized equation from querying result.
- typst.ts: Run Typst in JavaScriptWorld.
- texteller_det: Formula detection.
- fusion-image-to-latex-datasets: The largest dataset to date from online sources.
- latex-formulas: TexTeller previous dataset.
Thanks to the developers and contributors of these projects for their hard work and dedication.
Thanks to sjfhsjfh, Naptie, Mivik for providing Typst mathematical formula data.