Typress

Typst Mathematical Expression OCR based on TrOCR.

Install

Clone the Repository

Clone this repo and enter it:

git clone https://github.com/ParaN3xus/typress
cd typress

Install dependencies:

We use Poetry to manage project dependencies. If you don't have Poetry installed, please follow the instructions on the Poetry installation page.

poetry install
poetry shell

Installation from PyPI is not available yet; publishing a package is on the TODO list.

Run

Development Run

Run the Typress Web Server

Make sure you are in the repo root directory, then run:

python -m typress

Production Run

Set Up .env

Create a .env file in the repo root directory with the following content:

MODEL_PATH=path/to/your/model
API_ROOT_URL=https://api.example.com/typress
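Before starting the server, it can help to confirm that both settings are present. The following is a minimal sketch, not Typress's own loading code: the helper names `parse_env` and `missing_settings` are hypothetical, and it assumes the file contains only simple `KEY=VALUE` lines.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("MODEL_PATH", "API_ROOT_URL")):
    """Return the required setting names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_file = Path(".env")  # assumed location: the repo root
    if env_file.exists():
        missing = missing_settings(parse_env(env_file.read_text(encoding="utf-8")))
        print("Missing settings:", missing if missing else "none")
    else:
        print("No .env file found in the current directory")
```

Run it from the repo root. An empty list of missing settings means both keys from the example above were found.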

Run WSGI

To run the application in production mode, it is recommended to use a production-grade WSGI server such as gunicorn:

gunicorn --bind 0.0.0.0:8000 wsgi:app
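In the command above, `wsgi:app` tells gunicorn to import the module `wsgi` and serve the WSGI callable named `app` from it. The repository's actual `wsgi.py` is not shown here, so the sketch below is only a generic illustration of what such a callable looks like and how it can be exercised without a server. The `call_once` helper is hypothetical.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Minimal WSGI callable: answers every request with a plain-text body."""
    body = b"ok\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_once(wsgi_app, path="/"):
    """Invoke a WSGI callable once in-process and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_once(app))
```

Any object with this call signature can be served by a WSGI server, which is why the same application can run under the development server or under gunicorn.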

TODO

  • Improve the tex2typ reconstruction strategy for spacing
  • Fix memory leaks in normalized formulas
  • Add formula detection
  • Explore using LoRA to fine-tune the OCR model for TeX
  • Publish to PyPI
  • Document the complete dataset construction process
  • Train using Seq2SeqTrainer

Contributing

Data Contribution

If you have a collection of Typst mathematical formulas (for example, formulas embedded in Typst documents), you can create a dataset by running the following command in the root of your Typst workspace:

python -m typress.dataset extract

Then, submit the generated out.json file to us via email at paran3xus007@gmail.com. By submitting your data to us, you agree to make your dataset publicly available.
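Before emailing the file, a quick check that it parses as JSON and is not empty can save a round trip. The structure of `out.json` is not documented here, so this sketch assumes nothing beyond valid JSON and only reports the top-level type and size. The helper name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(text):
    """Return (top-level type name, item count or None) for a JSON document."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    out_file = Path("out.json")  # file produced by the extract command
    if out_file.exists():
        kind, size = summarize_json(out_file.read_text(encoding="utf-8"))
        print(f"out.json: top-level {kind}, {size} item(s)")
    else:
        print("out.json not found in the current directory")
```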

Code Contribution

We welcome code contributions of any kind, including bug fixes and new features. If you're unsure where to start, see the TODO list above.

License

This repository is published under the MIT License. See the LICENSE file for details.

Credits

This project makes use of the following open-source projects or datasets:

  • TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models.
  • transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
  • evaluate: A library for easily evaluating machine learning models and datasets.
  • eq_query_rec: Query equations from Typst source files and reconstruct normalized equations from the query results.
  • typst.ts: Run Typst in JavaScriptWorld.
  • fusion-image-to-latex-datasets: The largest dataset to date from online sources.
  • latex-formulas: TexTeller's previous dataset.

Thanks to the developers and contributors of these projects for their hard work and dedication.

Thanks to sjfhsjfh, Naptie, and Mivik for providing Typst mathematical formula data.