/TexTeller

TexTeller can convert image to latex formulas (image2latex, latex OCR) with higher accuracy and exhibits superior generalization ability, enabling it to cover most usage scenarios.

Primary LanguagePythonApache License 2.0Apache-2.0

๐Ÿ“„ English | ไธญๆ–‡

๐šƒ๐šŽ๐šก๐šƒ๐šŽ๐š•๐š•๐šŽ๐š›

demo.mp4

TexTeller is an end-to-end formula recognition model based on TrOCR, capable of converting images into corresponding LaTeX formulas.

TexTeller was trained with 80M image-formula pairs (previous dataset can be obtained here), compared to LaTeX-OCR which used a 100K dataset, TexTeller has stronger generalization abilities and higher accuracy, covering most use cases.

Note

If you would like to provide feedback or suggestions for this project, feel free to start a discussion in the Discussions section.

Additionally, if you find this project helpful, please don't forget to give it a starโญ๏ธ๐Ÿ™๏ธ


๐Ÿ”– Table of Contents

Images that can be recognized by TexTeller

Thanks to the Super Computing Platform of Beijing University of Posts and Telecommunications for supporting this work๐Ÿ˜˜

๐Ÿ”„ Change Log

  • ๐Ÿ“ฎ[2024-06-06] TexTeller3.0 released! The training data has been increased to 80M (10x more than TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:

    • Support scanned image, handwritten formulas, English(Chinese) mixed formulas.

    • OCR abilities in both Chinese and English for printed images.

  • ๐Ÿ“ฎ[2024-05-02] Support paragraph recognition.

  • ๐Ÿ“ฎ[2024-04-12] Formula detection model released!

  • ๐Ÿ“ฎ[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated superior performance in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.

    Here are more test images and a horizontal comparison of various recognition models.

๐Ÿš€ Getting Started

  1. Clone the repository:

    git clone https://github.com/OleehyO/TexTeller
  2. Install the project's dependencies:

    pip install texteller
  3. Enter the src/ directory and run the following command in the terminal to start inference:

    python inference.py -img "/path/to/image.{jpg,png}" 
    # use --inference-mode option to enable GPU(cuda or mps) inference
    #+e.g. python inference.py -img "img.jpg" --inference-mode cuda

    The first time you run it, the required checkpoints will be downloaded from Hugging Face.

Paragraph Recognition

As demonstrated in the video, TexTeller is also capable of recognizing entire text paragraphs. Although TexTeller has general text OCR capabilities, we still recommend using paragraph recognition for better results:

  1. Download the weights of the formula detection model to thesrc/models/det_model/model/directory

  2. Run inference.py in the src/ directory and add the -mix option, the results will be output in markdown format.

    python inference.py -img "/path/to/image.{jpg,png}" -mix

TexTeller uses the lightweight PaddleOCR model by default for recognizing both Chinese and English text. You can try using a larger model to achieve better recognition results for both Chinese and English:

Checkpoints Model Description Size
ch_PP-OCRv4_det.onnx Default detection model, supports Chinese-English text detection 4.70M
ch_PP-OCRv4_server_det.onnx High accuracy model, supports Chinese-English text detection 115M
ch_PP-OCRv4_rec.onnx Default recoginition model, supports Chinese-English text recognition 10.80M
ch_PP-OCRv4_server_rec.onnx High accuracy model, supports Chinese-English text recognition 90.60M

Place the weights of the recognition/detection model in the det/ or rec/ directories within src/models/third_party/paddleocr/checkpoints/, and rename them to default_model.onnx.

Note

Paragraph recognition cannot restore the structure of a document, it can only recognize its content.

๐ŸŒ Web Demo

Go to the src/ directory and run the following command:

./start_web.sh

Enter http://localhost:8501 in a browser to view the web demo.

Note

  1. For Windows users, please run the start_web.bat file.
  2. When using onnxruntime + GPU for inference, you need to install onnxruntime-gpu.

๐Ÿ” Formula Detection

TexTellerโ€™s formula detection model is trained on 3,415 images of Chinese educational materials (with over 130 layouts) and 8,272 images from the IBEM dataset, and it supports formula detection across entire images.

  1. Download the model weights and place them in src/models/det_model/model/ [link].

  2. Run the following command in the src/ directory, and the results will be saved in src/subimages/

Advanced: batch formula recognition

After formula detection, run the following command in the src/ directory:

python rec_infer_from_crop_imgs.py

This will use the results of the previous formula detection to perform batch recognition on all cropped formulas, saving the recognition results as txt files in src/results/.

๐Ÿ“ก API Usage

We use ray serve to provide an API interface for TexTeller, allowing you to integrate TexTeller into your own projects. To start the server, you first need to enter the src/ directory and then run the following command:

python server.py
Parameter Description
-ckpt The path to the weights file,default is TexTeller's pretrained weights.
-tknz The path to the tokenizer,default is TexTeller's tokenizer.
-port The server's service port,default is 8000.
--inference-mode Whether to use "cuda" or "mps" for inference,default is "cpu".
--num_beams The number of beams for beam search,default is 1.
--num_replicas The number of service replicas to run on the server,default is 1 replica. You can use more replicas to achieve greater throughput.
--ncpu_per_replica The number of CPU cores used per service replica,default is 1.
--ngpu_per_replica The number of GPUs used per service replica,default is 1. You can set this value between 0 and 1 to run multiple service replicas on one GPU to share the GPU, thereby improving GPU utilization. (Note, if --num_replicas is 2, --ngpu_per_replica is 0.7, then 2 GPUs must be available)
-onnx Perform inference using Onnx Runtime, disabled by default

Note

A client demo can be found at src/client/demo.py, you can refer to demo.py to send requests to the server

๐Ÿ‹๏ธโ€โ™‚๏ธ Training

Dataset

We provide an example dataset in the src/models/ocr_model/train/dataset/ directory, you can place your own images in the images/ directory and annotate each image with its corresponding formula in formulas.jsonl.

After preparing your dataset, you need to change the DIR_URL variable to your own dataset's path in **/train/dataset/loader.py

Retraining the Tokenizer

If you are using a different dataset, you might need to retrain the tokenizer to obtain a different vocabulary. After configuring your dataset, you can train your own tokenizer with the following command:

  1. In src/models/tokenizer/train.py, change new_tokenizer.save_pretrained('./your_dir_name') to your custom output directory

    If you want to use a different vocabulary size (default 15K), you need to change the VOCAB_SIZE variable in src/models/globals.py

  2. In the src/ directory, run the following command:

    python -m models.tokenizer.train

Training the Model

  1. Modify num_processes in src/train_config.yaml to match the number of GPUs available for training (default is 1).

  2. In the src/ directory, run the following command:

    accelerate launch --config_file ./train_config.yaml -m models.ocr_model.train.train

You can set your own tokenizer and checkpoint paths in src/models/ocr_model/train/train.py (refer to train.py for more information). If you are using the same architecture and vocabulary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.

In src/globals.py and src/models/ocr_model/train/train_args.py, you can change the model's architecture and training hyperparameters.

Note

Our training scripts use the Hugging Face Transformers library, so you can refer to their documentation for more details and configurations on training parameters.

๐Ÿ“… Plans

  • Train the model with a larger dataset
  • Recognition of scanned images
  • Support for English and Chinese scenarios
  • Handwritten formulas support
  • PDF document recognition
  • Inference acceleration
  • ...

โญ๏ธ Stargazers over time

Stargazers over time

๐Ÿ‘ฅ Contributors