/formula-recognition-OCR

📜 (OCR) Recognizing LaTeX-format text in equation images


Formula Image LaTeX Recognition


➗ LaTeX Recognition Task

Competition Overview

Formula recognition (LaTeX Recognition) is the task of recognizing LaTeX-format text in formula images. Unlike character recognition, it requires learning not only the left → right reading order but also the top → bottom order patterns of multi-line expressions.


📁 File Structure

Code Folder

ocr_teamcode/
│
├── config/                   # train argument config file
│   ├── Attention.yaml
│   └── SATRN.yaml
│
├── data_tools/               # utils for dataset
│   ├── download.sh           # dataset download script
│   ├── extract_tokens.py     # extract tokens from token.txt
│   ├── make_dataset.py       # sample dataset
│   ├── parse_upstage.py      # convert JSON ground truth file to ICDAR15 format
│   └── train_test_split.py   # split dataset into train and test dataset
│
├── networks/                 # network, loss
│   ├── Attention.py
│   ├── SATRN.py
│   ├── loss.py
│   └── spatial_transformation.py
│
├── checkpoint.py             # save, load checkpoints
├── pre_processing.py         # preprocess images with OpenCV
├── custom_augment.py         # image augmentations
├── transform.py
├── dataset.py
├── flags.py                  # parse yaml to FLAG format
├── inference.py              # inference
├── metrics.py                # calculate evaluation metrics
├── scheduler.py              # learning rate scheduler
├── train.py                  # train
└── utils.py                  # utils for training

Dataset Folder

input/data/train_dataset
│
├── images/                 # input image folder
│   ├── train_00000.jpg
│   ├── train_00001.jpg
│   ├── train_00002.jpg
│   └── ...
│
├── gt.txt                  # input data
├── level.txt               # formula difficulty feature
├── source.txt              # printed / handwritten feature
└── tokens.txt              # vocabulary for training

✨ Getting Started

Installation

pip install -r requirements.txt
  • scikit_image==0.14.1
  • opencv_python==3.4.4.19
  • tqdm==4.28.1
  • torch==1.7.1+cu101
  • torchvision==0.8.2+cu101
  • scipy==1.2.0
  • numpy==1.15.4
  • pillow==8.2.0
  • tensorboardX==1.5
  • editdistance==0.5.3
  • python-dotenv==0.17.1
  • wandb==0.10.30
  • adamp==0.3.0

Download Dataset

sh filename.sh

Dataset Setting

📌 Place the training data as shown in Dataset Folder.

📌 In single-column txt files, records are separated by \n. In txt files with two or more columns, columns are separated by \t and records by \n.

The training data consists of four files (tokens.txt, gt.txt, level.txt, source.txt) and an image folder.

Of these, tokens.txt and gt.txt are the input files required for model training, while level.txt and source.txt hold per-image metadata that is used when splitting the dataset.

  • tokens.txt is the vocabulary file used for training; it defines the tokens the model needs.

    O
    \prod
    \downarrow
    ...
    
  • gt.txt is the file actually used for training. Its columns are the image path and the ground truth written in LaTeX.

    train_00000.jpg	4 \times 7 = 2 8
    train_00001.jpg	a ^ { x } > q
    train_00002.jpg	8 \times 9
    ...
    
  • level.txt holds the difficulty of each formula. Its columns are the image path and the difficulty level: 1 (elementary school), 2 (middle school), 3 (high school), 4 (university), 5 (beyond university).

    train_00000.jpg	1
    train_00001.jpg	2
    train_00002.jpg	2
    ...
    
  • source.txt records how each image was produced. Its columns are the image path and the source: 0 (printed), 1 (handwritten).

    train_00000.jpg	1
    train_00001.jpg	0
    train_00002.jpg	0
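The file formats above can be read with a few lines of standard-library Python. This is a minimal sketch, not the repository's dataset.py: the function names are hypothetical, the in-memory files stand in for the real ones, and splitting the ground truth on single spaces is an assumption based on the sample rows.

```python
import io


def read_tokens(fileobj):
    """Single-column file: one token per line."""
    return [line.rstrip("\n") for line in fileobj if line.strip()]


def read_columns(fileobj):
    """Multi-column file: tab between columns, newline between records."""
    rows = []
    for line in fileobj:
        line = line.rstrip("\n")
        if line:
            rows.append(line.split("\t"))
    return rows


# In-memory stand-ins for the real files, built from the sample rows above.
tokens_txt = io.StringIO("O\n\\prod\n\\downarrow\n")
gt_txt = io.StringIO(
    "train_00000.jpg\t4 \\times 7 = 2 8\n"
    "train_00001.jpg\ta ^ { x } > q\n"
)
level_txt = io.StringIO("train_00000.jpg\t1\ntrain_00001.jpg\t2\n")

# Token ids follow line order in tokens.txt (an assumption of this sketch).
token_to_id = {token: i for i, token in enumerate(read_tokens(tokens_txt))}
ground_truth = {path: latex.split(" ") for path, latex in read_columns(gt_txt)}
levels = {path: int(level) for path, level in read_columns(level_txt)}

print(token_to_id)
print(ground_truth["train_00000.jpg"])
print(levels)
```

With the real dataset, each `io.StringIO(...)` would be replaced by `open(path, encoding="utf-8")`; source.txt can be read the same way as level.txt.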
    

Create .env for wandb

When wandb logging is enabled, define the arguments that must be passed to wandb in a .env file.

PROJECT="[wandb project name]"
ENTITY="[wandb nickname]"
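As a rough illustration of how these two values might reach wandb, the sketch below collects them into keyword arguments for `wandb.init`. The helper name `wandb_init_kwargs` and the example values are assumptions; how train.py actually consumes the .env file is not shown in this README.

```python
def wandb_init_kwargs(env, run_name):
    """Build keyword arguments for wandb.init from environment-style values."""
    missing = [key for key in ("PROJECT", "ENTITY") if not env.get(key)]
    if missing:
        raise KeyError("missing in .env: " + ", ".join(missing))
    return {"project": env["PROJECT"], "entity": env["ENTITY"], "name": run_name}


# Made-up values standing in for the contents of a real .env file.
example_env = {"PROJECT": "formula-ocr", "ENTITY": "my-team"}
kwargs = wandb_init_kwargs(example_env, run_name="sample_run")
print(kwargs)
```

With python-dotenv installed, calling `load_dotenv()` copies the .env entries into `os.environ`, which can then be passed as `env`; `wandb.init(**kwargs)` would start the run.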

Config Setting

The config file used for training is a YAML file. Set it according to your training goal, as in the following example.

network: SATRN
input_size: # resize image
  height: 48
  width: 192
SATRN:
  encoder:
    hidden_dim: 300
    filter_dim: 1200
    layer_num: 6
    head_num: 8

    shallower_cnn: True # shallow CNN
    adaptive_gate: True # A2DPE
    conv_ff: True # locality-aware feedforward
    separable_ff: True # only if conv_ff is True
  decoder:
    src_dim: 300
    hidden_dim: 300
    filter_dim: 1200
    layer_num: 3
    head_num: 8

checkpoint: "" # load checkpoint
prefix: "./log/satrn" # log folder name

data:
  train: # train dataset file path
    - "/opt/ml/input/data/train_dataset/gt.txt"
  test: # validation dataset file path
    -
  token_paths: # token file path
    - "/opt/ml/input/data/train_dataset/tokens.txt" # 241 tokens
  dataset_proportions: # proportion of data to take from train (not test)
    - 1.0
  random_split: True # if True, random split from train files
  test_proportions: 0.2 # only if random_split is True, create validation set
  crop: True # center crop image
  rgb: 1 # 3 for color, 1 for greyscale

batch_size: 16
num_workers: 8
num_epochs: 200
print_epochs: 1 # print interval
dropout_rate: 0.1
teacher_forcing_ratio: 0.5 # teacher forcing ratio
teacher_forcing_damp: 5e-3 # teacher forcing decay (0 to turn off)
max_grad_norm: 2.0 # gradient clipping
seed: 1234
optimizer:
  optimizer: AdamP
  lr: 5e-4
  weight_decay: 1e-4
  selective_weight_decay: True # no decay in norm and bias
  is_cycle: True # cyclic learning rate scheduler
label_smoothing: 0.2 # label smoothing factor (0 to off)

patience: 30 # stop train after waiting (-1 for off)
save_best_only: True # save best model only

fp16: True # mixed precision

wandb:
  wandb: True # wandb logging
  run_name: "sample_run" # wandb project run name
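A nested config like this is often easier to use with attribute access (`options.input_size.height`). The sketch below shows one way to do that with the standard library; it is not the repository's flags.py, and `to_namespace` is a hypothetical name. The dict stands in for what `yaml.safe_load` would return. Note that PyYAML follows YAML 1.1 and reads a value such as `5e-4` (no decimal point) as a string, so the sketch casts it explicitly.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively turn nested dicts into attribute-accessible namespaces."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_namespace(v) for v in value]
    return value


# A fragment of the config above, written as the dict a YAML loader might return.
raw = {
    "network": "SATRN",
    "input_size": {"height": 48, "width": 192},
    "optimizer": {"optimizer": "AdamP", "lr": "5e-4"},  # string, as PyYAML loads it
    "data": {"train": ["/opt/ml/input/data/train_dataset/gt.txt"]},
}

options = to_namespace(raw)
learning_rate = float(options.optimizer.lr)  # explicit cast from "5e-4"
print(options.network, options.input_size.height, learning_rate)
```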

⏩ Usage

Train

python train.py [--config_file]
  • --config_file: path to the config file

Inference

python inference.py [--checkpoint] [--max_sequence] [--batch_size] [--file_path] [--output_dir]
  • --checkpoint: path to the checkpoint file
  • --max_sequence: maximum sequence length during inference
  • --batch_size: batch size
  • --file_path: path to the test dataset
  • --output_dir: directory where inference results are saved
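For readers who want to see how such options are typically declared, here is a sketch using `argparse`. Only the flag names come from the list above; the types, the default values, `required=True`, and the example checkpoint name are assumptions and may not match inference.py.

```python
import argparse


def build_parser():
    """Declare the inference options listed above (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Inference options (sketch)")
    parser.add_argument("--checkpoint", type=str, required=True,
                        help="path to the checkpoint file")
    parser.add_argument("--max_sequence", type=int, default=230,
                        help="maximum sequence length during inference")
    parser.add_argument("--batch_size", type=int, default=8,
                        help="batch size")
    parser.add_argument("--file_path", type=str, default="",
                        help="path to the test dataset")
    parser.add_argument("--output_dir", type=str, default="submit",
                        help="directory where inference results are saved")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--checkpoint", "best.pth", "--batch_size", "16"])
print(args.checkpoint, args.max_sequence, args.batch_size, args.output_dir)
```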

🚀 Demo

[Demo image]

📖 References

  • On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention, Lee et al., 2019
  • Bag of Tricks for Image Classification with Convolutional Neural Networks, He et al., 2018
  • Averaging Weights Leads to Wider Optima and Better Generalization, Izmailov et al., 2018
  • CSTR: Revisiting Classification Perspective on Scene Text Recognition, Cai et al., 2021
  • Improvement of End-to-End Offline Handwritten Mathematical Expression Recognition by Weakly Supervised Learning, Truong et al., 2020
  • ELECTRA: Pre-training Text Encoders As Discriminators Rather Than Generators, Clark et al., 2020
  • SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition, Qiao et al., 2020
  • Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition, Fang et al., 2021
  • Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al., 2016

👩‍💻 Contributors

김종영 민지원 박소현 배수민 오세민 최재혁

✅ License

Distributed under the MIT License. See LICENSE for more information.