Vary-tiny-600k

Vary-tiny codebase built upon LAVIS (for training from scratch) and a dataset of PDF image-text pairs (about 600k, covering English and Chinese)

Vary-600k

Background

  • The Huggingface version of Vary-tiny has potential issues that make the loss hard to converge when training for multiple epochs.
  • Many people are interested in the training data of Vary.

Release

  • [2024/9/03] 🔥🔥🔥 We release GOT-OCR2.0, a very strong and comprehensive OCR model.
  • [2024/4/21] 🔥🔥🔥 For OneChart, we have released the web demo on the Project Page. Have fun!!
  • [2024/4/21] 🔥🔥🔥 We present a Vary-tiny LAVIS codebase and the Vary-600k dataset!!!

Contents

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only.

Install

  1. Clone this repository and navigate to the Vary-tiny-600k folder
git clone https://github.com/Ucas-HaoranWei/Vary-tiny-600k.git
cd LAVIS-main
  2. Install Package
pip install -e .
  3. Prepare Pretrain Weights and Data
    • Download the OPT-125M weights here and the SAM-b weights here.
    • Download Vary-600k here with code "vary".
    • Prepare the directories as follows:
    [image: directory layout]

Train

python -m torch.distributed.run --nproc_per_node=8 --master_port=29501 train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml

Or, on multiple machines:

python -m torch.distributed.run --master_addr xxx --master_port xxx --node_rank xxx --nnodes xxx --nproc_per_node xxx  train.py --cfg-path lavis/projects/varytiny/train/pretrain.yaml
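When the same job is launched on several nodes, it can help to assemble the command from variables instead of editing the "xxx" placeholders by hand. The following is a minimal sketch, not part of this repository: the helper name `build_launch_command` and the address `10.0.0.1` are hypothetical, and only flags that already appear in the commands above are used.

```python
import shlex


def build_launch_command(cfg_path, nproc_per_node, master_port,
                         master_addr=None, node_rank=None, nnodes=None):
    """Return the torch.distributed.run command as a list of arguments.

    Hypothetical helper for illustration; it only rearranges the flags
    shown in the README commands.
    """
    cmd = ["python", "-m", "torch.distributed.run"]
    # Multi-machine flags are added only when all three are given.
    if master_addr is not None and node_rank is not None and nnodes is not None:
        cmd += ["--master_addr", str(master_addr),
                "--node_rank", str(node_rank),
                "--nnodes", str(nnodes)]
    cmd += ["--nproc_per_node", str(nproc_per_node),
            "--master_port", str(master_port),
            "train.py", "--cfg-path", cfg_path]
    return cmd


CFG = "lavis/projects/varytiny/train/pretrain.yaml"

# Single machine with 8 GPUs, as in the first command.
single = build_launch_command(CFG, nproc_per_node=8, master_port=29501)

# Node 0 of a 2-node, 8-GPU-per-node job; the address is a placeholder.
multi = build_launch_command(CFG, nproc_per_node=8, master_port=29501,
                             master_addr="10.0.0.1", node_rank=0, nnodes=2)

print(shlex.join(single))
print(shlex.join(multi))
```

The printed lines can be pasted into a shell on each node, changing only `node_rank` between nodes.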

If your training goes smoothly, your loss (end of each epoch) will be similar to the following (2×8 H800):

[image: loss at the end of each epoch]

Demo

  1. Replace the "pretrained" and "finetuned" paths with your checkpoints in "LAVIS-main/lavis/configs/models/varytiny/varytiny_inference.yaml", such as:
     [image: example config]
  2. Run inference on an image:
python tests/models/test_varytiny.py --image-file xxx.jpg
  3. We also provide the weights of Vary-tiny trained from scratch on Vary-600k: Vary-tiny-600k.pth. Code: "Vary". You can use them to run inference directly.

Vary-600k

  • Vary-600k is a PDF image-text pair dataset with about 300k English and 300k Chinese pages.
  • The dataset is extracted using Fitz (PyMuPDF). A BERT model is used to merge sentences within paragraphs. Paragraphs are separated by "<lb>". We do not use "\n" as the separator because this codebase uses "\n" as the "EOS" of opt-125m.
  • You can use Vary-600k for pretraining, warm-up, and so on.
  • Note that Vary-600k is only a subset of the pretraining data used in the original Vary.
  • Download Vary-600k here. Code: "Vary"
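The separator convention described above can be illustrated with a short sketch. It assumes that a page's text is a single string with paragraphs joined by "<lb>" and that "\n" is appended once as the end-of-sequence marker. The function names are hypothetical, and the exact annotation format of Vary-600k may differ.

```python
PARAGRAPH_SEP = "<lb>"  # paragraph separator described for Vary-600k
EOS = "\n"              # this codebase uses "\n" as the EOS of opt-125m


def join_paragraphs(paragraphs):
    """Join paragraphs with "<lb>" and append a single EOS.

    Newlines inside a paragraph are replaced by spaces so that the only
    "\n" in the result is the final EOS (an assumption for this sketch).
    """
    cleaned = [p.replace("\n", " ").strip() for p in paragraphs]
    return PARAGRAPH_SEP.join(p for p in cleaned if p) + EOS


def split_paragraphs(text):
    """Recover the paragraph list from a "<lb>"-separated page text."""
    body = text[:-len(EOS)] if text.endswith(EOS) else text
    return [p.strip() for p in body.split(PARAGRAPH_SEP) if p.strip()]


page = ["First paragraph,\nwrapped over two lines.", "Second paragraph."]
target = join_paragraphs(page)
print(repr(target))
print(split_paragraphs(target))
```

Because "\n" appears only at the end of the target, a model trained on such strings never sees a mid-page EOS, which is the reason the text gives for using "<lb>" between paragraphs.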

Acknowledgement

  • LAVIS: the codebase we built upon!

Citation

If you find our work useful in your research, please consider citing Vary:

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

@article{wei2024small,
  title={Small Language Model Meets with Reinforced Vision Vocabulary},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yu, En and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2401.12503},
  year={2024}
}