TIM: Teaching LM to Translate with Comparison

Support

  • Please refer to our paper for more details.

Tips

  • [20231215] We added flash-attention for faster training. You can set --use_flash_attention to activate flash-attention.
  • [20230914] We updated the preference loss function of TIM, which makes training more stable.
  • [20230914] We fixed a bug when using the data cache (i.e., --streaming=False) for training.
  • When data streaming is turned on, it is recommended to shuffle the training data first (see the sketch after this list).
  • When training with DeepSpeed ZeRO stage 1/2, you can set --use_low_cpu_mem=True to reduce memory usage.
  • After training a model with DeepSpeed ZeRO stage 3, you need to run sft_reward_training/change_param_name.py to convert the model's parameter names before inference.
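
For the shuffling tip above, a minimal sketch, assuming the training data is stored as a JSON-lines file (the file names are illustrative):

   import random

   # Shuffle a JSON-lines training file once before enabling streaming,
   # since streamed data is read in file order.
   random.seed(42)

   with open("data/train.json", "r", encoding="utf-8") as f:
       lines = f.readlines()

   random.shuffle(lines)

   with open("data/train.shuf.json", "w", encoding="utf-8") as f:
       f.writelines(lines)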

Quick start

Environment

We develop TIM with Hugging Face's transformers and DeepSpeed-Chat.

Requirements:

  • Python 3.7.9
  • PyTorch 1.10.0+cu111
  • Transformers 4.28
  • accelerate==0.19.0
  • numpy==1.22.4
  • deepspeed==0.9.0
  • scikit-learn
  • flash-attn==2.0.1

Datasets

Data Construction for TIM

We modify add_noise.py in noisy-text.

We use the following setting in our paper:

   python add_noise.py data/example --delete_probability 0.15 --replace_probability 0.15  --filler_token '' --permutation_range 1

Then, you can run run_reward.sh to get the final training data for TIM.
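
For context, the comparison data pairs a preferred output (the clean reference) with a dispreferred one (e.g., the noised reference produced above). Below is a minimal sketch of that pairing under an illustrative JSON format; the actual files and fields produced by run_reward.sh may differ:

   import json

   # Pair each reference with its noised counterpart to form a comparison example:
   # the clean reference is the preferred ("good") output, the noised one is the
   # dispreferred ("bad") output for the same source sentence.
   with open("data/example.src") as src, \
        open("data/example.ref") as ref, \
        open("data/example.ref.noised") as bad, \
        open("data/train_comparison.json", "w") as out:
       for s, r, b in zip(src, ref, bad):
           example = {
               "source": s.strip(),
               "good": r.strip(),   # preferred translation
               "bad": b.strip(),    # noised translation used as the negative
           }
           out.write(json.dumps(example, ensure_ascii=False) + "\n")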

Instruct Tuning with TIM

We modify run_clm.py and the Trainer in transformers, and the LoRA utilities in DeepSpeed-Chat. In addition to vanilla fine-tuning of all model parameters, parameter-efficient fine-tuning methods such as prefix tuning and LoRA have been proposed specifically for large language models. We adopt three different tuning strategies, listed in ascending order of the number of fine-tuned parameters.

(1) LoRA: Tuning with Low-rank Matrices

   LORA_MODULE_NAME="query_key_value" # for BLOOM
   LORA_MODULE_NAME="q_proj,k_proj,v_proj,o_proj" # for Llama

   --only_optimize_lora    # if set, only the LoRA parameters are optimized
   --lora_dim 8  
   --lora_alpha 16 
   --lora_droppout 0.05 
   --lora_module_name ${LORA_MODULE_NAME} 
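
As a reference for what these options control, here is a minimal sketch of a low-rank update with r=8 and alpha=16; it illustrates the idea only and is not the DeepSpeed-Chat LoRA implementation used by the scripts:

   import torch.nn as nn

   class LoRALinear(nn.Module):
       # Wraps a frozen linear layer with a trainable low-rank update:
       # y = W x + (alpha / r) * B(A(x))
       def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
           super().__init__()
           self.base = base
           for p in self.base.parameters():        # freeze the pretrained weights
               p.requires_grad_(False)
           self.lora_A = nn.Linear(base.in_features, r, bias=False)
           self.lora_B = nn.Linear(r, base.out_features, bias=False)
           nn.init.zeros_(self.lora_B.weight)      # start as a no-op update
           self.scaling = alpha / r

       def forward(self, x):
           return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))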

(2) FixEmb: Tuning with Embedding Fixed

   --only_optimize_layers "9" "8" "7" "6" "5" "4" "3" "2" "1" "0" 
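
A minimal sketch of the idea behind FixEmb, assuming a Hugging Face causal LM (the checkpoint name is illustrative and the exact parameter names depend on the architecture):

   from transformers import AutoModelForCausalLM

   # Illustrative checkpoint; substitute the model you are fine-tuning.
   model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

   # FixEmb: freeze the embedding matrices so only the transformer layers
   # (e.g., those listed in --only_optimize_layers) receive gradient updates.
   for name, param in model.named_parameters():
       if "embed" in name:
           param.requires_grad_(False)

   n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
   print(f"trainable parameters: {n_trainable:,}")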

(3) Full: Tuning with Full Parameters

DeepSpeed Config

  • deepspeed_config/ds_config.json, deepspeed_config/ds_config_stage2.json, deepspeed_config/ds_config_stage3.json
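
A minimal sketch of how one of these configs can be handed to DeepSpeed from Python; the model below is a placeholder, and the repo's training scripts already wire this up:

   import deepspeed
   import torch.nn as nn

   # Placeholder model; the training scripts pass the actual LM.
   model = nn.Linear(10, 10)

   # Select the ZeRO stage by pointing at one of the provided config files.
   # Run under the deepspeed launcher so the distributed environment is set up.
   engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=[p for p in model.parameters() if p.requires_grad],
       config="deepspeed_config/ds_config_stage2.json",
   )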

Inference

   -l            # use LoRA
   --rootmodel   # if LoRA, the path of the foundation model
   --ifhint      # add a hint indicating there are no mistakes in the hypothesis
   --ifsample    # if true, use sampling; otherwise use beam search for inference
   --ifreranking # use the preference score to select the preferred hypothesis among the candidates
   --vocab       # the dictionary for dict-guided inference
   --reverse     # whether to reverse the source and target languages when loading the dictionary
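
For --ifreranking, the idea is to score each candidate with the learned preference signal and keep the best one. A minimal sketch under that assumption, where score_fn is a placeholder for the model's preference score:

   def rerank(candidates, score_fn):
       # score_fn maps a hypothesis string to a scalar preference score
       # (higher = preferred); it stands in for the model's preference output.
       scores = [score_fn(c) for c in candidates]
       return candidates[max(range(len(scores)), key=scores.__getitem__)]

   # Usage: keep the preferred hypothesis among sampled/beam-search candidates.
   # best = rerank(["hyp 1", "hyp 2", "hyp 3"], score_fn=my_preference_score)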

Experimental Results

We evaluate TIM's performance on the WMT and FLORES-200 dev-test tasks, comprising four language pairs.

Citation

Please kindly cite our paper if you find it helpful:

   @inproceedings{zeng2023tim,
     title     = {TIM: Teaching LM to Translate with Comparison},
     author    = {Jiali Zeng and Fandong Meng and Yongjing Yin and Jie Zhou},
     booktitle = {ArXiv},
     year      = {2023},
     url       = {https://arxiv.org/pdf/2307.04408.pdf}
   }