
LLM-Pretrain-SFT

Scripts for LLM pretraining and fine-tuning (SFT)

LoRA supported

The repository is based on tatsu-lab/stanford_alpaca.

Supported LLMs

LLaMA, Baichuan/Baichuan2, and Mistral are covered by the per-model training scripts (see the file tree below).

Pretraining (Continual Pretraining)

  1. Before you start continual pre-training, provide the model name (from Hugging Face) or a local model path.

  2. Prepare the training data. Plain text in Markdown or txt format works for pretraining; the included example is A Guide to Writing the NeurIPS Impact Statement. You can add more text corpora to the data folder (see the data-packing sketch after this list).

  3. Launch

pip install -r requirements.txt
cd llm_pretrain_clean
./pretrain_llama.sh

Note that some parameter settings differ between these models.
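
For intuition, here is a minimal, hypothetical sketch of how plain-text corpora are typically packed into fixed-length blocks for causal-LM pretraining. It is not the repository's generate_pretrain_data.py (whose exact behavior may differ), and the model name and block size below are placeholder assumptions:

# Hypothetical data-packing sketch; the repository's generate_pretrain_data.py
# may work differently. Requires the `transformers` package.
from pathlib import Path
from transformers import AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; substitute your HF model name or local path
BLOCK_SIZE = 2048     # placeholder context length

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Concatenate every Markdown/txt file in the data folder into one token stream.
ids = []
for path in sorted(Path("data").glob("*")):
    if path.suffix in {".md", ".txt"}:
        ids.extend(tokenizer(path.read_text(encoding="utf-8"))["input_ids"])

# Cut the stream into fixed-length blocks; for next-token prediction the
# labels are simply a copy of the inputs.
blocks = [ids[i:i + BLOCK_SIZE] for i in range(0, len(ids) - BLOCK_SIZE + 1, BLOCK_SIZE)]
dataset = [{"input_ids": b, "labels": list(b)} for b in blocks]
print(f"Packed {len(dataset)} blocks of {BLOCK_SIZE} tokens")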

SFT

  1. Before you start fine-tuning, provide the model name (from Hugging Face) or a local model path.

  2. Prepare the training data. You can add your own task data following the example in sft_examples.json, whose format is similar to alpaca_data.json.

The format is as follows:

{
    "binary_selection": [
        {
            "instruction": "Does the following text violate the law?\nText: OH MY FUCKING GOD",
            "output": "No"
        },
        ...
    ],
    "another_task_name": [
        {
            "instruction": "How are you?",
            "output": "Not bad."
        },
        ...
    ],
    ...
}

Note that if you put alpaca_data.json in the data folder, the script will use it as part of the training data.
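
As a rough illustration of how this task-keyed format relates to the flat alpaca_data.json layout (this is only a sketch, not the repository's generate_sft_data.py), the tasks can be flattened into one list of instruction/output records:

# Hypothetical sketch: flatten the task-keyed format above into a flat,
# alpaca_data.json-style list; generate_sft_data.py may do more than this.
import json

with open("data/sft_examples.json", encoding="utf-8") as f:
    tasks = json.load(f)

examples = [
    {"instruction": ex["instruction"], "input": "", "output": ex["output"]}
    for items in tasks.values()
    for ex in items
]
print(f"Loaded {len(examples)} examples from {len(tasks)} tasks")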

  3. Launch

Full Parameters

pip install -r requirements.txt
cd sft_model_clean
./train_llama.sh

LoRA

pip install -r requirements.txt
cd sft_model_clean
./train_baichuan_LORA.sh

You can adjust the configurations in train_lora.py. In our experiments with Baichuan, your transformers version should be >= 4.29.0 and < 4.34.0.
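
For orientation, a minimal LoRA setup with the peft library looks roughly like the sketch below. The hyperparameters, model path, and target module name are illustrative assumptions, not necessarily what train_lora.py uses (for Baichuan models, the fused query/key/value projection is commonly named W_pack):

# Illustrative LoRA sketch using the peft library; the actual values in
# train_lora.py may differ.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base",  # placeholder model path
    trust_remote_code=True,            # Baichuan ships custom modeling code
)
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["W_pack"],  # assumed name of Baichuan's fused q/k/v projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

Because only the adapter matrices receive gradients, LoRA needs far less optimizer memory than the full-parameter runs above.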

Note that some parameter settings differ between these models.

File Tree

.
├── LICENSE
├── README.md
├── llm_pretrain_clean
│   ├── data
│   │   └── A_Guide_to_Writing_the_NeurIPS_Impact_Statement.md
│   ├── evaluation
│   │   └── inference_single.py
│   ├── generate_pretrain_data.py
│   ├── pretrain.py
│   ├── pretrain_baichuan2.sh
│   ├── pretrain_llama.sh
│   ├── pretrain_mistral.sh
│   ├── requirementsX.txt
│   └── utils.py
└── sft_model_clean
    ├── README.md
    ├── configs
    │   └── default_offload_opt_param.json
    ├── data
    │   ├── alpaca_data.json
    │   └── sft_examples.json
    ├── evaluation
    │   └── inference_single.py
    ├── generate_sft_data.py
    ├── requirementsX.txt
    ├── train.py
    ├── train_baichuan.sh
    ├── train_baichuan_LORA.sh
    ├── train_llama.sh
    ├── train_lora.py
    ├── train_mistral.sh
    └── utils.py