We have extended the code of the DeepSpeed-Chat project to support several new features, listed below.
- We propose hybrid alignment training to improve the LLM (./examples/hybrid_alignment_training).
- Add extra losses for step-3 RLHF, such as the SFT loss and the pre-training loss (./examples/add_extra_loss_for_rlhf); a minimal sketch follows this list.
- Support DPO as step 2 (./examples/dpo).
- Implement ESRL features for efficient training in step 3 (./examples/esrl).
- Support COMET models as reward models in step-3 RLHF (./examples/rlhf_with_comet_reward).
- Support training reward models directly on scores instead of pairwise data only (./examples/training_reward_with_scores).
See ./examples for more details.
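For the extra-loss feature, here is a minimal sketch of the idea, assuming a simple weighted sum; the function, argument names, and coefficients are illustrative assumptions, not the project's actual API:

```python
# Hypothetical sketch: in step-3 RLHF, the actor loss can be augmented with
# auxiliary SFT and pre-training (ptx) language-modeling losses. All names
# and default coefficients here are illustrative assumptions.
import torch

def combined_actor_loss(rl_loss: torch.Tensor,
                        sft_loss: torch.Tensor,
                        ptx_loss: torch.Tensor,
                        sft_coef: float = 0.1,
                        ptx_coef: float = 0.1) -> torch.Tensor:
    # Weighted sum: the RL (PPO) objective plus the extra losses.
    return rl_loss + sft_coef * sft_loss + ptx_coef * ptx_loss
```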
You can use Anaconda/Miniconda to install the packages needed for this project:

```bash
conda env create -f conda-env.yml
conda activate dschat
pip install -r requirements.txt
```
Run the training steps with the provided scripts:

```bash
# Step 1: supervised fine-tuning (SFT)
bash scripts/sft.sh
# Step 2: reward model training
bash scripts/reward.sh
# Step 2 (alternative): DPO
bash examples/dpo/train.sh
# Step 3: RLHF
bash scripts/rlhf.sh
```
The following models are currently supported:

Model | Model size |
---|---|
Baichuan | 7B/13B |
Baichuan2 | 7B/13B |
LLaMA | 7B/13B/33B/65B |
LLaMA-2 | 7B/13B/70B |
Yi | 6B/34B |
The dataset for SFT should consist of txt files, including `train.txt` and `test.txt`, with `sft` in the path (e.g., `/your/path/to/sft_dataset/train.txt`). Each line contains a JSON string, as in the example below.
Example:

```json
{"instruction": "User: Your task is to ... \nAssistant: ", "input": "...", "output": "..."}
...
```
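A minimal sketch of loading this format, assuming `input` is appended to `instruction` to form the prompt (a hypothetical helper, not the project's actual data pipeline):

```python
# Hypothetical loader for the one-JSON-object-per-line SFT files described above.
import json

def load_sft_dataset(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            # Assumption: the prompt is the instruction followed by the optional input.
            prompt = ex["instruction"] + ex.get("input", "")
            examples.append({"prompt": prompt, "output": ex["output"]})
    return examples

train_data = load_sft_dataset("/your/path/to/sft_dataset/train.txt")
```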
We also support SFT training with multi-turn dialogues. The corresponding dataset likewise contains one JSON string per line, as shown in the example below.
Example:

```json
{
    "instruction": "User: Your task is to ... \nAssistant: ",
    "input": "...",
    "output": "...",
    "history": [
        ["user instruction in the first round (optional)", "model response in the first round (optional)"],
        ["user instruction in the second round (optional)", "model response in the second round (optional)"],
        ...
    ]
}
...
```
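A hedged sketch of flattening a multi-turn example into a single training prompt; the `User:`/`Assistant:` turn template is inferred from the examples above and may differ from the project's actual preprocessing:

```python
# Hypothetical helper that flattens a multi-turn example into one prompt string.
def flatten_multi_turn(example):
    parts = []
    for user_turn, model_turn in example.get("history", []):
        parts.append(f"User: {user_turn}\nAssistant: {model_turn}\n")
    # The final turn; assumption: "input" is appended to "instruction".
    parts.append(example["instruction"] + example.get("input", ""))
    return "".join(parts)
```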
The dataset for Reward/DPO should consist of parquet files, including `train.parquet` and `test.parquet`, with `reward` in the path (e.g., `/your/path/to/reward_dataset/train.parquet`). Each entry contains four keys, as in the example below.
Example:
prompt | response | chosen | rejected |
---|---|---|---|
User: What are some of the challenges with usi... | Some of the challenges with using machine lear... | Some of the challenges with using machine lear... | Machine learning is a very powerful tool. |
User: Looking for an essay by a contemporary m... | I believe you're thinking of Bernard-Henri Lévy. | I believe you're thinking of Bernard-Henri Lévy. | Laclau maybe? |
... | ... | ... | ... |
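To inspect or load this format, you can read the parquet files with pandas (our choice of reader here, not a project requirement; it needs a parquet engine such as pyarrow):

```python
# Read the Reward/DPO parquet data and collect preference pairs.
# Column names follow the four keys shown in the table above.
import pandas as pd

df = pd.read_parquet("/your/path/to/reward_dataset/train.parquet")
print(df.columns.tolist())  # ['prompt', 'response', 'chosen', 'rejected']

# A reward model is trained to score `chosen` above `rejected` for each prompt.
pairs = [
    {"prompt": r.prompt, "chosen": r.chosen, "rejected": r.rejected}
    for r in df.itertuples(index=False)
]
```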
The dataset for step-3 RLHF is the same as for SFT, except that the path contains `rlhf`, e.g., `/your/path/to/rlhf_dataset/train.txt`.
You can use the Python inference script as shown in `./scripts/predict.sh`. Each input line should be in the format `{Input} ||| {None/Reference}`, and each output line will be in the format `{Input} ||| {ModelOutput} ||| {None/Reference}`, as in the example below.
Example:

`input.txt`:

```
User: What are the names of some famous actors ...\nAssistant: ||| Some famous ...
User: ... ||| None
... ||| ...
```

`output.txt`:

```
User: What are the names of some famous actors ...\nAssistant: ||| 1. Denzel Washington ... ||| Some famous ...
User: ... ||| ... ||| None
... ||| ... ||| ...
```
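A small sketch of parsing the output format; the ` ||| ` separator comes from the examples above, while the helper itself and its handling of `None` are illustrative assumptions:

```python
# Parse one line of the predict output: {Input} ||| {ModelOutput} ||| {None/Reference}.
def parse_output_line(line):
    inp, model_output, reference = line.rstrip("\n").split(" ||| ")
    return {
        "input": inp,
        "model_output": model_output,
        "reference": None if reference == "None" else reference,
    }

with open("output.txt", encoding="utf-8") as f:
    records = [parse_output_line(line) for line in f]
```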
Thanks to the DeepSpeed-Chat project and its contributors❤️❤️❤️!