
synth. is a framework for generating synthetic instructions to enhance LLM training.


synth. 🧙🏼 - pipeline for generating synthetic instructions for LLM training




Usage

Step 1. Clone the repository and install the requirements

# Clone the repository
git clone git@github.com:aboros98/synth.git

# (Optional) Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

# Install the requirements
pip install -r requirements.txt

Step 2. Export the API key - Set your API key based on the engine you are using:

export [OPENAI_API_KEY | TOGETHER_API_KEY | ANTHROPIC_API_KEY | MISTRAL_API_KEY]=<your_api_key>
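
A run will fail at the first API call if the key is missing, so it can help to check the environment up front. The helper below is a hypothetical sketch (`require_api_key` is not part of synth); the variable names match the engines listed above.

```python
import os

# Map each engine to the environment variable synth expects.
ENGINE_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "together": "TOGETHER_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}

def require_api_key(engine: str) -> str:
    """Return the API key for `engine`, or raise with a hint if unset."""
    var = ENGINE_KEY_VARS[engine]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; run: export {var}=<your_api_key>")
    return key
```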

Step 3. Create your config

# Example configuration using the OpenAI engine:
strong_model:
  engine: openai
  model: gpt-4
  seed: 42
  generation_config:
    max_tokens: 4096
    temperature: 0.7
    top_p: 0.9

target_model:
  engine: openai
  model: gpt-3.5-turbo
  seed: 42
  generation_config:
    max_tokens: 4096
    temperature: 0.7
    top_p: 0.9

judge_model:
  engine: openai
  model: gpt-4o
  seed: 42
  generation_config:
    max_tokens: 512
    temperature: 0.0
    top_p: 1.0

pipeline:
  n_instructions: 2
  n_rubrics: 4
  n_iterations: 4
  margin_threshold: 0.5
  output_path: <path_to_output>
  dataset_path: <local_path_or_huggingface_dataset>
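
Parsed with a YAML loader, the config above becomes nested dicts. A minimal sanity check (hypothetical, not part of synth) that mirrors the example's structure might look like:

```python
# Required sections and keys, taken from the example config above.
REQUIRED_MODELS = ("strong_model", "target_model", "judge_model")
REQUIRED_PIPELINE_KEYS = (
    "n_instructions", "n_rubrics", "n_iterations",
    "margin_threshold", "output_path", "dataset_path",
)

def validate_config(config: dict) -> None:
    """Raise ValueError if a required section or key is missing."""
    for section in REQUIRED_MODELS:
        model = config.get(section)
        if not model or "engine" not in model or "model" not in model:
            raise ValueError(f"'{section}' must define 'engine' and 'model'")
    pipeline = config.get("pipeline", {})
    missing = [k for k in REQUIRED_PIPELINE_KEYS if k not in pipeline]
    if missing:
        raise ValueError(f"pipeline config is missing: {missing}")
```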

Step 4. Run the pipeline

python main.py -c <path_to_config_file> -p <number_of_parallel_processes>
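
The `-p` flag sets how many seed instructions are processed in parallel. As a rough illustration of the fan-out pattern (a hypothetical sketch using a thread pool for simplicity, not synth's actual implementation, which runs separate processes):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline_on_seed(seed: str) -> str:
    # Stand-in for one full pipeline pass over a single seed instruction.
    return f"generated from: {seed}"

def run_parallel(seeds, n_workers: int):
    """Process seeds concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_pipeline_on_seed, seeds))
```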

CodecLM pipeline overview


Figure 1: Overview of the CodecLM Pipeline

Pipeline steps

  1. Instruction encoding

    • Encodes the seed instruction into skills and use-cases.
    • Model Used: Strong LLM
  2. Simple instructions generation

    • Generates a new instruction based on skills and use-cases without accessing the original instruction.
    • Model Used: Strong LLM
  3. Rubric and action generation

    • Generates n rubrics and corresponding actions. The actions are used to enhance the instruction.
    • Model Used: Strong LLM
  4. Instruction improvement

    • Enhances the instruction according to a randomly selected rubric and action.
    • Model Used: Strong LLM
  5. Answer generation

    • Generates answers for the instructions.
    • Model Used: Strong and Target LLMs
  6. Instruction-answer pair ranking

    • Ranks instruction-answer pairs, scoring responses from both LLMs.
    • Model Used: Strong LLM
  7. Contrastive filtering

    • Filters instructions based on the score margin, sending those below the threshold back to Step 3 for further improvement.
    • Model Used: Strong LLM
  8. Final generated instructions ranking

    • Ranks the final generated instructions using a judge model. This step is optional. The score given by the Judge LLM is between 0 and 5.
    • Model Used: Judge LLM
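
The improvement loop in steps 3-7 can be sketched as follows. All function and variable names here are illustrative stand-ins, not synth's actual API; the scoring function abstracts steps 5-6 (answering with both models and scoring the pair).

```python
import random

def refine(instruction, rubrics, actions, score_fn, margin_threshold, n_iterations):
    """Improve an instruction until the strong/target score margin passes."""
    history = []
    for step in range(n_iterations):
        # Steps 3-4: sample a rubric/action pair and enhance the instruction.
        rubric, action = random.choice(list(zip(rubrics, actions)))
        instruction = f"{instruction} [{action}]"  # stand-in for LLM improvement
        # Steps 5-6: generate answers with both models and score the margin.
        margin = score_fn(instruction)
        history.append((step, instruction, margin))
        # Step 7: contrastive filtering - stop once the margin passes.
        if margin >= margin_threshold:
            break
    return instruction, history
```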

Results

After running the pipeline, three files will be saved:

  1. generated_dataset.json - Contains the final generated dataset.
{
    "instruction": "The generated instruction",
    "answer": "The generated answer for the instruction",
    "model": "The model used to generate the answer",
    "contrastive_score": "The contrastive score of the instruction-answer pair",
    "judge_instruction_score": "The score given by the JudgeLLM",
    "judge_reason": "The reason for the score given by the JudgeLLM",
    "judge_model_name": "The JudgeLLM model used to score the instruction-answer pair",
    "topic": "The topic of the instruction",
    "subtopic": "The subtopic of the instruction"
}
  2. processed_data.json - Contains all the data processed by the pipeline.
{
    "instruction_index": "The index of the instruction",
    "seed_instruction": "The seed instruction from the dataset",
    "task": "The task of the instruction",
    "skills": "The skills extracted from the instruction",
    "rubrics": "The rubrics generated for the instruction",
    "actions": "The actions generated for the instruction",
    "simple_instructions": "The simple instructions generated given the skills and task",
    "strong_model": "The StrongLLM model",
    "target_model": "The TargetLLM model",
    "improved_instructions": {
        "improvement_step": "The step of the improvement",
        "original_instruction": "The simple instruction",
        "rubric": "The sampled rubric used for the improvement",
        "action": "The sampled action used for the improvement",
        "improved_instruction": "The improved instruction",
        "strong_answer": "The answer generated by the StrongLLM",
        "target_answer": "The answer generated by the TargetLLM",
        "strong_score": "The strong answer score from the contrastive filtering",
        "target_score": "The target answer score from the contrastive filtering",
        "improvement_history": "All the history of the improvements"
    }
}
  3. skipped_data.json - Contains the indices of the instructions skipped by the pipeline.
[
    "The index of the skipped instructions"
]
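
Since the judge scores range from 0 to 5 (Step 8), a common post-processing step is to keep only the highest-rated pairs. A hypothetical filter over `generated_dataset.json` (assuming it holds a list of records with the fields above):

```python
import json

def load_high_quality(path: str, min_score: float = 4.0) -> list:
    """Return records whose judge score is at least `min_score`."""
    with open(path) as f:
        records = json.load(f)
    return [r for r in records
            if float(r["judge_instruction_score"]) >= min_score]
```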

Examples

Example 1

Figure 2. Example 1 - output from the CodecLM pipeline


Example 2

Figure 3. Example 2 - output from the CodecLM pipeline

Supported engines


  • OpenAI
  • Mistral
  • Anthropic
  • TogetherAI

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.