
SynthToT: Generate synthetic data for your training dataset through deliberate problem-solving (Tree of Thoughts, S. Yao et al., 2023).

Primary language: Python. License: Apache-2.0.

SynthToT

Generate synthetic data for your training dataset through deliberate problem-solving



Introduction

SynthToT is a simple AI agent system powered by LangChain and developed by the Mathematics and AI Institute. It is designed to automate the generation of synthetic datasets for training large language models. The SynthToT agent uses the Tree-of-Thoughts (ToT) prompting strategy introduced in "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Shunyu Yao, Dian Yu, et al.), "which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."


About Mathematics & AI Institute

At the Mathematics and AI Institute, our mission is to advance state-of-the-art research at the intersection of AI and mathematics. We specialize in generative AI technologies, leveraging their potential to empower industries with innovative projects. Generative AI enables the creation of diverse content, including text, images, and videos, while adapting to new data. We acknowledge the potential risks associated with the rapid evolution of generative AI, and our team is dedicated to ensuring the safety and interpretability of these tools. Our core areas of focus include AI Safety, Industry Integration, Fundamental Research, and Tailored AI Training for various sectors.

For more information about us, please visit our website at matyz.org/en


By implementing the Tree-of-Thoughts strategy, the SynthToT agent offers a CLI that generates JSON datasets by applying ToT reasoning to the seed input content. The resulting datasets are intended for training language models and adhere to the following JSON schema:

Output JSON Schema

[
    {
        "input": "",
        "perfect_consideration": "",
        "number": 10,
        "perspective": "",
        "prompt_template": "",
        "solutions": "",
        "thought_process": "",
        "sorted_solutions": "",
        "data_out": ""
    }
    // N items of seed input list
]

Input JSON Schema

[
    {
        "input": "",
        "perfect_consideration": "",
        "number": 10,
        "perspective": ""
    },
    // N items of seed input list
]

Example input_data and output_data files are available in the examples folder.
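Before running the agent, it can help to check that a seed file matches the input schema. The following is a minimal sketch, not part of the SynthToT package: the function name `validate_seed_items` and the validation rules are assumptions for illustration, and only the four field names and their example types come from the schema above.

```python
import json

# Field names and example types come from the input schema above.
# Treating all four fields as required is an assumption of this sketch.
REQUIRED_FIELDS = {
    "input": str,
    "perfect_consideration": str,
    "number": int,
    "perspective": str,
}


def validate_seed_items(items):
    """Return a list of human-readable problems found in a seed input list."""
    if not isinstance(items, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected a JSON object")
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in item:
                problems.append(f"item {index}: missing field '{field}'")
            elif type(item[field]) is not expected_type:
                # Exact type check so that True/False is not accepted as a number.
                problems.append(
                    f"item {index}: field '{field}' should be {expected_type.__name__}"
                )
    return problems


# Hypothetical seed content, used only to exercise the check.
sample_text = """
[
    {"input": "Explain gradient descent.",
     "perfect_consideration": "Accurate and concise.",
     "number": 10,
     "perspective": "A university lecturer"}
]
"""
sample_problems = validate_seed_items(json.loads(sample_text))
```

An empty list means the seed list has the expected shape; otherwise each entry names the offending item and field.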

Features

Chaining

  • Initialization: Customizable parameters such as maximum tokens per response, model type, and sampling temperature.

    • max_tokens: Limits the number of tokens generated per response.
    • model: Specifies the language model to use (default is "gpt-4").
    • temperature: Controls the randomness of the output (default is 0).
  • Template Management: Utilizes a set of predefined Tree-of-Thoughts templates and corresponding output keys for structured data generation.

  • LLMChain: Initializes an LLMChain with a specified prompt template and output key, using the selected language model and parameters.

  • Chain Assembly: Generates a list of LLMChain instances based on the predefined templates and output keys.
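The chaining steps above can be sketched in plain Python, without LangChain, to show how templates and output keys fit together. Everything here is an assumption for illustration: the template texts are invented, `ChainStep`, `build_chains`, `run_chains`, and `fake_llm` are hypothetical names, and reading `number` as the count of candidate solutions is a guess. Only the output keys mirror fields of the output JSON schema. In the real tool, the model, max_tokens, and temperature settings would be applied to the language model that replaces `fake_llm`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical (template, output_key) pairs. The real Tree-of-Thoughts
# templates live in the repository and are not reproduced here.
TEMPLATES: List[Tuple[str, str]] = [
    ("Propose {number} solutions for: {input}", "solutions"),
    ("Evaluate each solution:\n{solutions}", "thought_process"),
    ("Rank the solutions using this evaluation:\n{thought_process}", "sorted_solutions"),
    ("Write the final answer from:\n{sorted_solutions}", "data_out"),
]


@dataclass
class ChainStep:
    """Fill a template from the current state and store the reply under output_key."""

    template: str
    output_key: str
    llm: Callable[[str], str]

    def run(self, state: Dict[str, object]) -> Dict[str, object]:
        prompt = self.template.format(**state)
        return {**state, self.output_key: self.llm(prompt)}


def build_chains(llm: Callable[[str], str]) -> List[ChainStep]:
    """Assemble one step per (template, output_key) pair."""
    return [ChainStep(template, key, llm) for template, key in TEMPLATES]


def run_chains(chains: List[ChainStep], seed: Dict[str, object]) -> Dict[str, object]:
    """Run the steps in order; each step can read the outputs of earlier steps."""
    state = dict(seed)
    for step in chains:
        state = step.run(state)
    return state


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs offline.
    return f"[reply to: {prompt.splitlines()[0]}]"


seed = {"input": "q", "perfect_consideration": "", "number": 3, "perspective": "p"}
result = run_chains(build_chains(fake_llm), seed)
```

After the run, `result` holds the original seed fields plus one entry per output key, which is the same shape as an item in the output schema.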

Enterprise Grid

At the Mathematics and AI Institute, we offer an enterprise edition of SynthToT, tailored specifically for enterprise users. SynthToT Enterprise Edition provides advanced features designed to deliver a streamlined, scalable, and production-grade synthetic data generation agent for your organization's needs. With SynthToT Enterprise, you can meet the demands of large-scale data products and run a complex synthetic dataset generation system for safety-critical applications.

  • Autotransformers: Integrate autotransformers for automated data transformation, enabling efficient and seamless preprocessing of input data for synthetic dataset generation.

  • User Interface (UI): Access SynthToT Enterprise Edition through an intuitive user interface, providing a user-friendly experience for configuring settings, monitoring processes, and accessing generated datasets.

  • Scalability: Scale SynthToT Enterprise Edition effortlessly to accommodate growing datasets and increased computational demands, ensuring seamless performance under heavy workloads.

  • Integration Capabilities: Integrate SynthToT Enterprise Edition with existing event-driven data processing systems and MLOps pipelines for enhanced interoperability and continuous data flow.

  • Automated Quality Assurance: Utilize automated quality assurance mechanisms to ensure the accuracy, consistency, and reliability of generated synthetic datasets, reducing manual intervention and error rates.

  • Pluggable Thought Templates: Customize the synthetic data generation process by plugging in thought templates, allowing users to define and use their own templates tailored to specific use cases and domains.

Get Started with SynthToT Enterprise Edition

To get started with SynthToT Enterprise Edition, contact our team for a demo or trial.

Usage

Initial setup

I. Create a new Conda or virtual environment with Python 3.10:

conda create -n synthtotenv python=3.10.11
conda activate synthtotenv

Alternatively, create a virtual environment with venv.

II. Clone the repository, go into folder, and install requirements:

Clone from the remote:

git clone https://github.com/dtunai/SynthToT/

Switch to package folder:

cd SynthToT

Install requirements:

pip install -r setup-requirements.txt

Install the package in editable mode:

pip install -e .

Preparing Input Data List

Next, create your input data list. This list is the seed from which the agent's chains generate synthetic output and candidate solutions using the Tree-of-Thoughts approach. See the examples folder for sample input data.
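One way to prepare the list is to build it in Python and write it out as JSON. This is a minimal sketch under stated assumptions: `make_seed_item` and `write_seed_file` are hypothetical helpers, not part of SynthToT, and the default of 10 for `number` simply copies the value shown in the input schema.

```python
import json
from pathlib import Path


def make_seed_item(input_text, perfect_consideration, perspective, number=10):
    """Build one seed item with the four fields of the input schema."""
    return {
        "input": input_text,
        "perfect_consideration": perfect_consideration,
        "number": number,
        "perspective": perspective,
    }


def write_seed_file(items, path):
    """Write the seed list as a JSON array and return the path written."""
    path = Path(path)
    path.write_text(json.dumps(items, indent=4, ensure_ascii=False), encoding="utf-8")
    return path


# Hypothetical seed content for illustration.
seed_items = [
    make_seed_item(
        "How does binary search work?",
        "Correct, step by step, with complexity noted.",
        "A software engineer",
    ),
    make_seed_item(
        "Why is the sky blue?",
        "Physically accurate and accessible.",
        "A physics teacher",
        number=5,
    ),
]
```

Calling `write_seed_file(seed_items, "input_data.json")` would then produce a file that can be passed to the CLI as the input file; the file name is only an example.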

Using the Tool

After creating your input list, pass it to SynthToT as the seed via the CLI:

python synthtot/synthtot.py \
    --input-file <INPUT_FILE_PATH> \
    --output-file <OUTPUT_FILE_PATH> \
    --prompt-pairs-format <system-user-assistant || instruct> \
    --model <OPENAI_MODEL_NAME> \
    --max-tokens <MAX_TOKEN_NUMBER> \
    --temperature <TEMPERATURE_FLOAT>
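
To make the flags concrete, here is a sketch of an argparse parser that accepts the same options. It is not the repository's actual parser: `build_parser` is a hypothetical name, marking the first three flags as required is an assumption, and the defaults for `--model` and `--temperature` are taken from the Chaining section ("gpt-4" and 0). No default for `--max-tokens` is documented, so the sketch leaves it unset.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented SynthToT flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="synthtot")
    parser.add_argument("--input-file", required=True, help="Path to the seed input JSON")
    parser.add_argument("--output-file", required=True, help="Path for the generated JSON")
    parser.add_argument(
        "--prompt-pairs-format",
        required=True,
        choices=["system-user-assistant", "instruct"],
        help="Layout of the generated prompt pairs",
    )
    # Defaults below follow the Chaining section; treat them as assumptions
    # about the CLI itself.
    parser.add_argument("--model", default="gpt-4", help="OpenAI model name")
    parser.add_argument("--max-tokens", type=int, default=None, help="Token limit per response")
    parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
    return parser


args = build_parser().parse_args(
    [
        "--input-file", "input_data.json",
        "--output-file", "output_data.json",
        "--prompt-pairs-format", "instruct",
    ]
)
```

argparse converts dashes in flag names to underscores, so the values are read as `args.input_file`, `args.prompt_pairs_format`, and so on.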

Contributing

If you find a bug, please open a bug report. If you have an idea for an improvement or new feature 🚀, please open a feature request.