
A command-line interface to generate textual and conversational datasets with LLMs.

Primary language: Python

datasetGPT

datasetGPT is a command-line interface and a Python library for running inference on large language models (LLMs) to generate textual datasets.

Possible use cases may include:

  • Constructing textual corpora to train/fine-tune detectors for content written by AI.
  • Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
  • Automating a task that an LLM can handle over large amounts of input text. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
  • Leveraging APIs of especially big LLMs to produce diverse texts for a specific task and then fine-tune a smaller model with them.

This tool is distributed freely and imposes no restrictions on downstream use cases. However, make sure you follow the terms of use of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.

Installation

pip install datasetGPT

Most of the generation features rely on third-party APIs. Install their respective packages:

pip install openai cohere petals

Usage examples

Inference LLMs at scale

export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."

datasetGPT texts \
    --prompt "If {country} was a planet in the Star Wars universe it would be called" \
    --backend "openai|text-davinci-003" \
    --backend "cohere|medium" \
    --temperature 0.9 \
    --option country Germany \
    --option country France \
    --max-length 50 \
    --num-samples 1 \
    --single-file

The command above should produce a dataset file with 4 texts. Each possible combination of options is used for each of the backend LLMs. Check out the CLI reference for more details. A dataset file looks like this:

[
    {
        "sample_id": 0,
        "backend": "openai|text-davinci-003",
        "max_length": 50,
        "temperature": 0.9,
        "country": "Germany",
        "prompt": "If Germany was a planet in the Star Wars universe it would be called",
        "output": " Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be"
    },
    {
        "sample_id": 0,
        "backend": "openai|text-davinci-003",
        "max_length": 50,
        "temperature": 0.9,
        "country": "France",
        "prompt": "If France was a planet in the Star Wars universe it would be called",
        "output": " The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. It would have a strong military and"
    },
    {
        "sample_id": 0,
        "backend": "cohere|medium",
        "max_length": 50,
        "temperature": 0.9,
        "country": "Germany",
        "prompt": "If Germany was a planet in the Star Wars universe it would be called",
        "output": " the Hoth of the universe.\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930"
    },
    ...
]
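
Since a single-file dataset is plain JSON, it can be post-processed with the standard library alone. The sketch below is only an illustration: the records are shortened stand-ins for the sample above, and load_dataset and group_outputs_by_backend are hypothetical helper names, not part of datasetGPT.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Shortened stand-ins for the records shown above (outputs truncated for brevity).
records = [
    {"sample_id": 0, "backend": "openai|text-davinci-003", "country": "Germany", "output": " Euron."},
    {"sample_id": 0, "backend": "openai|text-davinci-003", "country": "France", "output": " The Empire of Liberty."},
    {"sample_id": 0, "backend": "cohere|medium", "country": "Germany", "output": " the Hoth of the universe."},
]


def load_dataset(path):
    """Read a single-file dataset (a JSON list of records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_outputs_by_backend(items):
    """Map each backend string to the list of its stripped outputs."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["backend"]].append(item["output"].strip())
    return dict(grouped)


# Write the stand-in records to a temporary file, then read them back
# the same way a real dataset file would be read.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.json"
    path.write_text(json.dumps(records), encoding="utf-8")
    items = load_dataset(path)

outputs_by_backend = group_outputs_by_backend(items)
for backend, outputs in outputs_by_backend.items():
    print(f"{backend}: {len(outputs)} text(s)")
```

Grouping by the backend field makes it easy to compare how different models completed the same prompt.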

Alternatively, you can use our TextsGenerator to produce texts in Python.

Generate conversations with the ChatGPT API

export OPENAI_API_KEY="..."
datasetGPT conversations \
    --length 4 \
    --length 5 \
    --agent1 "You're a shop assistant in a pet store. Answer to customer questions politely." \
    --agent2 "You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store." \
    --temperature 0.1 \
    --temperature 0.2 \
    --option n 2 \
    --option n 3 \
    --path dataset

The command above should produce a dataset directory with 16 conversations saved as JSON files. Pass --single-file to save them all in one file instead. Why 16? Because num_samples dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options). Here there are 2 lengths × 2 temperatures × 2 values of n = 8 combinations, so 16 conversations corresponds to 2 samples per combination. A dataset item looks like this:

{
    "length": 5,
    "temperature": 0.1,
    "n": "2",
    "agent1": "You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \"Goodbye\".",
    "agent2": "You're a customer in a pet store. You should behave like a human. You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \"Goodbye\".",
    "utterances": [
        [
            "agent1",
            "Hello! How can I assist you today?"
        ],
        [
            "agent2",
            "Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"
        ],
        [
            "agent1",
            "Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?"
        ],
        [
            "agent2",
            "I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?"
        ],
        ...
    ]
}
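
Each utterance in the item above is a two-element list of speaker and text, so a conversation can be flattened into a readable transcript in a few lines. This is a minimal sketch working on a shortened copy of the sample; to_transcript and utterance_counts are hypothetical helpers, not functions provided by datasetGPT.

```python
from collections import Counter

# Shortened copy of the dataset item shown above.
item = {
    "length": 5,
    "temperature": 0.1,
    "n": "2",
    "utterances": [
        ["agent1", "Hello! How can I assist you today?"],
        ["agent2", "Hi! I'm interested in buying two pets."],
        ["agent1", "Certainly! We have a variety of pets available."],
    ],
}


def to_transcript(item, names=None):
    """Render utterances as 'Speaker: text' lines, optionally renaming speakers."""
    names = names or {}
    return "\n".join(
        f"{names.get(speaker, speaker)}: {text}"
        for speaker, text in item["utterances"]
    )


def utterance_counts(item):
    """Count how many utterances each agent produced."""
    return Counter(speaker for speaker, _ in item["utterances"])


transcript = to_transcript(item, names={"agent1": "Assistant", "agent2": "Customer"})
counts = utterance_counts(item)
print(transcript)
print(dict(counts))
```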

You can also use our ConversationsGenerator to produce conversations with the gpt-3.5-turbo API programmatically.

from datasetGPT import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter() # single_file=True

generator_config = ConversationsGeneratorConfig(
    agent1="You're a shop assistant in a pet store. Answer to customer questions politely.",
    agent2="You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.",
    num_samples=2,
    interruption="length",
    lengths=[4, 5],
    temperatures=[0.1, 0.2],
    options=[("n", "2"), ("n", "3")],
    model="gpt-3.5-turbo",  # or model="gpt-4"
    # Specific models per agent can be set. If one of the parameters is set,
    # the other must also be provided, otherwise both agents use the "model" value.
    model_agent_one="gpt-3.5-turbo",
    model_agent_two="gpt-4",
)

conversations_generator = ConversationsGenerator(generator_config)

for conversation in conversations_generator:
    dataset_writer.save_intermediate_result(conversation)
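
To estimate how many conversations a configuration like the one above would yield, the parameter combinations can be enumerated by hand. The sketch below only illustrates the combinatorics described earlier (num_samples dialogues per combination of length, temperature, and options); it is not the library's code, and expand_options and enumerate_configs are hypothetical names. It assumes option pairs sharing a name are alternative values for that placeholder.

```python
from collections import defaultdict
from itertools import product


def expand_options(options):
    """Group (name, value) pairs by name and return one dict per combination."""
    grouped = defaultdict(list)
    for name, value in options:
        grouped[name].append(value)
    names = list(grouped)
    # With no options at all, product() yields a single empty combination.
    return [dict(zip(names, values)) for values in product(*grouped.values())]


def enumerate_configs(lengths, temperatures, options, num_samples):
    """List every (length, temperature, options) combination, num_samples times each."""
    configs = []
    for length, temperature, opts in product(lengths, temperatures, expand_options(options)):
        for sample_id in range(num_samples):
            configs.append(
                {"sample_id": sample_id, "length": length, "temperature": temperature, **opts}
            )
    return configs


configs = enumerate_configs(
    lengths=[4, 5],
    temperatures=[0.1, 0.2],
    options=[("n", "2"), ("n", "3")],
    num_samples=2,
)
print(len(configs))  # 2 lengths * 2 temperatures * 2 option values * 2 samples = 16
```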

Contributing

datasetGPT is still under active development.

Contributions will be highly appreciated. Currently these features are under development:

  • datasetGPT conversations - Make two ChatGPT agents talk with one another and record the conversation history.
  • datasetGPT texts - Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.
  • datasetGPT transformations - Apply a list of transformations to a list of texts. For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.
  • Support more backend LLMs.

To set up a local development environment:

git clone https://github.com/radi-cho/datasetGPT/
cd datasetGPT
pip install -e .

CLI Reference

datasetGPT [OPTIONS] COMMAND [ARGS]...

  Command line interface that generates datasets with LLMs.

Options:
  --help  Show this message and exit.

Commands:
  conversations  Produce conversations between two gpt-3.5-turbo agents...
  texts          Inference multiple LLMs at scale.

datasetGPT texts [OPTIONS]

  Inference multiple LLMs at scale.

Options:
  -p, --prompt TEXT            Input prompt.  [required]
  -b, --backend TEXT           LLM APIs to use as backends. Use
                               "backend|model_name" notation. For example:
                               "openai|text-davinci-003".
  -l, --max-length INTEGER     Maximum number of tokens to generate for each
                               prompt.
  -t, --temperature FLOAT      Possible temperature values for the backend
                               language model.
  -n, --num-samples INTEGER    Number of conversations for each configuration.
  -o, --option <TEXT TEXT>...  Values for additional options denoted in your
                               prompts by {OPTION_NAME}.
  -f, --path PATH              Where to save the dataset. Either a file or a
                               directory (folder).
  -s, --single-file            Either save the whole dataset to a single file
                               or create multiple files.
  --help                       Show this message and exit.

  • You can specify multiple variants for the following options: --backend, --max-length, --temperature, --num-samples, --option. A dataset item will be generated for each possible combination of the supplied values.
  • Each --option provided must be formatted as follows: --option option_name "Some option value".
  • Currently supported backends: GPT-3 and GPT-4 model variants by OpenAI, the language models by Cohere, and BLOOM through the Petals API.
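
The "backend|model_name" notation and the {OPTION_NAME} placeholders can be illustrated with a short sketch. It assumes placeholders behave like Python str.format fields, which the README does not confirm; parse_backend and fill_prompt are hypothetical helpers and not part of the datasetGPT API.

```python
from itertools import product


def parse_backend(spec):
    """Split a 'backend|model_name' string into its two parts."""
    backend, sep, model = spec.partition("|")
    if not sep or not backend or not model:
        raise ValueError(f"expected 'backend|model_name', got {spec!r}")
    return backend, model


def fill_prompt(template, options):
    """Substitute {OPTION_NAME} placeholders (assumed to act like str.format fields)."""
    return template.format(**options)


prompt = "If {country} was a planet in the Star Wars universe it would be called"
backends = ["openai|text-davinci-003", "cohere|medium"]
countries = ["Germany", "France"]

# One job per (backend, option value) pair, mirroring the earlier texts example.
jobs = [
    {"backend": parse_backend(spec), "prompt": fill_prompt(prompt, {"country": country})}
    for spec, country in product(backends, countries)
]

for job in jobs:
    print(job["backend"], "->", job["prompt"])
```

With two backends and two values for country, this yields the four texts mentioned in the usage example.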

datasetGPT conversations [OPTIONS]

  Produce conversations between two gpt-3.5-turbo agents with given roles.

Options:
  -k, --openai-api-key TEXT       OpenAI API key.
  -a, --agent1 TEXT               Agent role description.  [required]
  -b, --agent2 TEXT               Agent role description.  [required]
  -i, --interruption [length|end_phrase]
                                  Interruption mode.
  -e, --end-phrase TEXT           Interrupt after this phrase is outputted by
                                  one of the agents.
  -d, --end-agent [agent1|agent2|both]
                                  In which agent's messages to look for the
                                  end phrase.
  -l, --length INTEGER            Maximum number of utterances for each agent.
                                  A conversation sample will be generated for
                                  each length.
  -t, --temperature FLOAT         Possible temperature values for the backend
                                  language model.
  -m, --model [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use. GPT-4 is only
                                  supported if provided API key has access to
                                  GPT4. Defaults to GPT-3.5-Turbo.
  -m1, --model-agent1 [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use for agent1. GPT-4
                                  is only supported if given API key has
                                  access to GPT4. Defaults to GPT-3.5-Turbo.
                                  If set, --model-agent2 must also be
                                  provided, otherwise --model value will be
                                  used.
  -m2, --model-agent2 [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use for agent2. GPT-4
                                  is only supported if given API key has
                                  access to GPT4. Defaults to GPT-3.5-Turbo If
                                  set, --model-agent1 must also be provided,
                                  otherwise --model value will be used.
  -n, --num-samples INTEGER       Number of conversations for each
                                  configuration.
  -o, --option <TEXT TEXT>...     Values for additional options denoted in
                                  your prompts by {OPTION_NAME}.
  -f, --path PATH                 Where to save the dataset. Either a file or
                                  a directory (folder).
  -s, --single-file               Either save the whole dataset to a single
                                  file or create multiple files.
  --help                          Show this message and exit.

  • The length parameter specifies how many utterances each agent should make. A length of 4 typically produces 8 utterances in total.
  • You can specify either length (default) or end_phrase as an interruption strategy. When using end_phrase a conversation will be interrupted once the --end-phrase has appeared in the messages of the --end-agent (could be both). In this case, the lengths provided will be treated as maximum conversation lengths.
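
The two interruption strategies can be summarized as a stopping rule. The function below is a minimal sketch of the behaviour described in the notes above, not the tool's actual implementation; should_stop and its signature are hypothetical, and it assumes the end phrase is matched as a plain substring of the latest message.

```python
def should_stop(utterances, length, interruption="length", end_phrase=None, end_agent="both"):
    """Decide whether a conversation should end.

    utterances: list of (agent, text) pairs produced so far.
    length: maximum number of utterances per agent.
    """
    # The length cap applies in both modes: `length` utterances for each of two agents.
    if len(utterances) >= 2 * length:
        return True
    if interruption == "end_phrase" and end_phrase and utterances:
        agent, text = utterances[-1]
        # Only look at the latest message, and only if it came from a watched agent.
        if end_agent in ("both", agent) and end_phrase in text:
            return True
    return False


history = [("agent1", "Hello! How can I assist you today?"), ("agent2", "No thanks. Goodbye")]

print(should_stop(history, length=4))
print(should_stop(history, length=4, interruption="end_phrase", end_phrase="Goodbye", end_agent="agent2"))
```

In length mode the end phrase is ignored, so the first call reports that the conversation continues, while the second stops because agent2 said the end phrase.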