Build-A-Dataset

If you can "Build-A-Bear" you can Build-A-Dataset

A suite of tools designed for constructing complex and unique datasets without requiring advanced coding skills (everything is run from bash shell). This toolset can be adapted to fit a wide variety of use-cases. Still under active development, may contain bugs. Contributions are welcome.

Sample Datasets

https://huggingface.co/datasets/practicaldreamer/RPGPT_PublicDomain-alpaca https://huggingface.co/datasets/practicaldreamer/RPGPT_PublicDomain-ShareGPT

Scripts and Data Flow

This suite uses a series of JSON files to pass data between the different scripts. Below, you can see each script along with its input and output files.

promptIt.py: Generates an array of messages to be sent to the OpenAI API.

python promptIt.py -input_json <input_file> -output_json <output_file> -list_size <number> -first_prompt <prompt> -next_prompt <prompt> -assistant_prompt <prompt>

askIt.py: Fills in responses using the OpenAI API. threaded version does same thing just threaded and doesn't support chat compounding... I plan to merge the two

python askIt.py -input_json <input_file> -output_json <output_file> -include_chat_history -max_chat_history <number> -resume -api_key <api_key> -api_url <api_url> -model <model> -temperature <value> -top_p <value> -presence_penalty <value> -frequency_penalty <value> -max_tokens <number>

trimIt.py: Trims responses obtained from the OpenAI API.

python trimIt.py -input_json <input_file> -output_json <output_file> -trim_lines_from_start <number> -trim_lines_from_end <number> -trim_assistant_prompt -trim_blanks -last_line_starts_with <string>

splitIt.py: Splits API responses into properties, discarding the conversation.

python splitIt.py -input_json <input_file> -output_json <output_file> -split_on <string> -new_key <key>

mixIt.py: Randomly matches a left dataset with a right dataset.

python mixIt.py -input_json_big <big_input_file> -input_json_small <small_input_file> -output_json <output_file> -iterations <number>

conformIt.py: Finalizes dataset by conforming to alpaca or shareGPT format for training.

python conformIt.py -input_json <prompted_json> -output_json <formatted_json> -format <"Alpaca" or "ShareGPT">

Environmental Variables

The askIt.py script expects the following environment variables to be set or passed via arguments:

openai.api_key: Your OpenAI API key.
openai.api_base: The base URL for the OpenAI API.

Example Usage: build-a-dataset_example_rpgptv1.sh

Included in the repository is build-a-dataset_example_rpgptv1.sh, a script demonstrating a possible way to chain together these tools to build a comprehensive dataset of character conversations in the public domain following an rp format (think Sherlock Holmes and Peter Pan). It progresses through the following steps:

Retrieving genres.
Acquiring books from those genres.
Extracting characters from those books.
Randomly pairing characters.
Gathering scenario moods.
Mixing the character pairs with scenario moods.
Generating scenarios based on these character pairs and moods.
Creating roleplay conversations from the scenarios with moods and character pairs.
Conforming output to ShareGPT and Alpaca Formats

Please note that this is just one example of the multitude of applications for these scripts.

Disclaimer

This project is currently under active development and may still contain bugs. While the scripts have been designed to work for a specific use case, extensive testing has yet to be performed. The functionality of the scripts can vary depending on the data and tasks. Please use them with this understanding.

I welcome contributions to improve this project. If you find any issues or have suggestions, please open an issue or submit a pull request.

License

The Build-A-Dataset project is licensed under the MIT license.