This repository provides scripts to build your own instruction dataset through OpenAI services. We specifically make use of Azure services.
If you use the Azure services in the following scripts, you will need to specify a credentials file. In the examples below, this file has been saved as .credentials.json. It should have the following structure, where each top-level key is a "profile", like "gpt-4":
{
  "gpt-4": {
    "endpoint": "https://abc.openai.azure.com/",
    "api_key": "[secret-key]",
    "api_version": "2023-07-01-preview",
    "deployment_name": "deployment-name1"
  },
  "gpt-35-turbo": {
    "endpoint": "https://def.openai.azure.com/",
    "api_key": "[secret-key]",
    "api_version": "2023-07-01-preview",
    "deployment_name": "deployment-name2"
  }
}
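A minimal sketch of how one profile could be read from such a credentials file. The function name `load_profile` and the check for required keys are assumptions made for illustration; they are not taken from the scripts in this repository.

```python
import json
from pathlib import Path

# Keys that every profile in the example structure above contains.
REQUIRED_KEYS = {"endpoint", "api_key", "api_version", "deployment_name"}


def load_profile(credentials_file, profile):
    """Return the settings of one profile (hypothetical helper)."""
    credentials = json.loads(Path(credentials_file).read_text(encoding="utf-8"))
    if profile not in credentials:
        raise KeyError(f"Profile '{profile}' not found; available: {sorted(credentials)}")
    missing = REQUIRED_KEYS - credentials[profile].keys()
    if missing:
        raise ValueError(f"Profile '{profile}' is missing keys: {sorted(missing)}")
    return credentials[profile]
```

With the example file above, `load_profile(".credentials.json", "gpt-4")` would return the dictionary stored under the "gpt-4" key.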
For all commands, a --help option is available with more explanations about all the arguments.
Launch an interactive query session. This allows you to query the OpenAI API and "talk" to the model. The implementation is deliberately simple: it does not truncate or filter the conversation history when you exceed the context window, so do not use it for extended conversations.
It supports both Azure services and Hugging Face models.
Example usage with Azure and the gpt-35-turbo profile:
interactive-query azure .credentials.json gpt-35-turbo
Example usage with Hugging Face and the BramVanroy/Llama-2-13b-chat-dutch model (transformers must be installed, and for many quantization options you will also need accelerate and bitsandbytes):
interactive-query huggingface BramVanroy/Llama-2-13b-chat-dutch --load-in-8bit
Most of the time we want to start by translating system messages and/or user messages, and then "answer" those in a later step. translate-hf is the entry point to translate specific columns and splits of any dataset on the Hugging Face hub. It saves the translated dataset to a temporary location and then uploads it to the hub. It should be relatively robust, as it saves intermediate results and can simply restart where it left off.
Example usage:
translate-hf HuggingFaceH4/ultrachat_200k data/ultrachat_200k/ultrachat_200k-gpt-4-turbo-translated --split train_sft --split test_sft --columns prompt --src-lang English --tgt-lang Dutch --output-hub-name BramVanroy/ultrachat_200k_dutch --output-hub-revision 1-gpt-4-turbo-translated -j 8 --system-prompt .transl_sysprompt_en-nl
This will:
- Translate the train_sft and test_sft splits of HuggingFaceH4/ultrachat_200k from English to Dutch
- It will save temporary results to data/ultrachat_200k/ultrachat_200k-gpt-4-turbo-translated
- It will upload the final dataset to revision (branch) 1-gpt-4-turbo-translated in the BramVanroy/ultrachat_200k_dutch dataset
- It will use 8 processes to speed up the translation
- It will use the .transl_sysprompt_en-nl file that contains a system prompt as the system message
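To illustrate what saving intermediate results and restarting where the process left off can look like, here is a minimal sketch. The function `process_resumable`, the JSONL checkpoint format, and the `work_fn` callback are all hypothetical; the actual translate-hf script may store its intermediate results differently.

```python
import json
from pathlib import Path


def process_resumable(items, work_fn, checkpoint_file):
    """Apply work_fn to every item, skipping items found in the checkpoint."""
    checkpoint = Path(checkpoint_file)
    done = {}
    # Load results from an earlier (possibly interrupted) run.
    if checkpoint.exists():
        with checkpoint.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    record = json.loads(line)
                    done[record["idx"]] = record["result"]
    # Append each new result immediately so that a crash loses little work.
    with checkpoint.open("a", encoding="utf-8") as fh:
        for idx, item in enumerate(items):
            if idx in done:
                continue
            done[idx] = work_fn(item)
            fh.write(json.dumps({"idx": idx, "result": done[idx]}, ensure_ascii=False) + "\n")
            fh.flush()
    return [done[idx] for idx in range(len(items))]
```

In this sketch `work_fn` stands in for a single translation request; calling the function a second time with the same checkpoint file only processes the items that are still missing.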
In the next step we want to use models or APIs to generate an answer to given columns. This script will do that for you. The only required input is the given user-column, which is used as the user message, and optionally a system-column. The model's answer to those will be saved into the response-column (defaults to response).
Example usage:
answer-hf --help
This script allows you to build a conversation in a single model response. Importantly, the specified system_prompt is supposed to tell the model to create a multi-turn conversation and also give an example of such a conversation, with specified identifiers for the user and assistant in the generated conversation. These identifiers should also be given in this script (defaults to user: and assistant:).
You can also specify personas with --personas, which should be a JSON file containing a main key personas with persona names and their descriptions. These descriptions can then be passed to the system_prompt as long as it has a {persona} field in its text. The JSON file can optionally also have a weights key, which indicates how the different personas are weighted when one is chosen at random. If not given, all personas are equally likely. To repeat: when you provide a personas file, a persona description will be randomly selected for each sample (optionally weighted) and plugged into the system_prompt that you provided, as long as that text (file) contains the string {persona}.
Example usage:
answer-hf --help
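The persona selection described above can be sketched as follows. This is an illustration under assumptions, not the script's code: the helper names are hypothetical, and the layout of the weights key (a mapping from persona name to a relative weight) is a guess, since only its existence is documented here.

```python
import random


def pick_persona(personas_data, rng=random):
    """Pick one persona description at random, optionally weighted."""
    names = list(personas_data["personas"])
    weights = personas_data.get("weights")
    if weights is None:
        # No weights given: all personas are equally likely.
        return personas_data["personas"][rng.choice(names)]
    # Assumption: "weights" maps each persona name to a relative weight.
    chosen = rng.choices(names, weights=[weights[name] for name in names], k=1)[0]
    return personas_data["personas"][chosen]


def fill_system_prompt(system_prompt, personas_data, rng=random):
    """Plug a persona into the prompt only if it contains {persona}."""
    if "{persona}" not in system_prompt:
        return system_prompt
    return system_prompt.replace("{persona}", pick_persona(personas_data, rng))


# Illustrative content of a personas file, parsed into a dictionary.
example = {
    "personas": {"teacher": "a patient teacher", "pirate": "a grumpy pirate"},
    "weights": {"teacher": 3, "pirate": 1},
}
prompt = fill_system_prompt("You are {persona}. Write a conversation.", example)
```

Using `str.replace` rather than `str.format` keeps other curly braces in the system prompt untouched, which matters when the prompt contains an example conversation with JSON-like text.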
An interactive script to add language identification to specified columns in your dataset. The script handles messages (lists of dictionaries) by simply concatenating the values of all content keys. The script will add {colname}_lid and {colname}_lid_prob columns to your dataset.
Usage: simply run interactive-lid and follow instructions.
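As a rough sketch of the column handling described above: plain text columns are used as they are, while message columns are flattened into one string before language identification. The helper names and the single-space separator are assumptions for illustration.

```python
def text_for_lid(value):
    """Return the text that language identification would run on."""
    if isinstance(value, str):
        return value
    # Messages: a list of dictionaries, each with a "content" key.
    # Assumption: the contents are joined with a single space.
    return " ".join(message["content"] for message in value)


def lid_column_names(colname):
    """Names of the two columns that are added for a given column."""
    return f"{colname}_lid", f"{colname}_lid_prob"
```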
An interactive script to filter out non-Dutch messages from your dataset. It does so based on the columns added with interactive-lid, so that script should be used first.
In addition to language filtering, it also allows you to filter out messages with specific characteristics. Text matching occurs in a case-insensitive manner.
- messages with non-Roman characters are removed (every character must have "LATIN" in its unicode name; note that this solution is not flawless: https://stackoverflow.com/a/3308844/1150683). This is a very strict filter and will lead to the removal of data that you may have wanted to keep (e.g. messages that involve a translation task to non-Latin script languages)
- messages that are not identified as Dutch and that are longer than three white-space separated tokens are removed
- any text containing "spijt me", "spijt mij", "sorry", "mijn excuses", because those often indicate that the system could not successfully reply to a request
- any text containing "It seems like there was a typo", "assistant", because those often indicate that the system could not successfully reply to a request. Note that assistant is the English word (assistent is Dutch), so when assistant appears something is likely wrong
- any text indicating knowledge cut-offs:
  - kennisafsluiting in 2023
  - kennisstop in 2023
  - kennisafsnijdatum van 2023
  - cutoff in 2023
  - Tot mijn kennis die bijgewerkt is tot begin 2023
  - Voor zover mijn kennis reikt tot 2023
  - Vanaf mijn kennis tot begin 2023
  - As of my last update in 2023
- any text referencing other language models:
  - ChatGPT
  - Chat GPT
  - GPT3
  - GPT 3
  - gpt-3
  - gpt-3.5-turbo
  - GPT4
  - GPT 4
  - gpt-4
  - gpt-4-turbo
  - OpenAI
  - ShareGPT
- any self-referencing text about being a language model. This often indicates that a model is not capable of a specific task, in which case we drop those samples to instead focus on the tasks that it can do.
  The following strings are matched in a template for all occurrences of "als [een] {}", "ben [een] {}", "{} ben":
  - AI-assistent
  - AI-gebaseerde assistent
  - virtuele assistent
  - digitale assistent
  - tekst-assistent
  - AI tekstgebaseerde asssistent
  - tekstgebaseerde asssistent
  - assistent
  - taalmodel
  - AI-taalmodel
  - AI taalmodel
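The filters listed above can be approximated with the following sketch. It is not the script's implementation: the function names are hypothetical, the string lists are shortened, the Dutch label "nl" is an assumed output of the language identification step, and the Latin check only inspects alphabetic characters, as in the linked Stack Overflow answer.

```python
import re
import unicodedata

# Shortened, illustrative subsets of the lists above.
BLOCKED_SUBSTRINGS = ["spijt me", "spijt mij", "sorry", "mijn excuses"]
SELF_REFERENCE_TERMS = ["AI-assistent", "assistent", "taalmodel", "AI-taalmodel"]


def only_roman_chars(text):
    """True if every alphabetic character has LATIN in its unicode name."""
    return all("LATIN" in unicodedata.name(ch, "") for ch in text if ch.isalpha())


def contains_any(text, substrings):
    """Case-insensitive substring matching."""
    lowered = text.lower()
    return any(sub.lower() in lowered for sub in substrings)


def build_self_reference_pattern(terms):
    """Regex for "als [een] {}", "ben [een] {}" and "{} ben"."""
    # Longest terms first, so that "AI-taalmodel" wins over "taalmodel".
    alts = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.compile(
        rf"\b(?:als|ben)\s+(?:een\s+)?(?:{alts})\b|\b(?:{alts})\s+ben\b",
        re.IGNORECASE,
    )


def keep_message(text, lang, pattern, dutch_label="nl"):
    """Return False if any of the sketched filters would remove the text."""
    if not only_roman_chars(text):
        return False
    # Non-Dutch text is only removed when it has more than three tokens.
    if lang != dutch_label and len(text.split()) > 3:
        return False
    if contains_any(text, BLOCKED_SUBSTRINGS):
        return False
    return pattern.search(text) is None


pattern = build_self_reference_pattern(SELF_REFERENCE_TERMS)
```

Here `lang` would come from the {colname}_lid column added by interactive-lid. Compiling the template terms into one regular expression keeps the check to a single pass over each message.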
Licensed under GPLv3.