Test Prompt Generator

Create prompts with a given token length for testing LLMs and other transformers text models.

Pre-created prompts for popular model architectures are provided in .jsonl files in the prompts directory.

To generate one or a few prompts, or to test the functionality, you can use the Test Prompt Generator Space on Hugging Face.

Install

pip install git+https://github.com/helena-intel/test-prompt-generator.git transformers

Some tokenizers may require additional dependencies, for example sentencepiece or protobuf.

Usage

Specify a tokenizer, and the number of tokens the prompt should have. A prompt will be returned that, when tokenized with the given tokenizer, contains the requested number of tokens.

For tokenizer, use a model_id from the Hugging Face hub, a path to a local file, or one of the preset tokenizers: ['bert', 'blenderbot', 'bloom', 'bloomz', 'chatglm3', 'falcon', 'gemma', 'gpt-neox', 'llama', 'magicoder', 'mistral', 'mpt', 'opt', 'phi-2', 'pythia', 'qwen', 'redpajama', 'roberta', 'starcoder', 't5', 'vicuna', 'zephyr']. The preset tokenizers should work for most models with that architecture, but if you want to be sure, use an exact model_id. This list shows the exact tokenizers used for the presets.

Prompts are generated by truncating a given source text at the provided number of tokens. By default Alice in Wonderland is used; you can also provide your own source. A prefix can optionally be prepended to the text, to create prompts like "Please summarize the following text: [text]". The prompts are returned by the function/command line app, and can also optionally be saved to a .jsonl file.
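The truncation approach described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it uses whitespace splitting as a stand-in "tokenizer", whereas the real tool counts tokens with a Hugging Face tokenizer. The function name and signature here are hypothetical.

```python
# Minimal sketch of prompt generation by truncation. Whitespace splitting
# stands in for a real tokenizer (an assumption for illustration only).

def generate_prompt_sketch(source_text: str, num_tokens: int, prefix: str = "") -> str:
    """Return a prompt of `num_tokens` whitespace tokens by truncating source_text."""
    prefix_tokens = prefix.split() if prefix else []
    # The prefix counts toward the total, so truncate the source to the remainder
    remaining = num_tokens - len(prefix_tokens)
    body_tokens = source_text.split()[:remaining]
    return " ".join(prefix_tokens + body_tokens)

source = "Alice was beginning to get very tired of sitting by her sister on the bank"
prompt = generate_prompt_sketch(source, num_tokens=8, prefix="Summarize:")
print(prompt)  # 8 whitespace tokens in total, including the prefix
```

With a real subword tokenizer the truncation point is found in token space rather than word space, but the principle is the same.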

Python API

Basic usage

from test_prompt_generator import generate_prompt

# use preset value for opt tokenizer
prompt = generate_prompt(tokenizer_id="opt", num_tokens=32)
# use model_id
prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)

Slightly less basic usage

Add a source_text_file and prefix. Instead of source_text_file, you can also pass source_text containing a string with the source text.

from test_prompt_generator import generate_prompt

prompt = generate_prompt(
    tokenizer_id="mistral",
    num_tokens=32,
    source_text_file="source.txt",
    prefix="Please translate to Dutch:",
    output_file="prompt_32.jsonl",
)

Use multiple token sizes. When using multiple token sizes, output_file is required, and the generate_prompt function does not return anything. The output_file will contain one line for each token size.

prompt = generate_prompt(
    tokenizer_id="mistral",
    num_tokens=[32, 64, 128],
    output_file="prompts.jsonl",
)

NOTE: When specifying one token size, the prompt will be returned as a string, making it easy to copy and use in a test scenario where you need one prompt. When specifying multiple token sizes, a dictionary with the prompts will be returned. The output file is always in .jsonl format, regardless of the number of generated prompts.
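A .jsonl output file contains one JSON object per line, so it can be read with the standard json module. A small sketch, assuming field names like "token_size" and "prompt" purely for illustration; inspect a generated file for the actual keys.

```python
import json

# Parse a .jsonl file: each line is an independent JSON object.
# The field names below are assumptions for illustration, not the
# tool's documented schema.
sample_lines = [
    '{"token_size": 32, "prompt": "Alice was beginning to get very tired..."}',
    '{"token_size": 64, "prompt": "Alice was beginning to get very tired of sitting..."}',
]
records = [json.loads(line) for line in sample_lines]
for rec in records:
    print(rec["token_size"], rec["prompt"][:40])
```

In practice you would iterate over the lines of the generated file (`open("prompts.jsonl")`) instead of the in-memory sample above.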

Command Line App

test-prompt-generator -t mistral -n 32

Use test-prompt-generator --help to see all options:

usage: test-prompt-generator [-h] -t TOKENIZER -n NUM_TOKENS [-p PREFIX] [-o OUTPUT_FILE] [--overwrite] [-v] [-f FILE]

options:
  -h, --help            show this help message and exit
  -t TOKENIZER, --tokenizer TOKENIZER
                        preset tokenizer id, model_id from Hugging Face hub, or path to local directory with tokenizer files. Options for presets are: ['bert', 'bloom', 'gemma', 'chatglm3', 'falcon', 'gpt-neox',
                        'llama', 'magicoder', 'mistral', 'opt', 'phi-2', 'pythia', 'roberta', 'qwen', 'starcoder', 't5']
  -n NUM_TOKENS, --num_tokens NUM_TOKENS
                        Number of tokens the generated prompt should have. To specify multiple token sizes, use e.g. `-n 16 32`
  -p PREFIX, --prefix PREFIX
                        Optional: prefix that the prompt should start with. Example: 'Translate to Dutch:'
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Optional: Path to store the prompt as .jsonl file
  --overwrite           Overwrite output_file if it already exists.
  -v, --verbose
  -f FILE, --file FILE  Optional: path to text file to generate prompts from. Default text_files/alice.txt

Disclaimer

This software is provided "as is" and for testing purposes only. The author makes no warranties, express or implied, regarding the software's operation, accuracy, or reliability.