LLM Classifier - Instantly classify data with Lamini & Llama 2
Train a new classifier with just a prompt. No data needed -- but add data to boost, if you have it.
from lamini import LaminiClassifier
llm = LaminiClassifier()
prompts={
"cat": "Cats are generally more independent and aloof than dogs, who are often more social and affectionate. Cats are also more territorial and may be more aggressive when defending their territory. Cats are self-grooming animals, using their tongues to keep their coats clean and healthy. Cats use body language and vocalizations, such as meowing and purring, to communicate.",
"dog": "Dogs are more pack-oriented and tend to be more loyal to their human family. Dogs, on the other hand, often require regular grooming from their owners, including brushing and bathing. Dogs use body language and barking to convey their messages. Dogs are also more responsive to human commands and can be trained to perform a wide range of tasks.",
}
llm.prompt_train(prompts)
llm.save("models/my_model.lamini")
Then, predict!
llm.predict(["meow"])
>> ["cat"]
llm.predict(["meow", "woof"])
>> ["cat", "dog"]
Note: each prompt class makes 10 LLM inference calls. See advanced section below to change this.
This can help with improving your classifier. For example, if the LLM is ever wrong:
llm.predict(["i like milk", "i like bones"])
>> ["dog", "cat"] # wrong!
And your LLM classifier will learn it:
llm = LaminiClassifier()
llm.add_data_to_class("cat", "i like milk.")
llm.add_data_to_class("dog", ["i like bones"]) # list of examples is valid too
llm.prompt_train(prompts)
llm.predict(["i like milk", "i like bones"])
>> ["cat", "dog"] # correct!
If you include data on classes that aren't in your classes
, then the classifier will include them as new classes, and learn to predict them. However, without a prompt, it won't have a description to use to further boost them.
General guideline: if you don't have any or little data on a class, then make sure to include a good prompt for it. Like prompt-engineering any LLM, creating good descriptions---e.g. with details and examples---helps the LLM get the right thing.
# Load data
llm.saved_examples_path = "path/to/examples.jsonl" # overrides default at /tmp/saved_examples.jsonl
llm.load_examples()
# Print data
print(llm.get_data())
# Train on that data (with prompts)
llm.prompt_train(prompts)
# Just train on that data (without any prompts)
llm.train()
Finally, you can save the data.
# Save data
llm.saved_examples_path = "path/to/examples.jsonl"
llm.save_examples()
Format of what examples.jsonl
looks like:
{"class_name": "cat", "examples": ["i like milk", "meow"]}
{"class_name": "dog", "examples": ["woof", "i like bones"]}
Change the number of LLM examples (and thus inference calls up to this number) per prompt:
llm = LaminiClassifier(augmented_example_count=5) # 10 is default
Note that we found 10 to be a good proxy for training an effective classifier.
If you're working with more classes and want to lift performance, a higher augmented_example_count
can help.
./train.sh
We have some default classes. You can specify your own super easily like this:
./train.sh --class "cat: CAT_PROMPT" --class "dog: DOG_PROMPT"
The prompts are descriptions of your classes.
./classify.sh 'woof'
You can get the probabilities for all the classes, in this case dog
(62%) and cat
(38%). These can help with gauging uncertainty.
{
'data': 'woof',
'prediction': 'dog',
'probabilities': array([0.37996491, 0.62003509])
}
Here are our cat/dog prompts.
Cat prompt:
Cats are generally more independent and aloof. Cats are also more territorial and may be more aggressive when defending their territory.
Cats are self-grooming animals, using their tongues to keep their coats clean and healthy. Cats use body language and vocalizations,
such as meowing and purring, to communicate. An example cat is whiskers, who is a cat who lives in a house with a human.
Another example cat is furball, who likes to eat food and sleep. A famous cat is garfield, who is a cat who likes to eat lasagna.
Dog prompt:
Dogs are social animals that live in groups, called packs, in the wild. They are also highly intelligent and trainable.
Dogs are also known for their loyalty and affection towards their owners. Dogs are also known for their ability to learn and
perform a variety of tasks, such as herding, hunting, and guarding. An example dog is snoopy, who is the best friend of
charlie brown. Another example dog is clifford, who is a big red dog.
./classify.sh --data "I like to sharpen my claws on the furniture." --data "I like to roll in the mud." --data "I like to run any play with a ball." --data "I like to sleep under the bed and purr." --data "My owner is charlie brown." --data "Meow, human! I'm famished! Where's my food?" --data "Purr-fect." --data "Hiss! Who dared to wake me from my nap? I'll have my revenge... later." --data "I'm so happy to see you! Can we go for a walk/play fetch/get treats now?" --data "I'm feeling a little ruff today, can you give me a belly rub to make me feel better?"
{'data': 'I like to sharpen my claws on the furniture.',
'prediction': 'cat',
'probabilities': array([0.55363432, 0.44636568])}
{'data': 'I like to roll in the mud.',
'prediction': 'dog',
'probabilities': array([0.4563782, 0.5436218])}
{'data': 'I like to run any play with a ball.',
'prediction': 'dog',
'probabilities': array([0.44391914, 0.55608086])}
{'data': 'I like to sleep under the bed and purr.',
'prediction': 'cat',
'probabilities': array([0.51146226, 0.48853774])}
{'data': 'My owner is charlie brown.',
'prediction': 'dog',
'probabilities': array([0.40052991, 0.59947009])}
{'data': "Meow, human! I'm famished! Where's my food?",
'prediction': 'cat',
'probabilities': array([0.5172964, 0.4827036])}
{'data': 'Purr-fect.',
'prediction': 'cat',
'probabilities': array([0.50431873, 0.49568127])}
{'data': "Hiss! Who dared to wake me from my nap? I'll have my revenge... "
'later.',
'prediction': 'cat',
'probabilities': array([0.50088163, 0.49911837])}
{'data': "I'm so happy to see you! Can we go for a walk/play fetch/get treats "
'now?',
'prediction': 'dog',
'probabilities': array([0.42178513, 0.57821487])}
{'data': "I'm feeling a little ruff today, can you give me a belly rub to make "
'me feel better?',
'prediction': 'dog',
'probabilities': array([0.46141002, 0.53858998])}
Clone this repo, and run the train.sh
or classify.sh
command line tools.
Requires docker: https://docs.docker.com/get-docker
Setup your lamini keys (free): https://lamini-ai.github.io/
git clone git@github.com:lamini-ai/llm-classifier.git
cd llm-classifier
Train a new classifier.
./train.sh --help
usage: train.py [-h] [--class CLASS [CLASS ...]] [--train TRAIN [TRAIN ...]] [--save SAVE] [-v]
options:
-h, --help show this help message and exit
--class CLASS [CLASS ...]
The classes to use for classification, in the format 'class_name:prompt'.
--train TRAIN [TRAIN ...]
The training data to use for classification, in the format 'class_name:data'.
--save SAVE The path to save the model to.
-v, --verbose Whether to print verbose output.
Classify your data.
./classify.sh --help
usage: classify.py [-h] [--data DATA [DATA ...]] [--load LOAD] [-v] [classify ...]
positional arguments:
classify The data to classify.
options:
-h, --help show this help message and exit
--data DATA [DATA ...]
The training data to use for classification, any string.
--load LOAD The path to load the model from.
-v, --verbose Whether to print verbose output.
These command line scripts just call python inside of docker so you don't have to care about an environment.
If you hate docker, you can also run this from python easily...
Install it
pip install lamini
Instantiate a classifier
from lamini import LaminiClassifier
# Create a new classifier
classifier = LaminiClassifier()
Define classes using prompts
classes = { "SOME_CLASS" : "SOME_PROMPT" }
classifier.prompt_train(classes)
Or if you have some training examples (optional)
data = ["example 1", "example 2"]
classifier.add_data_to_class("SOME_CLASS", data)
# Don't forget to train after adding data
classifier.prompt_train()
Classify your data
# Classify the data - in a list of string(s)
prediction = classifier.predict(list_of_strings)
# Get the probabilities for each class
probabilities = classifier.predict_proba(list_of_strings)
Save your model
classifier.save("SOME_PATH")
Load your model
classifier = LaminiClassifier.load("SOME_PATH")
The LLM classifier converts your prompts into a pile of data, using the Llama 2 LLM. It then finetunes another LLM to distinguish between each pile of data.
We use several specialized LLMs derived from Llama 2 to convert prompts into piles of training examples for each class. The code for this is available
in the lamini
python package if you want to look at it.
No, this is a week night hackathon project, give us feedback and we will improve it. Some known issues:
- It doesn't use batching aggressively over classes, so training on many classes could be sped up by more than 100x.
- We are refining the LLM example generators. Send us any issues you find with your prompts and we can improve these models.
You don't need to label any data using LaminiClassifier
. Labeling data sucks.
No fiddling with hyperparameters. Fiddle with prompts instead. Hopefully English is easier than attention_dropout_pcts.
A classifier always outputs a valid class. An LLM might answer the question "Is this talking about a cat" with "Well... that depends on ....". Writing a parser sucks.
Added benefit: classifiers give you probabilities and can be calibrated: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/