This repository contains code for an end-to-end (E2E) framework for generating cheese descriptions using Large Language Models (LLMs).
Install the required dependencies:

```bash
pip install -r requirements.txt
```
For the preparation of training data, we use the OpenAI GPT-3.5 Turbo model. The raw data, comprising 600 cheese description text files, is present in `./Data/raw_data`.
We have prepared a formatted JSONL file, `./Data/template_english_dict.jsonl`, with model lines based on the CDG tool (corresponding to all the moves and steps).
Using this as prior information about the domain, we prepare our own schema with custom slots for each rhetorical section (M1-M6) and store them in `./Data/slots_data`. The files for M1-M6 can be found in the following locations within this directory (an illustrative sketch of the slot schema follows the list):
- M1: `identification.txt`
- M2: `description.txt`
- M3: `process.txt`
- M4: `smell_taste.txt`
- M5: `service_suggestions.txt`
- M6: `quality_assurance.txt`
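As a rough illustration only (the slot names below are hypothetical and are not the actual contents of the `slots_data` files), the per-move slot definitions can be thought of as named fields to be filled from each description:

```python
# Hypothetical example of what per-move slot schemas could look like.
# The real slot definitions live in ./Data/slots_data/*.txt; these names are illustrative only.
example_slots = {
    "M1_identification": ["cheese_name", "country_of_origin", "milk_type"],
    "M2_description": ["texture", "colour", "rind_type"],
    "M4_smell_taste": ["aroma", "flavour_notes", "intensity"],
}

# During slot filling, each slot is paired with a value extracted from the realigned text,
# e.g. {"cheese_name": "Comté", "milk_type": "cow"}.
```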
The data synthesis pipeline follows two steps:
1. `./Notebooks/data_reallignment.ipynb`: Run this notebook to invoke the OpenAI APIs and restructure the raw data instances into the 6 rhetorical categories. Update the file paths in the notebook for new datasets. This outputs `./Data/slots_data/rhet_data.json`.
2. `./Notebooks/slot_filling.ipynb`: Run this notebook with the output file from step 1. The notebook reads the `slots_data` files for slot definitions, takes the realigned text from step 1, and injects it into the slot-filling prompts. It then invokes the OpenAI APIs to auto-fill the slots with values from the text. This outputs `./Data/slots_data/rhet_data_slots.json`.
The notebooks use the prompt templates defined in `./Notebooks/templates.py` to construct the prompts used for querying.
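As a minimal sketch of how such a templated call might look (the template text and helper function below are illustrative and are not the actual contents of `templates.py`), assuming the `openai` Python client:

```python
from openai import OpenAI  # assumes openai>=1.0; adjust if the notebooks use an older client

client = OpenAI(api_key="YOUR_API_KEY")

# Hypothetical prompt template; the real templates live in ./Notebooks/templates.py.
SLOT_FILLING_TEMPLATE = (
    "Fill the following slots using only the text provided.\n"
    "Slots: {slots}\n"
    "Text: {text}\n"
    "Return a JSON object mapping each slot to its value."
)

def fill_slots(slots, text):
    """Illustrative helper: build a slot-filling prompt and query GPT-3.5 Turbo."""
    prompt = SLOT_FILLING_TEMPLATE.format(slots=", ".join(slots), text=text)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output for data synthesis
    )
    return response.choices[0].message.content
```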
Use `./Notebooks/data_analysis.ipynb` to analyze the prepared data and generate the cleaned data after post-processing. The cleaned data is then split into train, validation, and test splits using the notebook `./Notebooks/train_val_test_split.ipynb` (a sketch of the splitting step follows the list below):
- train_data: `./Data/slots_data/rhet_data_slots_cleaned_train.json`
- val_data: `./Data/slots_data/rhet_data_slots_cleaned_val.json`
- test_data: `./Data/slots_data/rhet_data_slots_cleaned_test.json`
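A minimal sketch of such a split, assuming scikit-learn is available (the input path, ratios, and random seed are examples and may differ from the notebook):

```python
import json
from sklearn.model_selection import train_test_split

# Assumed path of the cleaned data produced by data_analysis.ipynb.
with open("./Data/slots_data/rhet_data_slots_cleaned.json") as f:
    records = json.load(f)

# Example 80/10/10 split; the actual proportions in the notebook may differ.
train, rest = train_test_split(records, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"./Data/slots_data/rhet_data_slots_cleaned_{name}.json", "w") as f:
        json.dump(split, f, indent=2)
```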
Note:
- Invoking the OpenAI service requires creating an account on the platform and adding credits in USD; please refer to the OpenAI documentation for details. Once the account is created, an API key is obtained, which should be updated within the appropriate notebook cell (see the short sketch after this list).
- Additional notebooks in the `./Notebooks/` folder are experimental and can be ignored.
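A minimal sketch of such a notebook cell, assuming the `openai` Python client (the environment-variable name is only a suggestion):

```python
import os
from openai import OpenAI

# Prefer reading the key from the environment rather than hard-coding it in the notebook.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY"))
```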
To train a model, run:

```bash
python train.py --config configs/base.json --generation "<generation_type>"
```
Note: Training requires GPU support (a CUDA environment). The code will not run in a CPU-only environment, or will be extremely slow.
To run inference, use `test.py` with one of the supported decoding strategies (a sketch of the corresponding decoding parameters follows the commands):

```bash
python test.py --config configs/base.json --generation "greedy"
python test.py --config configs/base.json --generation "beam"
python test.py --config configs/base.json --generation "sampling_topk"
python test.py --config configs/base.json --generation "sampling_topp"
python test.py --config configs/base.json --generation "sampling_topk_topp"
```
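As a rough guide to what these strategies mean (the concrete hyperparameter values below are examples, not the repository's settings), they map onto the standard Hugging Face `generate()` arguments roughly as follows:

```python
# Illustrative mapping from the --generation flag to transformers generate() arguments.
GENERATION_KWARGS = {
    "greedy":             dict(do_sample=False, num_beams=1),
    "beam":               dict(do_sample=False, num_beams=5),
    "sampling_topk":      dict(do_sample=True, top_k=50),
    "sampling_topp":      dict(do_sample=True, top_p=0.95),
    "sampling_topk_topp": dict(do_sample=True, top_k=50, top_p=0.95),
}

# Example usage inside an inference loop:
# outputs = model.generate(**inputs, max_new_tokens=256, **GENERATION_KWARGS["beam"])
```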
Template-based inference can be done by specifying the template file path under `data -> test` within the config. A sample template is provided under `configs/cheese_template.json`. When running `test.py`, specify the `--from_template` flag as shown below:
```bash
python test.py --config configs/base.json --generation "<generation_type>" --from_template
```
- The config file `./configs/base.json` contains the training and inference parameters, which should be set appropriately before running training or inference. The `type` parameter refers to the pre-trained model config type defined under the "model" key within the config. Currently supported pre-trained models are `bert-base-uncased`, `EleutherAI/gpt-neo-1.3B`, `t5-base`, and `google/flan-t5-base`. The config can be extended when new model code is added. (An illustrative sketch of the config structure follows these notes.)
- Training parameters, including optimizer settings, learning rate, checkpoint paths, etc., are defined under the "train" key.
- For loading pre-trained models to be used for inference, set the "checkpoint" paths to point to the training "save_dir" that was used for fine-tuning.
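An illustrative sketch of how the config might be organized, based only on the keys mentioned above (the exact field names and values in `./configs/base.json` may differ):

```python
# Hypothetical shape of configs/base.json, shown as a Python dict for illustration only.
example_config = {
    "model": {
        "type": "t5-base",          # one of the supported pre-trained models
    },
    "train": {
        "learning_rate": 5e-5,      # example value, not the repository's default
        "optimizer": "adamw",
        "save_dir": "./checkpoints/t5_base_cheese/",
    },
    "data": {
        "test": "configs/cheese_template.json",  # template path for --from_template inference
    },
    "checkpoint": "./checkpoints/t5_base_cheese/",  # points at the training save_dir for inference
}
```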
For developers planning to extend this repository to generate other product descriptions or to fine-tune other pre-trained architectures, here are some details on the source code:
- The `collators` directory contains custom collators defined per model, where the custom logic for batching data is defined (PyTorch definitions); see the sketch after this list.
- `dataset.py` contains pre-defined pipelines for loading the prepared data and storing it in memory as data-to-text task data.
- `pipelines.py` contains model-specific data pre-processing code (tokenization, attention masks, etc.).
- `loaders` contains code for loading pre-trained or fine-tuned checkpoints based on the configs. It creates the models and updates the weights from the given checkpoint paths.
- `models` contains the source code for the different model architectures used in this project, including the model forward definitions.
- `train.py` and `test.py` contain the training and testing code (initialization of datasets, optimizers, models, training/inference pipelines, checkpoint saving, evaluation, etc.).
- `evaluation` contains the quantitative metric definitions for evaluating model performance, such as ROUGE-1/2/L, METEOR, BLEU-4, BERTScore, etc.
- Use the `Script` folder to store automation shell scripts for consecutive training or inference jobs.
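As a starting point for adding support for a new model, a minimal collator might look like the following (this is a generic sketch, not code from the `collators` directory; the class name and the `source`/`target` field names are illustrative assumptions):

```python
class NewModelCollator:
    """Illustrative collator: pads tokenized source/target pairs into a training batch."""

    def __init__(self, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, batch):
        # 'source' / 'target' are hypothetical field names produced by the dataset pipeline.
        sources = [example["source"] for example in batch]
        targets = [example["target"] for example in batch]

        inputs = self.tokenizer(
            sources, padding=True, truncation=True,
            max_length=self.max_length, return_tensors="pt",
        )
        labels = self.tokenizer(
            targets, padding=True, truncation=True,
            max_length=self.max_length, return_tensors="pt",
        ).input_ids
        # Ignore padding tokens when computing the loss.
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            "input_ids": inputs.input_ids,
            "attention_mask": inputs.attention_mask,
            "labels": labels,
        }
```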