PTEC

Code repository corresponding to the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation" (NAACL 2024).


Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation

Installation · Reproducibility · Usage · Paper · NAACL 2024 Presentation · Blog Post · Citation

⚠️ This repository has migrated ⚠️

For an up-to-date codebase, issues, and pull requests, please continue to the new repository. This repository will no longer be maintained, and issues and pull requests may be ignored.

This repository contains the code accompanying the paper "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation". We also recommend reading our blog post "How EQT Motherbrain uses LLMs to map companies to industry sectors".

Installation

After cloning this repository, the necessary packages can be installed with:

pip install -r requirements.txt
pip install -e .

# if using a Vertex AI notebook with CUDA
pip3 install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 --no-cache-dir

Reproducibility

All experiments, including the hyperparameter search, can be reproduced by running the following shell scripts:

bash preprocessing/preprocessing.sh
bash sectors/experiments/run_experiments_gpu.sh
bash sectors/experiments/run_experiments_cpu.sh

Usage

The scripts can also be run individually:

Preprocessing

The preprocessed data for the hatespeech dataset is already contained in this repository. However, the preprocessing can be rerun with:

python preprocessing/get_dataset.py
python preprocessing/preprocess_data.py # this line will take ~10 min as it summarizes long descriptions and keyword lists

The preprocessed dataset can be augmented by applying paraphrasing with Vicuna:

python preprocessing/paraphrase_augmentation.py

This will create a new dataset data/[DATASET]/train_augmented.json.
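As a quick sanity check, the augmented file can be inspected with a few lines of Python. This is only a sketch: the dataset directory name and the assumption that train_augmented.json contains a list of records need to be adapted to your setup.

import json

dataset = "hatespeech"  # replace with your dataset directory name
with open(f"data/{dataset}/train_augmented.json") as f:
    train_augmented = json.load(f)  # assumed to be a list of training records

print(f"{len(train_augmented)} augmented training examples")
print(train_augmented[0])  # inspect the fields of the first record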

Running The Experiments

For test runs, all of the following commands include the --model_name bigscience/bloom-560m flag, as this model can easily be run on a CPU. It can also be replaced with other LLaMA or BLOOM models hosted on Hugging Face; by default, huggyllama/llama-7b is used. All experimental results are saved as JSON files in the results/[DATASET]/ directory.
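Since every run writes its results to results/[DATASET]/, a quick way to compare runs is to load all result files at once. The snippet below is only a sketch and assumes flat JSON files in that directory; the exact file names and metrics depend on the experiments you ran.

import json
from pathlib import Path

dataset = "hatespeech"  # replace with your dataset directory name
for path in sorted(Path(f"results/{dataset}").glob("*.json")):
    with open(path) as f:
        results = json.load(f)
    print(path.name, results)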

N-shot experiments

python sectors/experiments/nshot/nshot.py --model_name bigscience/bloom-560m

In order to use gpt-3.5-turbo as a model for n-shot prompting, a .env file with the OpenAI API credentials needs to be added to the root directory of this repository:

OPENAI_SECRET_KEY = "secret key"
OPENAI_ORGANIZATION_ID = "org id"
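The n-shot script is expected to read these credentials itself; if you want to verify that the .env file is picked up correctly, a minimal check with python-dotenv (assuming the variable names above) looks like this:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file from the current working directory
assert os.getenv("OPENAI_SECRET_KEY"), "OPENAI_SECRET_KEY is missing from .env"
assert os.getenv("OPENAI_ORGANIZATION_ID"), "OPENAI_ORGANIZATION_ID is missing from .env"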

Embedding Proximity

For these experiments, the embeddings first have to be generated by running the following code:

python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m
# for augmented data
python embedding_proximity/generate_embeddings.py --model_name bigscience/bloom-560m --augmented augmented

Then, the following code runs all embedding proximity experiments:

python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --model_name bigscience/bloom-560m --augmented augmented

python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m
python embedding_proximity/vector_similarity.py --type RadiusNN --model_name bigscience/bloom-560m --augmented augmented

python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m
python embedding_proximity/classification_head/classification_head.py --model_name bigscience/bloom-560m --augmented augmented
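Conceptually, the vector similarity baselines assign sectors by comparing a company's embedding to the embeddings of labelled training examples (via k-nearest or radius neighbours). The sketch below illustrates the k-nearest-neighbour variant on random data; the distance metric, neighbour count, and label aggregation are assumptions and not necessarily what the repository uses.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative data: embeddings of labelled training examples and one query embedding.
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(100, 16))
train_labels = [set(rng.choice(5, size=2, replace=False)) for _ in range(100)]
query_embedding = rng.normal(size=(1, 16))

# k-nearest-neighbour lookup in embedding space.
knn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(train_embeddings)
_, neighbour_idx = knn.kneighbors(query_embedding)

# Predict every label that occurs among the nearest neighbours.
predicted_labels = set().union(*(train_labels[i] for i in neighbour_idx[0]))
print(predicted_labels)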

Prompt Tuning

python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --interrupt_threshold 0.01 --augmented augmented

PTEC

python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01
python prompt_tuning/prompt_tune.py --model_name bigscience/bloom-560m --head ch --scheduler exponential --interrupt_threshold 0.01 --augmented augmented
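Conceptually, PTEC combines prompt tuning with a classification head (--head ch): trainable soft-prompt embeddings are prepended to the input embeddings of the frozen language model, and the resulting representation is passed to a multi-label classification head. The sketch below is a minimal illustration of that idea, not the repository's implementation; the pooling strategy, prompt length, and number of labels are assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model_name = "bigscience/bloom-560m"
num_labels = 75          # assumption: number of sectors in the taxonomy
num_prompt_tokens = 20   # assumption: length of the tuned soft prompt

tokenizer = AutoTokenizer.from_pretrained(model_name)
lm = AutoModel.from_pretrained(model_name)
lm.requires_grad_(False)  # the language model stays frozen; only prompt and head are trained

hidden_size = lm.config.hidden_size
soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)
classification_head = nn.Linear(hidden_size, num_labels)

def classify(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    inputs_embeds = lm.get_input_embeddings()(batch["input_ids"])
    # prepend the trainable soft prompt to every sequence in the batch
    prompt = soft_prompt.unsqueeze(0).expand(inputs_embeds.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, inputs_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(inputs_embeds.size(0), num_prompt_tokens, dtype=batch["attention_mask"].dtype),
         batch["attention_mask"]], dim=1)
    hidden = lm(inputs_embeds=inputs_embeds, attention_mask=attention_mask).last_hidden_state
    pooled = hidden[:, -1, :]  # simplification: last-position pooling, ignoring padding
    return classification_head(pooled)  # logits; train with BCEWithLogitsLoss for multi-label

probs = torch.sigmoid(classify(["A company developing battery recycling technology."]))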

Other Resources

For an example of applying Trie Search, see notebooks/constrained_beam_search.ipynb.
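The idea behind Trie Search is to constrain decoding so that only valid label names can be generated: every label is tokenized, the token sequences are stored in a trie, and at each step only tokens that continue some label are allowed. A minimal, self-contained sketch of such a token trie (labels and token ids are made up for illustration) could look like this:

# Build a trie over the token ids of all valid labels (ids here are illustrative).
labels = {
    "Renewable Energy": [1001, 1002],
    "Renewable Materials": [1001, 1003],
    "Fintech": [2001],
}

trie = {}
for token_ids in labels.values():
    node = trie
    for tok in token_ids:
        node = node.setdefault(tok, {})
    node[None] = True  # marks the end of a complete label

def allowed_next_tokens(prefix):
    """Return the token ids that may follow the already generated prefix."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix does not continue any valid label
        node = node[tok]
    return [tok for tok in node if tok is not None]

print(allowed_next_tokens([]))      # -> [1001, 2001]
print(allowed_next_tokens([1001]))  # -> [1002, 1003]

During beam search, a lookup like this can be wrapped and passed to the prefix_allowed_tokens_fn argument of Hugging Face's model.generate to restrict the output to valid labels.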

Citation

If you use or refer to this repository in your research, please cite our paper:

BibTeX

@inproceedings{buchner2023prompt,
  title={Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation},
  author={Buchner, V. L. and Cao, L. and Kalo, J.-C. and von Ehrenheim, V.},
  booktitle={Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2024}
}

APA

Buchner, V. L., Cao, L., Kalo, J.-C., & von Ehrenheim, V. (2024). Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

MLA

Buchner, Valentin Leonhard, et al. "Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation." Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.