
Code for Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models



Hierarchical Prompting Taxonomy

Paper · Documentation · Leaderboard
A Universal Evaluation Framework for Large Language Models
Table of Contents
  1. News
  2. Introduction
  3. Demo
  4. Installation
  5. Usage
  6. Datasets and Models
  7. Benchmark Results
  8. References
  9. Contributing
  10. Cite Us

News

[06-18-24] HPT is published! Check out the paper at https://arxiv.org/abs/2406.12644.

↑ Back to Top ↑

Introduction

Hierarchical Prompting Taxonomy (HPT) is a universal evaluation framework for large language models (LLMs). It evaluates LLM performance across a variety of tasks and datasets, assigning each dataset an HP-Score relative to different models. HPT employs the Hierarchical Prompt Framework (HPF), which supports a wide range of tasks, including question-answering, reasoning, translation, and summarization. HPF provides a set of pre-defined prompting strategies, tailored to each task based on its complexity. For details, refer to the paper: https://arxiv.org/abs/2406.12644


Features of HPT

  • Universal Evaluation Framework: HPT supports a wide range of datasets and LLMs.
  • Hierarchical Prompt Framework: HPF is the set of prompting strategies that HPT employs, tailored to each task based on its complexity. HPF is available in two modes: manual and adaptive. In adaptive mode, an LLM (the prompt-selector) chooses the best prompting strategy for a given task.
  • HP-Score: HPT assigns an HP-Score to each dataset relative to different agents (including LLMs and humans). The HP-Score measures an agent's capability to perform the task associated with a dataset. A lower HP-Score indicates better performance on the dataset.

↑ Back to Top ↑

Demo

Refer to the examples directory for examples of using the framework on different datasets and models.

↑ Back to Top ↑

Installation

Cloning the Repository

To clone the repository, run the following command:

git clone https://github.com/devichand579/HPT.git

↑ Back to Top ↑

Usage

Linux

To get started on Linux, follow these steps:

  1. Activate your conda environment:

    conda activate hpt
  2. Navigate to the main codebase

    cd HPT/hierarchical_prompt
  3. Install the dependencies

    pip install -r requirements.txt
  4. Add your Hugging Face token

    • Create a .env file in the conda environment containing:
    HF_TOKEN = "your HF Token"
  5. To run either framework (manual or adaptive HPF), use the following command structure

    bash run.sh method model dataset [--thres num]
    • method

      • man
      • auto
    • model

      • llama3
      • phi3
      • gemma
      • mistral
    • dataset

      • boolq
      • csqa
      • iwslt
      • samsum
    • If the dataset is iwslt or samsum, add '--thres num'

    • num

      • 0.15
      • 0.20
      • or a higher threshold beyond those used in our experiments.
    • Example commands:

      bash run.sh man llama3 iwslt --thres 0.15
      bash run.sh auto phi3 boolq 
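
The argument rules above can be sketched as a small Python helper that validates the choices and assembles the command line. This is only an illustrative sketch: the function name `build_run_command` and the constant names are hypothetical and not part of the repository, and rejecting a threshold for boolq and csqa is an assumption of this sketch rather than documented behaviour of run.sh. The allowed values are copied from the lists above.

```python
import shlex
from typing import List, Optional

# Allowed values, copied from the option lists in this README.
METHODS = {"man", "auto"}
MODELS = {"llama3", "phi3", "gemma", "mistral"}
DATASETS = {"boolq", "csqa", "iwslt", "samsum"}
# Datasets for which the README says '--thres num' must be added.
THRESHOLD_DATASETS = {"iwslt", "samsum"}


def build_run_command(method: str, model: str, dataset: str,
                      thres: Optional[str] = None) -> List[str]:
    """Return the argv list for `bash run.sh method model dataset [--thres num]`.

    Hypothetical helper: it only builds the command, it does not run it.
    The threshold is passed as a string so it appears exactly as typed.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")

    cmd = ["bash", "run.sh", method, model, dataset]
    if dataset in THRESHOLD_DATASETS:
        if thres is None:
            raise ValueError(f"dataset {dataset!r} requires a threshold")
        cmd += ["--thres", thres]
    elif thres is not None:
        # Assumption of this sketch: reject a threshold where the README
        # does not ask for one, rather than silently dropping it.
        raise ValueError(f"dataset {dataset!r} does not take a threshold")
    return cmd


if __name__ == "__main__":
    # Print the two example commands from this README.
    print(shlex.join(build_run_command("man", "llama3", "iwslt", "0.15")))
    print(shlex.join(build_run_command("auto", "phi3", "boolq")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside HPT/hierarchical_prompt if you prefer to launch runs from Python.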

↑ Back to Top ↑

Datasets and Models

HPT currently supports the datasets, models, and prompt engineering methods listed below, which HPF employs. You are welcome to add more.

Datasets

  • Question-answering datasets:
    • BoolQ
  • Reasoning datasets:
    • CommonsenseQA
  • Translation datasets:
    • IWSLT-2017 en-fr
  • Summarization datasets:
    • SamSum

Models

  • Language models:
    • Llama 3 8B
    • Mistral 7B
    • Phi 3 3.8B
    • Gemma 7B

Prompt Engineering

  • Role Prompting [1]
  • Zero-shot Chain-of-Thought Prompting [2]
  • Three-shot Chain-of-Thought Prompting [3]
  • Least-to-Most Prompting [4]
  • Generated Knowledge Prompting [5]

↑ Back to Top ↑

Benchmark Results

The benchmark results for different datasets and models are available on the leaderboard.

↑ Back to Top ↑

References

  1. Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., & Zhou, X. (2023). Better Zero-Shot Reasoning with Role-Play Prompting. ArXiv, abs/2308.07702.
  2. Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. ArXiv, abs/2205.11916.
  3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E.H., Xia, F., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv, abs/2201.11903.
  4. Zhou, D., Scharli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., & Chi, E.H. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ArXiv, abs/2205.10625.
  5. Liu, J., Liu, A., Lu, X., Welleck, S., West, P., Le Bras, R., Choi, Y., & Hajishirzi, H. (2021). Generated Knowledge Prompting for Commonsense Reasoning. Annual Meeting of the Association for Computational Linguistics.

↑ Back to Top ↑

Contributing

This project aims to build open-source evaluation frameworks for assessing LLMs and other agents. This project welcomes contributions and suggestions. Please see the details on how to contribute.

If you are new to GitHub, here is a detailed guide on getting involved with development on GitHub.

↑ Back to Top ↑

Cite Us

If you find our work useful, please cite us!

@misc{budagam2024hierarchical,
      title={Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models}, 
      author={Devichand Budagam and Sankalp KJ and Ashutosh Kumar and Vinija Jain and Aman Chadha},
      year={2024},
      eprint={2406.12644},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

↑ Back to Top ↑