A collection of Italian benchmarks for LLM evaluation

ITA-Bench 🤖🇮🇹

This is the Sapienza NLP GitHub repository for ITA-Bench (Italian Benchmarks), a benchmark suite for the evaluation of Large Language Models (LLMs) on the Italian language. ITA-Bench is designed to evaluate the performance of LLMs on a variety of tasks, including question answering, commonsense reasoning, mathematical capabilities, named entity recognition, reading comprehension, and others.

Datasets included in ITA-Bench

ITA-Bench includes a variety of datasets for evaluating LLMs on Italian. These datasets are collected from various sources and cover a wide range of tasks.

Note

All the datasets are available on 🤗 Hugging Face Datasets!

The datasets are divided into two main categories:

  1. 🌐 Translations: These datasets are Italian translations of existing English datasets. They evaluate LLMs on tasks that have already been studied extensively in English.

    • Pros: Translations allow for a direct comparison between models trained on different languages
    • Cons: Translations may introduce biases or errors that are not present in the original dataset
  2. 🔨 Adaptations: These datasets are converted from existing Italian datasets into a format that can be used to evaluate LLMs. They are used to evaluate the performance of LLMs on tasks that may be more specific to the Italian language.

    • Pros: The original datasets are already in Italian, so there is no need for translation that may introduce errors
    • Cons: These datasets were not originally designed for evaluating LLMs and the adaptation process may introduce biases or errors

ITA-Bench currently includes the following datasets:

| Dataset | Task | Type | Description |
|---|---|---|---|
| ARC-Challenge | QA | 🌐 Translation | Commonsense and scientific knowledge |
| ARC-Easy | QA | 🌐 Translation | Commonsense and scientific knowledge |
| BoolQ | QA + passage | 🌐 Translation | Boolean questions |
| GSM8K | QA | 🌐 Translation | Simple math word problems |
| Hellaswag | Completion | 🌐 Translation | Commonsense reasoning |
| MMLU | QA | 🌐 Translation | Advanced questions on 57 subjects |
| PIQA | QA | 🌐 Translation | Physical interactions reasoning |
| SciQ | QA + passage | 🌐 Translation | Scientific reading comprehension |
| TruthfulQA | QA | 🌐 Translation | Questions on Web misconceptions |
| WinoGrande | Completion | 🌐 Translation | Commonsense reasoning |
| AMI | QA | 🔨 Adaptation | Misogyny detection |
| Discotex | Completion | 🔨 Adaptation | Commonsense and world knowledge |
| Ghigliottinai | QA | 🔨 Adaptation | Guess the missing concept |
| NERMUD | NER | 🔨 Adaptation | Named entity recognition |
| PreLearn | QA | 🔨 Adaptation | Reasoning about concept relationships |
| PreTens | QA | 🔨 Adaptation | Reasoning about concept relationships |
| QuandHO | QA | 🔨 Adaptation | Reading comprehension |
| WiC | QA | 🔨 Adaptation | Word sense disambiguation |

How to use ITA-Bench

ITA-Bench is designed to be easy to use and flexible. You can evaluate any LLM on the included datasets using the lm_eval command-line tool, which lets you customize the evaluation by choosing the model, the number of few-shot examples, and the tasks to run.

Before you start

We always recommend using a virtual environment to manage your dependencies, e.g., using venv or conda. To create a new environment with conda, you can run:

# Create a new environment with Conda
conda create -n ita-bench python=3.10

# Always remember to activate the environment before running any command!
conda activate ita-bench

Note

You can read more about managing environments with Conda in the official documentation.
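As a quick sanity check before installing anything, the Python sketch below reports whether the interpreter is running inside a dedicated environment. It relies on two conventions: Conda sets the CONDA_DEFAULT_ENV variable when an environment is active, and venv makes sys.prefix differ from sys.base_prefix. The function name active_environment is illustrative and not part of ITA-Bench, and treating Conda's "base" environment as "no dedicated environment" is a choice made for this example.

```python
import os
import sys


def active_environment(environ=None, prefix=None, base_prefix=None):
    """Return a short label for the active environment, or None if none is detected.

    The parameters exist only to make the check easy to test; by default the
    function inspects the current process.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix

    # Conda exports the name of the active environment in CONDA_DEFAULT_ENV.
    conda_env = environ.get("CONDA_DEFAULT_ENV")
    if conda_env and conda_env != "base":
        return f"conda:{conda_env}"

    # Inside a venv, sys.prefix points at the venv, not at the base installation.
    if prefix != base_prefix:
        return f"venv:{prefix}"

    return None


if __name__ == "__main__":
    env = active_environment()
    print(env or "No dedicated environment detected; consider activating one first.")
```

If the script reports no environment, activate one (for example with conda activate ita-bench) before installing the requirements.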

Evaluating an LLM on ITA-Bench

To use ITA-Bench, you can follow these steps:

  1. Clone this repository:
git clone git@github.com:SapienzaNLP/ita-bench.git
cd ita-bench
  2. Install the required packages:
pip install -r requirements.txt
  3. Run the evaluation script:
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --num_fewshot 0 \
  --log_samples \
  --output_path outputs/ \
  --tasks itabench_trans_it-it,itabench_adapt_cloze,itabench_adapt_mc \
  --include tasks

This command will evaluate meta-llama/Meta-Llama-3.1-8B-Instruct on all the benchmarks in our suite. The results will be saved in the outputs/ directory.
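Once a run has finished, the scores can be gathered with a few lines of Python. The sketch below is not part of ITA-Bench. It assumes that lm_eval writes one or more JSON files whose names start with "results" somewhere under the output directory, each with a top-level "results" object mapping task names to metrics. This matches recent versions of the tool, but the layout may differ in yours. The helper name collect_results is hypothetical.

```python
import json
from pathlib import Path


def collect_results(output_dir):
    """Map each task name to its numeric metrics, across all results files found."""
    collected = {}
    # Files are visited in sorted path order, so later files override earlier ones.
    for path in sorted(Path(output_dir).rglob("results*.json")):
        with path.open(encoding="utf-8") as handle:
            payload = json.load(handle)
        for task, metrics in payload.get("results", {}).items():
            # Keep only numeric entries; aliases and "N/A" placeholders are skipped.
            numeric = {
                name: value
                for name, value in metrics.items()
                if isinstance(value, (int, float)) and not isinstance(value, bool)
            }
            collected.setdefault(task, {}).update(numeric)
    return collected


if __name__ == "__main__":
    for task, metrics in sorted(collect_results("outputs").items()):
        for name, value in sorted(metrics.items()):
            print(f"{task}\t{name}\t{value:.4f}")
```

Run as a script from the repository root, it prints one tab-separated line per task and metric found under outputs/.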

Running the evaluation on multiple GPUs

If you have multiple GPUs available, you can use the accelerate command to run the evaluation on multiple GPUs:

accelerate launch -m lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --num_fewshot 0 \
  --log_samples \
  --output_path outputs/ \
  --tasks itabench_trans_it-it,itabench_adapt_cloze,itabench_adapt_mc \
  --include tasks

Note

You can read more about accelerate in the official documentation.
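When scripting several runs (for instance, different few-shot settings), it can help to assemble the command programmatically. The sketch below only uses the flags shown in the commands above. The function name build_lm_eval_command and its parameters are hypothetical and not part of ITA-Bench or lm_eval. The function builds the argument list without executing anything.

```python
import shlex


def build_lm_eval_command(model, tasks, num_fewshot=0, output_path="outputs/",
                          include="tasks", multi_gpu=False):
    """Assemble an lm_eval invocation as an argument list (nothing is executed)."""
    # Multi-GPU runs go through `accelerate launch -m lm_eval`, as shown above.
    launcher = ["accelerate", "launch", "-m", "lm_eval"] if multi_gpu else ["lm_eval"]
    command = launcher + [
        "--model", "hf",
        "--model_args", f"pretrained={model},dtype=bfloat16",
        "--num_fewshot", str(num_fewshot),
        "--log_samples",
        "--output_path", output_path,
        "--tasks", ",".join(tasks),
    ]
    # Pass include=None to omit the flag that points at the local task definitions.
    if include:
        command += ["--include", include]
    return command


if __name__ == "__main__":
    command = build_lm_eval_command(
        "meta-llama/Meta-Llama-3.1-8B-Instruct",
        ["itabench_trans_it-it", "itabench_adapt_cloze", "itabench_adapt_mc"],
        multi_gpu=True,
    )
    # Print a shell-ready string; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Returning a list rather than a single string avoids quoting problems if the command is later passed to subprocess.run.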

Contributing

We welcome contributions to ITA-Bench!

License

The code in this repository is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.

However, the datasets included in ITA-Bench may have different licenses. Please refer to the original datasets for more information about their licenses.

Publication and citation

Coming soon: a paper on our benchmark suite is under review. Stay tuned for updates!

Acknowledgements

  • Future AI Research for supporting this work.
  • CINECA for providing computational resources.
  • Unbabel for building Tower-LLM.
  • Thanks to the authors of the original datasets for making them available.
  • Thanks to all the Multilingual Natural Language Processing course students of the Master's of Engineering in Computer Science (Dipartimento di Ingegneria Informatica, Automatica e Gestionale, DIAG) of Sapienza University of Rome for their help in adapting some datasets.