Logical and Abstract Reasoning

Repository for the evaluation of Large Language Models on logical and abstract reasoning tasks

Installation

To install the repository, use the following command:

git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git

To install the dependencies in a virtual environment, use the following:

cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt

You may need to install transformers directly from its GitHub repository:

pip install git+https://github.com/huggingface/transformers

Use

Evaluation

To evaluate a model in the repository, use the following command:

python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>

You can choose the model to evaluate by changing the <model_config.yaml> file, and the dataset to evaluate it on by changing the <data_config.yaml> file. Any additional arguments can be passed as keyword arguments (e.g. a private API key for GPT models).
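For example, an evaluation run might look like the following (the config file names and the api_key keyword are illustrative, not taken from the repository; check the config/model/ and config/data/ folders for the options that actually exist):

python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <your_api_key>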

By default, all results are saved to a CSV file in the logs/ folder. You can re-compute the metrics of an evaluation run from this file by running the following:

python src/evaluate/evaluator.py logs/<results_file.csv>
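If you prefer to inspect the raw results yourself instead of re-running the evaluator, a minimal Python sketch along these lines works; the "correct" column name is an assumption, as the exact CSV schema depends on the model and dataset configuration:

import sys
import pandas as pd

# Load an evaluation log produced by an evaluation run.
df = pd.read_csv(sys.argv[1])
print(f"{len(df)} evaluated samples; columns: {list(df.columns)}")

# Hypothetical column name: adapt it to the actual schema of your log file.
if "correct" in df.columns:
    print(f"Accuracy: {df['correct'].mean():.4f}")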

Fine-tuning

To fine-tune a model on a given dataset, run the following:

python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>

The configuration files work the same way as for evaluation. The <model_config.yaml> file contains additional configuration for training. Logs are saved in fine-tuning-output/ and model weights are saved in fine-tuning-saves/.

Currently, only HuggingFace models can be fine-tuned.
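For example, a fine-tuning run might look like the following (the model and data config names are illustrative; pick a HuggingFace model config and one of the trainer configs actually present in config/trainer/):

python run_finetuning.py config/model/llama-7b.yaml config/data/logiqa.yaml config/trainer/<trainer_config.yaml>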

LLaMA-based model instruction fine-tuning

We use the LLaMA-based model fine-tuning from the Stanford Alpaca training script. If you want to perform instruction fine-tuning on a LLaMA-based model, you can do so by following this link.

Models

| Inference Type | Model | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper, project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper, project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper, code | Open-source large language model |
| | LLaMA2 | - | Reading Comprehension | paper, code | Open-source large language model |
| | TinyLLaMA | - | Reading Comprehension | paper, code | Open-source language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project, code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper, project | Used via API for prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper, project | Used via API for prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7b |

Datasets & Benchmarks

| Inference Type | Dataset | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | ACRE | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract reasoning task |
| | RAVEN | - | Abstract Reasoning | paper, project | Text version of a visual abstract reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |

Acknowledgement

Our proposed dataset logiqa-logical-reasoning-plus has been merged into OpenAI/Evals.