
Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer

This repo contains the code for the following paper:

Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
Bowen Tan, Yun Zhu, Lijuan Liu, Eric Xing, Zhiting Hu, Jindong Chen
NeurIPS 2023
[arXiv] [Model Card (btan2/cappy-large)]

Getting Started

  • Cappy is a pretrained small scorer designed to enhance the performance and efficiency of multi-task LLMs.
  • Cappy takes an instruction and a candidate response as input, and produces a score between 0 and 1 that estimates the correctness of the response with respect to the instruction.
  • With merely 360 million parameters, Cappy either functions independently on classification tasks or serves as an auxiliary component of LLMs, boosting their performance.
  • Cappy also enables efficient integration of downstream supervision without finetuning the LLM or even accessing its parameters.
  • Furthermore, Cappy cooperates flexibly with other LLM adaptations, including finetuning, in-context learning, and prompt tuning, offering additional performance enhancement.

Now, Cappy can be loaded with transformers either as a Jax/Flax model or a PyTorch model.

Jax/Flax

from transformers import AutoTokenizer, FlaxAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = FlaxAutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

instruction = """
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

inputs = tokenizer([(instruction, response), ], return_tensors='np')  # Flax models expect NumPy arrays, not PyTorch tensors
score = cappy(**inputs).logits[0][0].item()

PyTorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = AutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

instruction = """
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

inputs = tokenizer([(instruction, response), ], return_tensors='pt')
score = cappy(**inputs).logits[0][0].item()
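Because the output is a single scalar, Cappy can also rank multiple candidate responses for the same instruction and keep the best one, which is the essence of its boosting usage. Below is a minimal sketch reusing the tokenizer, model, and instruction from the PyTorch example above (the candidate list is made up for illustration):

import torch

candidates = ['World', 'Sports', 'Business', 'Sci/Tech']  # hypothetical candidates
with torch.no_grad():
    # Score every (instruction, candidate) pair in one batch.
    inputs = tokenizer([(instruction, c) for c in candidates],
                       padding=True, return_tensors='pt')
    scores = cappy(**inputs).logits[:, 0]
best = candidates[scores.argmax().item()]  # keep the highest-scoring response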

Below are the scripts to reproduce the experiments in the paper.

Requirements

Cappy's pretraining and finetuning are both based on Redco, a lightweight tool automating distributed training on both GPUs and TPUs.

To install Redco,

pip install redco==0.4.13

The Jax version sometimes needs to be adjusted for your device and environment. Here are some instructions.
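For example, on a machine with CUDA 12, the adjustment might look like the following (the exact extra and version depend on your hardware; consult the Jax installation guide):

pip install -U "jax[cuda12]"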

To install other requirements,

pip install -r requirements.txt

Pretraining Cappy

Cappy's pretraining uses the code from this example in Redco. We will release Cappy's pretraining data soon.
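Per the paper, each pretraining example pairs an instruction and a candidate response with a regression label given by the response's Rouge-L against the ground-truth answer. A minimal sketch of how such a label can be computed (using the rouge_score package; the dict fields are illustrative, not the actual data schema):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
gold_response = 'Business'   # ground-truth answer for some instruction
candidate = 'World'          # a sampled candidate response
# Rouge-L F1 in [0, 1] serves as the correctness label for (instruction, candidate).
label = scorer.score(gold_response, candidate)['rougeL'].fmeasure
example = {'instruction': 'What label best describes this news article? ...',
           'response': candidate, 'score': label}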

Evaluating Cappy on PromptSource (zero-shot)

Download Test Data

Following the setting of the OPT-IML paper (Section 5.2), we conduct zero-shot evaluation on 11 held-out classification tasks from PromptSource.

bash scripts/download_promptsource_test_data.sh

Running Cappy

python cappy_promptsource.py --model_name_or_path btan2/cappy-large

Results

| Task | OPT 30B | OPT-IML 30B | OPT 175B | OPT-IML 175B | T0 11B | Cappy (ours, 0.36B) |
|---|---|---|---|---|---|---|
| ANLI R1 | 33.7 | 37.1 | 34.1 | 42.2 | 42.1 | 34.3 |
| ANLI R2 | 34.1 | 35.4 | 34.1 | 38.5 | 37.9 | 33.9 |
| ANLI R3 | 34.7 | 36.6 | 34.7 | 39.6 | 39.7 | 34.7 |
| CB | 24.6 | 43.2 | 38.9 | 56.4 | 58.5 | 59.4 |
| RTE | 56.4 | 67.8 | 54.0 | 73.4 | 80.2 | 71.9 |
| StoryCloze | 55.5 | 90.7 | 57.0 | 95.0 | 96.7 | 93.7 |
| WSC | 43.5 | 58.2 | 51.0 | 59.2 | 58.6 | 63.8 |
| WiC | 50.8 | 54.7 | 49.7 | 53.6 | 56.0 | 51.9 |
| Winogrande | 50.2 | 53.4 | 50.1 | 56.6 | 62.5 | 51.7 |
| WinoGender | 54.9 | 64.6 | 53.9 | 72.7 | 83.8 | 68.9 |
| Crows-Pairs | 85.5 | 22.3 | 85.5 | 34.4 | 24.0 | 57.8 |
| Average | 47.6 | 51.3 | 49.3 | 56.5 | 58.2 | 56.6 |

Baseline results come from the OPT-IML paper (Section 5.2).

Boosting FLAN-T5 with Cappy on Big-Bench Tasks

Getting Big-Bench Tasks

We take all 45 generative tasks from Big-Bench in our experiment. The command below processes the tasks into .jsonl format.

python scripts/get_bigbench_data.py

The processed datasets can be found in ./bigbench_data, where ./bigbench_data/subset_names.json records all the task names.
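A quick way to inspect the processed data (the per-task file path and line format below are assumptions for illustration, not necessarily the repo's actual layout):

import json

# subset_names.json records all 45 task names (stated above).
with open('./bigbench_data/subset_names.json') as f:
    subset_names = json.load(f)

# Hypothetical: read one processed task, assuming one JSON object per line.
with open(f'./bigbench_data/{subset_names[0]}.jsonl') as f:
    examples = [json.loads(line) for line in f]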

Getting FLAN-T5 Outputs

We collect generated outputs (as well as log-likelihoods on evaluation sets) from FLAN-T5 models (from -small to -xxl). They can be downloaded with

bash scripts/download_bigbench_flan_gens.sh

If you want to generate the outputs yourself and/or adjust the generation settings, we provide generation code below that supports distributed inference across multiple GPUs (in case the model is too large to fit on a single GPU, e.g., FLAN-T5-XXL (11B)).

python scripts/bigbench_flan_generate.py \
  --model_name_or_path google/flan-t5-xl \
  --n_model_shards 4

where --n_model_shards is the number of shards to split the model into (usually the number of GPUs on your machine when it is more than one).

Adapting Cappy to boost FLAN-T5

XLA_PYTHON_CLIENT_MEM_FRACTION=.95 python cappy_bigbench.py \
  --model_name_or_path btan2/cappy-large \
  --bigbench_subset_name auto_categorization \
  --bigbench_gen_model flan-t5-xxl \
  --train_size 102400

  • XLA_PYTHON_CLIENT_MEM_FRACTION=.95: adjusts the fraction of GPU memory pre-allocated to Jax (useful when GPU memory is exceeded); see here for more details.
  • --bigbench_subset_name: the name of the Big-Bench subset (see ./bigbench_data/subset_names.json for the full list).
  • --bigbench_gen_model: the FLAN model to be boosted.
  • --train_size: the target number of examples to construct for finetuning Cappy on the task (FLAN outputs are collected, then truncated or repeated to this size; see the sketch below).

See def main(...) in cappy_bigbench.py for all the arguments.
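As a purely hypothetical illustration of the truncate-or-repeat step behind --train_size (not the repo's actual implementation):

import itertools

def resize_to_target(examples, target_size):
    # Cycle through the collected examples and stop at target_size,
    # so small collections get repeated and large ones get truncated.
    return list(itertools.islice(itertools.cycle(examples), target_size))

collected = [{'instruction': 'i', 'response': 'r', 'score': 0.7}] * 5000
train_set = resize_to_target(collected, 102400)
assert len(train_set) == 102400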

Each sub-task takes roughly 40 minutes on a single A10G GPU. The result is logged to ./bigbench_cappy_results/{flan_model}/{subset_name}.json.

To run all the Big-Bench subsets at once,

python scripts/run_cappy_bigbench.py --cuda_idx 0

Results

To present the baseline results,

python scripts/present_bigbench_baselines.py

To present Cappy's results on all 45 Big-Bench subtasks,

python scripts/present_cappy_bigbench_results.py --gen_model_name flan-t5-xxl

The numbers reported in the paper were produced on TPU machines. Here we provide our reproduced results on A10G GPUs in ./bigbench_cappy_results; the gap between the two is slight (ΔRouge-L <= 0.8).

| Method | flan-t5-small | flan-t5-base | flan-t5-large | flan-t5-xl | flan-t5-xxl |
|---|---|---|---|---|---|
| Beam Search (beam=4) | 16.4025 | 19.8594 | 23.4802 | 26.1177 | 29.6608 |
| Sampling | 11.4317 | 15.7909 | 19.6248 | 23.2191 | 25.7273 |
| Temperature (t=0.9) | 12.0126 | 17.0571 | 20.0481 | 24.2702 | 27.0985 |
| Top-k (k=40) | 11.5157 | 15.7481 | 19.7634 | 22.6692 | 25.8226 |
| Nucleus (p=0.95) | 11.9171 | 16.6174 | 20.1986 | 24.1654 | 26.9036 |
| Self-Score (sum) | 15.0806 | 20.711 | 24.1224 | 28.4665 | 32.0156 |
| Self-Score (mean) | 16.4223 | 20.1317 | 23.7828 | 26.7694 | 30.246 |
| Cappy (ours) | 23.6543 | 27.6178 | 30.3802 | 33.2775 | 37.1678 |

Acknowledgement

Cappy is Mario's ally throughout Super Mario Odyssey and assists him in various ways. We thank Nintendo for the nice game!