UnifiedQA: Crossing Format Boundaries With a Single QA System

UnifiedQA

Using the models in PyTorch/HuggingFace

You can very easily load the models with Transformers >=3.1, instead of downloading them manually. The models are listed on this page.

Here is an example:

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "allenai/unifiedqa-t5-small" # you can specify the model size here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    return tokenizer.batch_decode(res, skip_special_tokens=True)

For instance, here is how you can use it to answer a multiple-choice question:

run_model("which is best conductor? \\n (a) iron (b) feather")

which gives: ['iron']

run_model("scott filled a tray with juice and put it in a freezer. the next day, scott opened the freezer. how did the juice most likely change? \\n (a) it condensed. (b) it evaporated. (c) it became a gas. (d) it became a solid.")

which produces: ['it condensed.'].

Note that you can also pass in the arguments for text generation to the run_model(.) function:

run_model("which is best conductor? \\n (a) iron (b) feather (c) wood (d) plastic",
         temperature=0.9, num_return_sequences=4, num_beams=20)

Feeding data into UnifiedQA

Datasets should be converted into a text-in/text-out format.

  • Question always comes first.
  • We use \n separators between different parts of the input. This keeps the encoding human-like without making it overly specific to one format. Note that this separator is not the newline character (which it looks suspiciously like), but rather the two characters backslash and n.
  • Make sure the whole input is correctly pre-processed (e.g., lower-cased).
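
The rules above can be sketched as a small helper. This is a minimal illustration, not the script used for the paper: the function name encode_unifiedqa_input and the constant SEP are hypothetical, the lower-case (a), (b), ... option labels follow the run_model examples earlier, and the question, choices, context ordering is inferred from the examples in this README.

```python
from typing import Optional, Sequence

# Literal backslash followed by "n", padded with spaces. NOT a newline character.
SEP = " \\n "


def encode_unifiedqa_input(
    question: str,
    choices: Optional[Sequence[str]] = None,
    context: Optional[str] = None,
) -> str:
    """Hypothetical helper: build a UnifiedQA-style input string."""
    # The question always comes first.
    parts = [question.strip()]
    if choices:
        # Label the options (a), (b), (c), ... and keep them in one segment.
        labeled = " ".join(
            f"({chr(ord('a') + i)}) {choice.strip()}"
            for i, choice in enumerate(choices)
        )
        parts.append(labeled)
    if context:
        parts.append(context.strip())
    # Join the segments with the separator and lower-case the whole input.
    return SEP.join(parts).lower()


print(encode_unifiedqa_input("Which is best conductor?", ["iron", "feather"]))
```

The printed string contains a visible backslash-n between the question and the options, which is the same input passed to run_model in the multiple-choice example above.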

Here are several examples:

Dataset: SQuAD 1.1 (Extractive QA)
Encoded Input: At what speed did the turbine operate? \n (Nikola_Tesla) On his 50th birthday in 1906, Tesla demonstrated his 200 horsepower (150 kilowatts) 16,000 rpm bladeless turbine. ...
Encoded Output: 16,000 rpm

Dataset: NarrativeQA (Abstractive QA)
Encoded Input: What does a drink from narcissus's spring cause the drinker to do? \n Mercury has awakened Echo, who weeps for Narcissus, and states that a drink from Narcissus's spring causes the drinkers to ''Grow dotingly enamored of themselves.'' ...
Encoded Output: fall in love with themselves

Dataset: ARC-challenge (Multiple-choice QA)
Encoded Input: What does photosynthesis produce that helps plants grow? \n (A) water (B) oxygen (C) protein (D) sugar
Encoded Output: sugar

Dataset: MCTest (Multiple-choice QA)
Encoded Input: Who was Billy? \n (A) The skinny kid (B) A teacher (C) A little kid (D) The big kid \n Billy was like a king on the school yard. A king without a queen. He was the biggest kid in our grade, so he made all the rules during recess. ...
Encoded Output: The big kid

Dataset: BoolQ (Yes-no QA)
Encoded Input: Was America the first country to have a president? \n (President) The first usage of the word president to denote the highest official in a government was during the Commonwealth of England ...
Encoded Output: no

If you want to see how this encoding is done on our datasets, check out this script.

The datasets/tasks used in the experiments

While the datasets we used are all public, it can be a bit time-consuming to convert them all into text-to-text format. We're releasing the already-processed text-to-text datasets based on the encoding used in this work. Files are included in this Google Cloud bucket. Here is the script we used to convert each dataset into text-in/text-out format.
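
If you are preparing your own data, a sketch like the following shows one way to store encoded (input, output) pairs, one example per line. The tab-separated layout and the helper name to_tsv_line are assumptions made for illustration; check the released files and the conversion script for the exact layout used in this work.

```python
def to_tsv_line(encoded_input: str, encoded_output: str) -> str:
    """Hypothetical helper: serialize one text-in/text-out pair as a TSV line."""
    # Real tabs or newlines inside a field would break the one-example-per-line
    # layout, so collapse every run of whitespace into a single space.
    # The literal backslash-n separator is not whitespace and is left untouched.
    fields = [" ".join(field.split()) for field in (encoded_input, encoded_output)]
    return "\t".join(fields)


examples = [
    (
        "what does photosynthesis produce that helps plants grow? "
        "\\n (a) water (b) oxygen (c) protein (d) sugar",
        "sugar",
    ),
]

for encoded_input, encoded_output in examples:
    print(to_tsv_line(encoded_input, encoded_output))
```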

Prediction files

We're making the predictions of many of our models available. [To be updated]

Released Model Checkpoints

If you intend to create a QA system, you can use our QA-specialized models for your purpose:

T5 models

Note: In the experiments reported in our paper, we always used the checkpoint closest to 100k steps (this usually corresponds to checkpoint 1100500).

You can use these in two ways:

  • If you don't have any training data, you can use them directly for evaluation.
  • If you have training data, you can use them as your initial models and fine-tune them.

For more details see the T5 repository.

BART models

The BART models can be downloaded from this link (3.6G). For detailed instructions on running the code (training/finetuning/testing), please see here. The uncased models usually gave us better and more robust results.

FAQ

I am not getting the expected results. A common issue with using UnifiedQA is forgetting to use the separator (\n) when encoding your inputs. See the earlier section where we describe how to encode the inputs.
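
As a quick sanity check, a sketch like the one below replaces real newline characters with the literal backslash-n separator before the string is passed to the model. The helper name normalize_separator is hypothetical and not part of this repository.

```python
import re


def normalize_separator(input_string: str) -> str:
    """Hypothetical helper: turn real newlines into the literal backslash-n separator."""
    # Match each run of whitespace that contains at least one real newline and
    # replace it with: space, backslash, "n", space.
    # A function is used as the replacement so the backslash is not
    # reinterpreted as a regex escape sequence.
    return re.sub(r"\s*\n\s*", lambda match: " \\n ", input_string)


raw = "which is best conductor?\n(a) iron (b) feather"
print(normalize_separator(raw))
```

Inputs that already use the literal separator contain no real newline, so they pass through unchanged.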

Help! I am getting the following error! See this discussion if you run into this error:

ValueError: Configurable 'make_layer_stack' doesn't have a parameter named 'use_universal_transformer'.
  In file "gs://danielk-files/t5-models/union_mixture/11B/operative_config.gin", line 83

How to cite

If you extend or use this work, please cite the paper:

@article{2020unifiedqa,
    title={UnifiedQA: Crossing Format Boundaries With a Single QA System},
    author={D. Khashabi and S. Min and T. Khot and A. Sabharwal and O. Tafjord and P. Clark and H. Hajishirzi},
    journal={EMNLP - findings},
    year={2020}
}