ModularQA is a QA system that answers complex multi-hop and discrete reasoning questions by decomposing them into sub-questions answerable by two sub-models: a neural factoid single-span QA model and a symbolic calculator. These sub-questions, together with the answers returned by the sub-models, provide a natural language explanation of the model's reasoning. The system is designed and trained using the Text Modular Networks framework, where decompositions are generated in the language of the sub-models without requiring annotated decompositions. For more details, refer to the paper.
Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models
Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish Sabharwal
NAACL 2021
Bibtex:
@inproceedings{khot2021text,
title={Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models},
author={Tushar Khot and Daniel Khashabi and Kyle Richardson and Peter Clark and Ashish Sabharwal},
booktitle={NAACL},
year={2021}
}
Demo: https://modularqa-demo.apps.allenai.org/ (note that responses might be slow)
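For intuition, the overall decomposition loop can be pictured roughly as in the sketch below. This is illustrative pseudocode only, not code from the released repository; the helper names (next_gen, squad_qa, calculator) are hypothetical stand-ins for the NextGen model and the two sub-models.

def answer_complex_question(complex_question, max_steps=5):
    # Illustrative sketch of the decomposition loop; helper names are hypothetical.
    chain = []  # (sub-question, answer) pairs produced so far
    for _ in range(max_steps):
        # NextGen proposes the next sub-question, tagged with the sub-model to use,
        # e.g. "(squad) What magazine did Hester work for?" or "(math) not(12.4)"
        sub_question = next_gen(complex_question, chain)
        if sub_question == "[EOQ]":  # end-of-questions marker: the chain is complete
            break
        if sub_question.startswith("(squad)"):
            answer = squad_qa(sub_question)    # neural single-span QA sub-model
        else:
            answer = calculator(sub_question)  # symbolic calculator sub-model
        chain.append((sub_question, answer))
    # The answer to the last sub-question is the final answer
    return chain[-1][1] if chain else None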
We used the following subsets of HotpotQA and DROP to train and evaluate our models:
- NextGen Training Data: The decomposition chains generated from these DROP+HotpotQA subsets. These chains were used to train the NextGen model.
- Chains Scorer Training Data: The chains generated by running inference with our NextGen model, with associated labels: 1 indicates that the final answer produced by the chain is correct (F1 > 0.2) and 0 indicates that it is incorrect.
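As a small illustration of how the 0/1 labels for the Chains Scorer data are derived, the sketch below thresholds the answer F1; compute_f1 is a stand-in for the standard HotpotQA/DROP answer-F1 metric and is not a function from this repository.

def label_chain(predicted_answer, gold_answer, threshold=0.2):
    # Label 1 if the chain's final answer is considered correct (F1 > 0.2), else 0
    return 1 if compute_f1(predicted_answer, gold_answer) > threshold else 0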
If you want the predictions of the system without having to run the code, we provide them here:
We also provide the trained models used in our system; a usage sketch follows the list below.
- NextGen Model: A BART-Large model trained to produce the next sub-question given the complex question and previous question-answer pairs. Sample input-output:
Input:
QC: When did the magazine Wallace Hester work for run? QI: (squad) What magazine did Hester work for? A: "Vanity Fair". QS:
Output:
(squad) When did the second Vanity Fair run?
- Chains Scorer Model: A RoBERTa-Large model trained to predict whether the final answer produced by an inference chain is correct (captured by the score for the label 1). Sample input:
QC: How many percent of jobs were not in wholesale? QI: (squad)What percent of jobs are in wholesale? A: 12.4 QI: (math)not(12.4) A: 87.6 QS: [EOQ]
- SQuAD QA Model: A RoBERTa-Large QA model trained on SQuAD 2.0.
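The sketch below shows one way these three models could be loaded and queried with the current HuggingFace Transformers API. It follows the input formats in the samples above and the model directory names used later in this README, but the released fork is based on an older Transformers version, so the exact calls and pre/post-processing in our code differ.

from transformers import (AutoModelForSeq2SeqLM, AutoModelForSequenceClassification,
                          AutoTokenizer, pipeline)
import torch

# NextGen: BART-Large seq2seq model that emits the next sub-question
nextgen_tok = AutoTokenizer.from_pretrained("nexgen_model/")
nextgen = AutoModelForSeq2SeqLM.from_pretrained("nexgen_model/")

def next_subquestion(complex_question, qa_pairs):
    # Build the "QC: ... QI: ... A: ... QS:" input shown in the sample above
    text = "QC: " + complex_question
    for sub_q, ans in qa_pairs:
        text += " QI: " + sub_q + " A: " + ans
    text += " QS:"
    input_ids = nextgen_tok(text, return_tensors="pt").input_ids
    output_ids = nextgen.generate(input_ids, max_length=64)
    return nextgen_tok.decode(output_ids[0], skip_special_tokens=True)

# Chains Scorer: RoBERTa-Large classifier; P(label=1) scores the chain as correct
scorer_tok = AutoTokenizer.from_pretrained("chain_scorer/")
scorer = AutoModelForSequenceClassification.from_pretrained("chain_scorer/")

def score_chain(chain_text):
    inputs = scorer_tok(chain_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = scorer(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of label 1

# SQuAD QA: RoBERTa-Large extractive QA sub-model
squad_qa = pipeline("question-answering", model="qa_model/", tokenizer="qa_model/")

At inference time, candidate chains produced by the NextGen model can then be ranked with score_chain and the final answer taken from the highest-scoring chain.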
We use a fork of the HuggingFace Transformers codebase: https://github.com/tusharkhot/transformers/tree/modularqav2. This fork is based on an older version of Transformers.
To run inference, follow these steps:
- Clone the GitHub repo from https://github.com/tusharkhot/transformers and check out the modularqav2 branch.
git clone https://github.com/tusharkhot/transformers.git modularqa_transformers
cd modularqa_transformers
git checkout modularqav2
export PYTHONPATH=src
- Download the HotpotQA and DROP subsets from above and unzip them here:
wget https://ai2-public-datasets.s3.amazonaws.com/modularqa/hotpot_subset.zip
unzip hotpot_subset.zip
wget https://ai2-public-datasets.s3.amazonaws.com/modularqa/drop_subset.zip
unzip drop_subset.zip
- Save (or download from above) the trained models:
  - NextGen model to nexgen_model/
  - Chain scorer to chain_scorer/
  - SQuAD 2.0 QA model to qa_model/
If the models are downloaded to a different path, change the paths in the config files below.
- To run inference on the HotpotQA dev set, run:
python -u -m modularqa.inference.configurable_inference \
--input hotpot_subset/dev.json \
--output predictions_hotpot_dev.json \
--config modularqa_configs/hotpot_dev_config.json --reader hotpot
To evaluate on the held-out test set, change dev to test in the command above.
To evaluate on the DROP dev set, run:
python -u -m modularqa.inference.configurable_inference \
--input drop_subset/dev.json \
--output predictions_drop_dev.json \
--config modularqa_configs/drop_dev_config.json --reader drop
Similarly, replace dev with test to evaluate on the held-out test set.
NOTE: These inference steps are slow but highly parallelizable. If you want, you can directly use our predictions, available above.
- Compute the metrics using the evaluation scripts released with the HotpotQA and DROP datasets. For example:
- HotpotQA
python -m modularqa.evals.evaluate_hotpot_squad_format \
predictions_hotpot_dev.json hotpot_subset/dev.json
- DROP
python -m modularqa.evals.drop_eval \
--gold_path drop_subset/dev.json \
--prediction_path predictions_drop_dev.json