ModularQA is a QA system that answers complex multi-hop and discrete reasoning questions by decomposing them into sub-questions answerable by two sub-models: a neural factoid single-span QA model and a symbolic calculator. These sub-questions, together with the answers returned by the sub-models, provide a natural language explanation of the model's reasoning. The system is designed and trained using the Text Modular Networks framework, where decompositions are generated in the language of the sub-models without requiring annotated decompositions. For more details, refer to the paper.
Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models
Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish Sabharwal
NAACL 2021
Bibtex:
@inproceedings{khot2021text,
title={Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models},
author={Tushar Khot and Daniel Khashabi and Kyle Richardson and Peter Clark and Ashish Sabharwal},
booktitle={NAACL},
year={2021}
}
Demo: https://modularqa-demo.apps.allenai.org/ (note that responses might be slow)
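For intuition, the overall decomposition loop can be pictured roughly as in the sketch below. This is illustrative pseudocode only, not code from the released repository; the helper names (next_gen, squad_qa, calculator) are hypothetical stand-ins for the NextGen model and the two sub-models.

def answer_complex_question(complex_question, max_steps=5):
    # Illustrative sketch of the decomposition loop; helper names are hypothetical.
    chain = []  # (sub-question, answer) pairs produced so far
    for _ in range(max_steps):
        # NextGen proposes the next sub-question, tagged with the sub-model to use,
        # e.g. "(squad) What magazine did Hester work for?" or "(math) not(12.4)"
        sub_question = next_gen(complex_question, chain)
        if sub_question == "[EOQ]":  # end-of-questions marker: the chain is complete
            break
        if sub_question.startswith("(squad)"):
            answer = squad_qa(sub_question)    # neural single-span QA sub-model
        else:
            answer = calculator(sub_question)  # symbolic calculator sub-model
        chain.append((sub_question, answer))
    # The answer to the last sub-question is the final answer
    return chain[-1][1] if chain else None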
We used the following subsets of HotpotQA and DROP to train and evaluate our models:
- NextGen Training Data: The decomposition chains generated from these DROP+HotpotQA subsets. These chains were used to train the NextGen model.
- Chains Scorer Training Data: The chains generated by running inference with our NextGen model, with associated labels: 1 indicates that the final answer produced by the chain is correct (F1 > 0.2) and 0 indicates that it is incorrect.
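As a small illustration of how the 0/1 labels for the Chains Scorer data are derived, the sketch below thresholds the answer F1; compute_f1 is a stand-in for the standard HotpotQA/DROP answer-F1 metric and is not a function from this repository.

def label_chain(predicted_answer, gold_answer, threshold=0.2):
    # Label 1 if the chain's final answer is considered correct (F1 > 0.2), else 0
    return 1 if compute_f1(predicted_answer, gold_answer) > threshold else 0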
If you want the predictions of the system without having to run the code, we provide them here:
We also provide the trained models used in our system; a usage sketch follows the list below.
- NextGen Model: A BART-Large model trained to produce the next sub-question given the complex question and previous question-answer pairs. Sample input-output:
Input:
QC: When did the magazine Wallace Hester work for run? QI: (squad) What magazine did Hester work for? A: "Vanity Fair". QS:
Output:
(squad) When did the second Vanity Fair run?
- Chains Scorer Model: A RoBERTa-Large model trained to predict whether the final answer produced by an inference chain is correct (captured by the score for the label 1). Sample input:
QC: How many percent of jobs were not in wholesale? QI: (squad)What percent of jobs are in wholesale? A: 12.4 QI: (math)not(12.4) A: 87.6 QS: [EOQ]
- SQuAD QA Model: A RoBERTa-Large QA model trained on SQuAD 2.0.
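The sketch below shows one way these three models could be loaded and queried with the current HuggingFace Transformers API. It follows the input formats in the samples above and the model directory names used later in this README, but the released fork is based on an older Transformers version, so the exact calls and pre/post-processing in our code differ.

from transformers import (AutoModelForSeq2SeqLM, AutoModelForSequenceClassification,
                          AutoTokenizer, pipeline)
import torch

# NextGen: BART-Large seq2seq model that emits the next sub-question
nextgen_tok = AutoTokenizer.from_pretrained("nexgen_model/")
nextgen = AutoModelForSeq2SeqLM.from_pretrained("nexgen_model/")

def next_subquestion(complex_question, qa_pairs):
    # Build the "QC: ... QI: ... A: ... QS:" input shown in the sample above
    text = "QC: " + complex_question
    for sub_q, ans in qa_pairs:
        text += " QI: " + sub_q + " A: " + ans
    text += " QS:"
    input_ids = nextgen_tok(text, return_tensors="pt").input_ids
    output_ids = nextgen.generate(input_ids, max_length=64)
    return nextgen_tok.decode(output_ids[0], skip_special_tokens=True)

# Chains Scorer: RoBERTa-Large classifier; P(label=1) scores the chain as correct
scorer_tok = AutoTokenizer.from_pretrained("chain_scorer/")
scorer = AutoModelForSequenceClassification.from_pretrained("chain_scorer/")

def score_chain(chain_text):
    inputs = scorer_tok(chain_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = scorer(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of label 1

# SQuAD QA: RoBERTa-Large extractive QA sub-model
squad_qa = pipeline("question-answering", model="qa_model/", tokenizer="qa_model/")

At inference time, candidate chains produced by the NextGen model can then be ranked with score_chain and the final answer taken from the highest-scoring chain.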
We use a fork of the HuggingFace Transformers codebase: https://github.com/tusharkhot/transformers/tree/modularqav2. This fork is based on an older version of Transformers.
To run inference, follow these steps:
- Clone the GitHub repo from https://github.com/tusharkhot/transformers and check out the modularqav2 branch.
git clone https://github.com/tusharkhot/transformers.git modularqa_transformers
cd modularqa_transformers
git checkout modularqav2
export PYTHONPATH=src
- Download the HotpotQA and DROP subsets from above and unzip them here:
wget https://ai2-public-datasets.s3.amazonaws.com/modularqa/hotpot_subset.zip
unzip hotpot_subset.zip
wget https://ai2-public-datasets.s3.amazonaws.com/modularqa/drop_subset.zip
unzip drop_subset.zip
- Save (or download from above) the trained models:
  - NextGen model to nexgen_model/
  - Chain scorer to chain_scorer/
  - SQuAD 2.0 QA model to qa_model/
If the models are downloaded to a different path, change the paths in the config files below.
- To run inference on the HotpotQA dev set, run:
python -u -m modularqa.inference.configurable_inference \
--input hotpot_subset/dev.json \
--output predictions_hotpot_dev.json \
--config modularqa_configs/hotpot_dev_config.json --reader hotpot
To evaluate on the held-out test set, change dev to test in the command above.
To evaluate on the DROP dev set, run:
python -u -m modularqa.inference.configurable_inference \
--input drop_subset/dev.json \
--output predictions_drop_dev.json \
--config modularqa_configs/drop_dev_config.json --reader drop
Similarly, replace dev with test to evaluate on the held-out test set.
NOTE: These inference steps are slow but highly parallelizable. If you want, you can directly use our predictions, available above.
- Compute the metrics using the evaluation scripts released with the HotpotQA and DROP datasets. For example:
- HotpotQA
python -m modularqa.evals.evaluate_hotpot_squad_format \
predictions_hotpot_dev.json hotpot_subset/dev.json
- DROP
python -m modularqa.evals.drop_eval \
--gold_path drop_subset/dev.json \
--prediction_path predictions_drop_dev.json