BlueBrain/Search

Question-Answering: collect example questions + run first analysis with pre-trained QA models

FrancescoCasalegno opened this issue · 4 comments

  • Contact scientists and collect questions that should be run against Blue Brain Search.
  • Run first analysis of predictions obtained with pre-trained QA models.

QA Dataset

Link to spreadsheet with QA dataset here:
https://bbpgitlab.epfl.ch/ml/search-qa-data

The spreadsheet contains the following information.

  • Questions: from scientists
  • Contexts: manually picked by me using Google from scientific papers
  • Answers: manually extracted by me

Current status:

  • 70 question-answer pairs collected from 3 scientists

To do:

  • Ask scientists to double check dataset
  • Collect more samples from knowledge graph

Dataset structure

  • Questions collected from 3 scientists: HM, WvG, PS
  • Contexts picked from scientific literature
  • Answers annotated by hand (only 1 answer per question was labeled)
  • 80 context-question samples in total; the ratio of answerable to unanswerable questions is ~2:1, similar to what was done for the SQuAD v2 dataset.
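
As an illustration of the structure above, here is a minimal sketch of how one sample could be represented in Python. The class name `QASample`, its field names, and the helper `answerable_ratio` are hypothetical and are not taken from the spreadsheet; the sketch also assumes that an unanswerable question is marked by a missing answer (`None`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QASample:
    """One context-question sample (hypothetical layout, not the spreadsheet schema)."""

    question: str          # question asked by a scientist
    context: str           # passage picked from the scientific literature
    answer: Optional[str]  # single hand-annotated answer, None if unanswerable
    source: str            # initials of the scientist who asked, e.g. "HM"

    @property
    def is_answerable(self) -> bool:
        return self.answer is not None


def answerable_ratio(samples: List[QASample]) -> float:
    """Return the ratio of answerable to unanswerable samples."""
    n_answerable = sum(s.is_answerable for s in samples)
    n_unanswerable = len(samples) - n_answerable
    if n_unanswerable == 0:
        raise ValueError("No unanswerable samples, ratio is undefined.")
    return n_answerable / n_unanswerable
```

With two answerable samples and one unanswerable sample, `answerable_ratio` returns 2.0, which corresponds to the ~2:1 ratio mentioned above.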

First results

  • We evaluated different pre-trained QA models on this dataset to get an idea of how far we are from solving our task.
  • Note that the questions in our dataset are hard and often have no obvious answer, even for a human. Moreover, in reality more than one answer could be acceptable (e.g. "about 36 mV", "36 mV", "36 mV measured at steady state" could all be acceptable answers), but for the sake of simplicity we only have one ground truth answer per sample.
  • The first analysis was to compare how many questions are answerable vs. unanswerable in the ground truth and in the predicted answers. We can clearly see that dmis-lab/biobert-large-cased-v1.1-squad is the worst model in this respect; this can be explained by the fact that it was trained on SQuAD instead of SQuAD v2, and only the latter includes unanswerable questions.
  • Then, we looked at how these models performed when evaluated using the two metrics that are most commonly used for QA tasks like SQuAD: EM (exact match) and F1 (see here for the definitions).
    The models perform fairly similarly overall, but models pre-trained on biology data (biobert and BioM) and then fine-tuned on SQuAD v2 seem to achieve the best performance. In particular, sultan/BioM-ELECTRA-Large-SQuAD2 seems to be the best model, with some margin over the other models. Note that these are results on a small dataset and only one answer per question was annotated, so it is too early to draw any definitive conclusion.
  • It is also interesting to look at the performance of these models on answerable vs. unanswerable questions separately. As noted above, dmis-lab/biobert-large-cased-v1.1-squad performs very poorly on unanswerable questions, while for the other models performance on answerable and unanswerable questions seems comparable, and sultan/BioM-ELECTRA-Large-SQuAD2 seems to be consistently better.
  • Finally, we can inspect the predictions of this model more closely. We sorted the samples in the dataset by increasing F1 score (i.e. worst to best) and we also included the top answers according to the model, to check whether the second or third most likely predicted answer would have been the correct one. All results are shown here: qa-samples-predictions.pdf
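
For reference, the EM and F1 metrics used above can be sketched as follows. This is a minimal re-implementation in the style of the SQuAD evaluation (lowercasing, stripping punctuation and English articles, then comparing tokens), not the exact code used to produce the results in this issue. The convention that an empty string stands for "no answer" and the helper `best_f1_in_top_k` are assumptions made for illustration.

```python
import re
import string
from collections import Counter
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between the prediction and the ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Assumption: an empty string means "unanswerable"; the score is 1.0
    # only if both prediction and ground truth are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1_in_top_k(predictions: List[str], truth: str, k: int = 3) -> float:
    """Best F1 among the k most likely answers (hypothetical helper)."""
    return max((f1_score(p, truth) for p in predictions[:k]), default=0.0)
```

With the example answers mentioned earlier, if the ground truth is "36 mV", then the prediction "about 36 mV" gets EM = 0 and F1 of approximately 0.8, and "36 mV measured at steady state" gets EM = 0 and F1 of approximately 0.5. This shows how having a single annotated answer per sample can penalize predictions that a human might accept.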

Reviewers

Can you please have a look at these results and provide feedback or request further analyses?
Thanks in advance!