Code for my EMNLP 2018 paper "SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach"

Simple Question Answering — EMNLP 2018

This is the code for the EMNLP 2018 paper "SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach".

On the SimpleQuestions dataset, one of the most commonly used benchmarks for studying single-relation factoid questions, we:

  1. Show that ambiguity in the data bounds performance on this benchmark at 83.4%; there are often multiple answers that cannot be disambiguated from the question alone.
  2. Introduce a baseline that sets a new state-of-the-art performance level at 78.1% accuracy, using only standard methods.

Example

[Image: preview of the software]

Structure

.
├── /notebooks/
│   ├── /Simple QA End-To-End/           # Experiments on components of the end-to-end QA pipeline
│   ├── /Simple QA Models/               # Experiments on various neural models
│   ├── /Simple QA KG to PostgreSQL DB/  # Scripts to populate PostgreSQL
│   └── /Simple QA Numbers/              # Scripts for computing and verifying various numbers
├── /pretrained_models/
├── /lib/                                # Various utility functions
├── /tests/
├── .flake8
└── requirements.txt                     # Required Python packages

Prerequisites

This repository requires Python 3.5 or greater and PostgreSQL.

Installation

  • Clone the repository and cd into it:

    git clone https://github.com/PetrochukM/Simple-QA-EMNLP-2018.git
    cd Simple-QA-EMNLP-2018

  • Install the required packages:

    python -m pip install -r requirements.txt

  • Create and populate a PostgreSQL table named fb_two_subject_name with notebooks/Simple QA KG to PostgreSQL DB/fb_two_subject_name.csv.gz

  • Create a .pass file using the below template:

    DB_NAME=
    DB_PORT=
    DB_USER=
    DB_HOST=
    DB_PASS=
    

    Such that:

    • DB_NAME: the database name
    • DB_USER: user name used to authenticate
    • DB_PASS: password used to authenticate
    • DB_HOST: database host address
    • DB_PORT: connection port number (typically 5432)
  • Download the SimpleQuestions v2 dataset from Facebook Research. Use the notebook at Simple-QA-EMNLP-2018/notebooks/Simple QA KG to PostgreSQL DB/FB5M & FB2M KG to DB.ipynb to create and populate a PostgreSQL table.

  • You're done! Feel free to run Simple-QA-EMNLP-2018/notebooks/Simple QA End-To-End.
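
As a rough illustration of how the .pass template above maps to a database connection, here is a minimal sketch in Python. It is not the repository's own loader: the helper names parse_pass_file and connection_kwargs are hypothetical, the example values are made up, and the use of psycopg2 as the PostgreSQL driver is an assumption.

```python
def parse_pass_file(text):
    """Parse KEY=VALUE lines (the .pass template format) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def connection_kwargs(settings):
    """Map the DB_* keys onto keyword names accepted by psycopg2.connect."""
    return {
        "dbname": settings["DB_NAME"],
        "user": settings["DB_USER"],
        "password": settings["DB_PASS"],
        "host": settings["DB_HOST"],
        # Fall back to PostgreSQL's usual port when DB_PORT is left empty.
        "port": int(settings.get("DB_PORT") or 5432),
    }


# Made-up values for illustration only.
example = """
DB_NAME=simple_qa
DB_PORT=5432
DB_USER=postgres
DB_HOST=localhost
DB_PASS=secret
"""
kwargs = connection_kwargs(parse_pass_file(example))
print(kwargs["dbname"], kwargs["host"], kwargs["port"])

# With a real .pass file and psycopg2 installed (an assumption), one could write:
#   import psycopg2
#   with open(".pass") as f:
#       conn = psycopg2.connect(**connection_kwargs(parse_pass_file(f.read())))
```

Keeping the parsing separate from the connection call makes it possible to check the settings without a running database.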

Slides

The slides used for our EMNLP talk.

Citation

@article{Petrochuk2018SimpleQuestionsNS,
  title={SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach},
  author={Michael Petrochuk and Luke S. Zettlemoyer},
  journal={CoRR},
  year={2018},
  volume={abs/1804.08798}
}

Important Notes

  • The FB2M and FB5M subsets of the Freebase KG can complete 7,188,636 and 7,688,234 graph queries respectively; by this measure, FB5M is only 6.9% larger than FB2M. FB5M also contains only 3.98M entities, which contradicts the statement that "FB5M, is much larger with about 5M entities" (Bordes et al., 2015).
  • FB5M and FB2M contain 4,322,266 and 3,654,470 duplicate grouped facts respectively.
  • FB2M is not a subset of FB5M: one atomic fact is in FB2M but not in FB5M, namely (01g4wmh, music/album/acquire_webpage, 02q5zps).
  • FB5M and FB2M do not contain the answer for 24 and 36 examples in the SimpleQuestions dataset respectively; therefore, those examples are unanswerable.
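
The 6.9% figure in the first note can be checked directly from the two query counts. The short Python sketch below only redoes that arithmetic; the variable and function names are illustrative and do not come from the repository.

```python
# Graph-query counts quoted in the notes above.
fb2m_queries = 7188636
fb5m_queries = 7688234


def relative_increase(smaller, larger):
    """Return how much larger `larger` is than `smaller`, in percent."""
    return (larger - smaller) / smaller * 100


extra_queries = fb5m_queries - fb2m_queries
increase = relative_increase(fb2m_queries, fb5m_queries)
print("FB5M completes {:,} more graph queries than FB2M ({:.1f}% more)".format(
    extra_queries, increase))
```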

Other Important Papers

Other Important GitHub Repositories