
Question Generation using 🤗transformers (modded by DammK)

This started as part of my NLP course project. It includes recent fixes for running on the current framework (see my build commands below).

# cs6493 project (QG task)
# transformers 3.0.2 is incompatible with Python 3.9, hence transformers must be newer than 4.0
conda install -c conda-forge transformers
conda install -c conda-forge nltk
pip install git+https://github.com/Maluuba/nlg-eval.git@master
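
A quick sanity check of the environment against the note above (this snippet is illustrative and not part of the repo):

```python
# Illustrative check: the notebooks assume transformers >= 4.0 on Python 3.9.
import sys
import transformers

print("python", sys.version_info[:2])            # expect (3, 9)
print("transformers", transformers.__version__)  # expect 4.x or newer
assert int(transformers.__version__.split(".")[0]) >= 4, \
    "transformers 3.x does not work on Python 3.9; please upgrade"
```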

The target is simple: make everything run once again, then include the recent qg-hl models. Fixes from multiple issues / PRs may be included. To test the scripts, run the zz*.ipynb notebooks instead; the command lines are included there. This repo may also serve as a review of SOTA models in QG: finding corner cases which generate weird results or crash, documenting features / assumptions made by the original repo, and enriching the repo's functionality. "Things have changed greatly in these few years".

Findings / Added features

  • As on the 🤗 model hub, this pipeline eventually feeds the formatted input to the model. However, it was originally done in an unsupervised manner (using t5-small-qa-qg-hl to extract answers per sentence), so the code is a bit harder to read. (A supervised sketch follows this list.)
  • Extract-answers example with t5-small-qa-qg-hl (the original answers were [Serbian, 1856, 1943, alternating current]; see notebook zz1c):
> extract answers: <hl> Nikola Tesla (Serbian Cyrillic: Никола Тесла; 10 July 1856 – 7 January 1943) was a Serbian American inventor, electrical engineer, mechanical engineer, physicist, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system. <hl> </s>
> икола есла<sep>
  • Workaround for the case above (the extracted answer is not found in the context): I further break the word apart to "hope for" some generated content. By default an empty string is returned; the original pipeline crashes on this case.
  • Then "answer extraction" can be bypassed by providing sentence with highlight with corrosponding highlight token (<hl> or [HL]) (</s> is optional)
  • Then the pipeline now supports both BART and T5 (from 2 repos).
  • However, the local behaviour of BART differs from the online version of bart-squad-qg-hl (online result: "What nationality was Nikola Tesla?"):
> Nikola Tesla (Serbian Cyrillic: Никола Тесла; 10 July 1856 – 7 January 1943) was a [HL]Serbian[HL] American inventor, electrical engineer, mechanical engineer, physicist, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system.
> What was the nationality of Konnikola tesla? 
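
A minimal sketch of the supervised (highlighted-answer) path described above, using plain 🤗 transformers instead of this repo's pipeline. The checkpoint id and the `generate question:` prefix follow the upstream qg-hl models, but treat both as assumptions:

```python
# Sketch only, not the repo's pipeline code: feed an already-highlighted
# sentence to a T5 qg-hl checkpoint, bypassing answer extraction entirely.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "valhalla/t5-small-qg-hl"  # assumption: any *-qg-hl checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The T5 checkpoints highlight the answer with <hl>; the BART checkpoints use [HL].
source = ("generate question: Nikola Tesla was a <hl> Serbian <hl> American "
          "inventor, electrical engineer, mechanical engineer, physicist, and futurist.")

inputs = tokenizer(source, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# e.g. "What nationality was Nikola Tesla?"
```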

TODOs

  • To support more repos, token handling should be based on the model name instead of the model type (a sketch follows this list). However, the currently popular / available SOTA models for this task are only BART and T5; Recurrent BERT was "stuck in implementation", and ERNIE-GEN uses a different framework which has no PyTorch / 🤗 adaptation yet.
  • zz2: the expected training dataset format is already incompatible. Follow the guides for this repo instead.
  • SQuAD v2.0 support. Plausible answers can be cast directly to the QG task, but this is not effective while the training task is stuck.
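
A sketch of the first TODO above: key the highlight token on the checkpoint name rather than the model type. The hub ids below are assumptions used only for illustration:

```python
# Sketch only: pick the highlight token from the checkpoint name so that new
# repos can be supported without branching on the model class.
def hl_token_for(model_name: str) -> str:
    # The BART qg-hl checkpoints highlight with [HL]; the T5 ones use <hl>.
    return "[HL]" if "bart" in model_name.lower() else "<hl>"

print(hl_token_for("p208p2002/bart-squad-qg-hl"))  # [HL]  (hub id assumed)
print(hl_token_for("valhalla/t5-base-qg-hl"))      # <hl>  (hub id assumed)
```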

Notebook list

  • zz1: Original notebook in repo. Clear.
  • zz1b: Some corner case which may crash the original model. Clear.
  • zz1c: Fusing with BART, which is not completed in original repo. Now both BART and T5 can be in supervised mode. Clear.
  • zz2: Training and retrieving score metrics. In progress.
  • zz3: Minimal e2e-qg with score metrics (see the sketch after this list). Clear.
  • zz3b: Minimal question-generation with SQuAD dataset. Clear.
  • zz3c (4x): Full SQuAD validation set on small / base model. Clear.
  • zz4a: BART base with supervised highlighted answer. Clear.
  • zz4b: T5 base with supervised highlighted answer. Clear.
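
For the zz3 notebooks, minimal e2e-qg usage looks roughly like this, assuming the pipelines.py interface from the original repo is unchanged:

```python
# Rough sketch of minimal e2e-qg usage (assumes pipelines.py from this repo).
from pipelines import pipeline

nlp = pipeline("e2e-qg", model="valhalla/t5-base-e2e-qg")
text = ("Python is an interpreted, high-level, general-purpose programming "
        "language created by Guido van Rossum and first released in 1991.")
print(nlp(text))
# e.g. ['Who created Python?', 'When was Python first released?']
```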

Results

As claimed by both repos, all models are trained on SQuAD v1.1. The "base" models are included, which differs from the forked version. Scores are computed per context on the full validation set of SQuAD v1.1 from 🤗 datasets (formerly nlp):

  • hyp.txt: Concatenated generated questions.
  • ref1.txt: Original questions.
  • ref2.txt: Original contexts. Note that the scores are generally higher than what you have seen on the web, but the relative performance should be identical. (A scoring sketch follows the table.)
| Name | Highlight | BLEU-1 | BLEU-2 | BLEU-4 | METEOR | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| t5-base-e2e-qg | Supervised | 68.6667 | 53.0235 | 33.7465 | 28.5125 | 32.7107 |
| bart-squad-qg-hl | Supervised | 67.0877 | 51.0051 | 31.2478 | 26.7013 | 31.6968 |
| t5-base-e2e-qg | Unsupervised | 57.8001 | 47.8133 | 34.1749 | 19.0514 | 35.0973 |
| t5-base-qg-hl | Unsupervised | 69.8286 | 53.4806 | 34.1254 | 21.7064 | 34.8645 |
| t5-small-e2e-qg | Unsupervised | 53.2628 | 43.6088 | 30.6282 | 17.7136 | 33.5326 |
| t5-small-qg-hl | Unsupervised | 69.4194 | 53.1734 | 33.8424 | 21.2269 | 34.0925 |
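
The scores come from nlg-eval (installed above) over the three files listed before the table; reproducing them should look roughly like this:

```python
# Sketch of the scoring step: hyp.txt holds the concatenated generated
# questions, ref1.txt / ref2.txt hold the references described above.
from nlgeval import compute_metrics

metrics = compute_metrics(hypothesis="hyp.txt",
                          references=["ref1.txt", "ref2.txt"])
print(metrics)  # BLEU-1/2/3/4, METEOR, ROUGE_L, ...
```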

Citations

@misc{questiongeneration21,
    author = {Philip Huang},
    title = {Question Generation},
    publisher = {GitHub},
    journal = {GitHub repository},
    year = {2021},
    howpublished={\url{https://github.com/p208p2002/Transformer-QG-on-SQuAD}}
}
@misc{questiongeneration20,
    author = {Suraj Patil},
    title = {Question Generation},
    publisher = {GitHub},
    journal = {GitHub repository},
    year = {2020},
    howpublished={\url{https://github.com/patil-suraj/question_generation}}
}

Relevant papers