
Towards Process-Oriented, Modular, and Versatile Question Generation that Meets Educational Needs

Codebase and pre-trained models for NAACL-2022 submission Towards Process-Oriented, Modular, and Versatile Question Generation that Meets Educational Needs by Xu Wang, Simin Fan, Jessica Houghton and Lu Wang.


P2MCQ Dataset

The P2MCQ dataset archives 160 multiple-choice questions with 629 question options in total (197 correct answers and 432 incorrect answers, i.e., distractors) from an HCI-101 course. The dataset can be downloaded here.

Data Preprocessing

Set up

# Suggested: create a virtual environment
conda create -n p2mcq python=3.8
conda activate p2mcq

# Requirement
pip install -r requirements.txt

Parsing PDF Document

For PDF document preprocessing, we first use scipdf-parser to parse the PDF into sections in plain-text format.

To keep the parser running, make sure the GROBID backend is running by executing the following commands in your command line before processing your custom data:

pip install git+https://github.com/titipata/scipdf_parser

git clone https://github.com/titipata/scipdf_parser.git

bash scipdf_parser/serve_grobid.sh

You can process your own pdf-document with the code:

python Data/preprocessing.py --pdf_path <path2pdf_doc> --save_path <path to save processed data> --save_format <save format, default csv>

The pdf_path can be either a path in your local file system or a publicly accessible link (e.g. https://arxiv.org/pdf/1908.08345.pdf).
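The parser returns the document as structured sections. The sketch below illustrates how such output might be flattened into a section-per-row CSV; the dictionary layout and column names here are our illustrative assumptions, not the exact schema produced by preprocessing.py.

```python
import csv
import io

# Example of the section-level structure a PDF parser produces
# (field names are illustrative, not scipdf-parser's exact schema).
article = {
    "title": "Text Summarization with Pretrained Encoders",
    "sections": [
        {"heading": "Introduction", "text": "Summarization condenses a document."},
        {"heading": "Method", "text": "We build on BERT-based encoders."},
    ],
}

def flatten_to_csv(article, fileobj):
    """Write one row per section: (title, heading, text)."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "section_heading", "section_text"])
    for sec in article["sections"]:
        writer.writerow([article["title"], sec["heading"], sec["text"]])

buf = io.StringIO()
flatten_to_csv(article, buf)
print(buf.getvalue().splitlines()[0])  # → title,section_heading,section_text
```

Swapping the header and field names for the script's actual schema is all that should change when working with real parser output.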

Task 1. Make Input for Neural-based Sentence Selection

We follow the extractive summarization methodology introduced by Liu and Lapata (2019) to select salient sentences from the given paragraph.

python Data/task1.py --input_path <path to input passages> --src_write_into <path to save processed input> --tgt_path <path to target summary (optional)> --tgt_write_into <path to save processed target>
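For intuition, the kind of preprocessing involved looks roughly like the sketch below: split each passage into sentences and write one document per line. This is our simplified illustration, not the repository's task1.py — the naive regex splitter and the `[SEP]` separator are assumptions; the actual script follows the input format of Liu and Lapata's codebase.

```python
import re

def split_sentences(paragraph):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace.
    A real pipeline would likely use a proper tokenizer instead."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def write_src_file(passages, path):
    """Write one passage per line, sentences joined by a separator token,
    loosely mirroring extractive-summarization input files (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        for passage in passages:
            f.write(" [SEP] ".join(split_sentences(passage)) + "\n")

sents = split_sentences("BERT encodes sentences. We score each one. Top ones are kept.")
print(sents)  # → ['BERT encodes sentences.', 'We score each one.', 'Top ones are kept.']
```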

Modularized Automatic Models

We provide a set of off-the-shelf and fine-tuned models to modularize the end-to-end MCQ generation process. The subtasks include [T1-Sentence Selection]; [T2-Abstractive Paragraph Summarization]; [T3-Sentence Simplification]; [T4-Paraphrasing]; [T5-Negation Generation].
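Conceptually, the modular design chains interchangeable text-to-text stages, each of which can be swapped for a different model. The sketch below is our illustration of that idea with toy stand-ins, not code from this repository.

```python
from typing import Callable, List

# Each module is a text -> text function; the stages mirror subtasks T1-T5.
Stage = Callable[[str], str]

def make_pipeline(stages: List[Stage]) -> Stage:
    """Compose stages left-to-right into a single callable."""
    def run(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text
    return run

# Toy stand-ins for the real neural models (illustrative only).
select = lambda t: t.split(". ")[0] + "."         # T1: keep one salient sentence
simplify = lambda t: t.replace("utilize", "use")  # T3: lexical simplification
negate = lambda t: t.replace(" is ", " is not ")  # T5: naive negation (distractor)

pipeline = make_pipeline([select, simplify, negate])
print(pipeline("This method is simple to utilize. It also scales."))
# → This method is not simple to use.
```

The benefit of this structure is that each subtask can be evaluated and replaced independently, which is the point of the modular design.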

| Task | Model | Reference |
| --- | --- | --- |
| Sentence Selection (i.e. extractive summarization) | BertSUMEXT | The implementation is based on the original codebase released by Liu and Lapata |
| Abstractive Summarization | BertSUMEXTABS, Bart-HCI | |
| Sentence Simplification | ACCESS, MUSS | The implementation is based on the original codebases (ACCESS, MUSS) released by Martin et al. |
| Paraphrasing | Bart-para-SCI | Fine-tuned on ParaSCI by Dong et al. |
| Negation | CrossAUG | The implementation is based on the original codebase released by Lee et al. |

Evaluation

The quality of the generated texts is evaluated with BLEU, ROUGE-1, ROUGE-2, and ROUGE-L scores. Reference (gold) texts must be provided.

python ./evaluation.py --input_path <input_filepath (txt)> --pred_path <pred_filepath (txt)> --gold_path <gold_filepath (txt)>
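For intuition, ROUGE-1 measures unigram overlap between a prediction and a reference. Below is a minimal from-scratch sketch of the F1 variant; the repository's evaluation.py presumably relies on standard BLEU/ROUGE packages rather than this toy implementation.

```python
from collections import Counter

def rouge1_f(pred: str, gold: str) -> float:
    """Unigram-overlap F1 between prediction and reference tokens."""
    p, g = Counter(pred.lower().split()), Counter(gold.lower().split())
    overlap = sum((p & g).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("the cat sat on the mat", "the cat lay on the mat"), 3))
# → 0.833
```

ROUGE-2 and ROUGE-L follow the same pattern over bigrams and longest common subsequences, respectively.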

Potential Pitfall

  1. If you see the following error message

    OSError: libcublas.so.10: cannot open shared object file: No such file or directory

    check whether your torch and CUDA versions are compatible with your operating system. You can check your CUDA version with nvidia-smi.