As the final project for CS1470 Deep Learning, we re-implemented the Dual Co-Matching Network for multiple-choice reading comprehension introduced in this paper. The model uses BERT-Small with 4 encoder layers and a hidden size of 256, and achieves a test accuracy of 0.6138. The model definition is stored in model.py, and main.py is the script for running the training task. The current model is trained on a subset of the high school questions in the RACE dataset.
Machine reading comprehension (MRC) endows computers with the ability to read, analyze, and summarize text. With the advent of the information age, the scale of text data is exploding, so MRC can greatly benefit many aspects of society by intelligently automating text processing. For example, search engines can use MRC to better understand user queries and return more relevant results: when we enter a question into Google, it can sometimes return the correct answer directly by highlighting it in its context. Likewise, language educators can employ MRC to automatically evaluate essays, and doctors can apply MRC to patients' symptoms and medical histories to support diagnosis and treatment.
Intrigued by MRC and its wide applications, our group explored many papers that tackled various MRC tasks, including Visual Question Answering (VQA) tasks, multi-modal MRC tasks, etc. Eventually, we decided to reimplement a model that can solve multi-choice reading comprehension questions for the following reasons:
First, multi-choice MRC questions are more challenging for NLP models than traditional MRC questions. While the expected answer in traditional MRC is usually just a span extracted from the given passage, multi-choice MRC is non-extractive and includes harder question types such as commonsense reasoning and passage summarization, where the answer may not appear in the original passage at all. Second, the model uses the pre-trained BERT as its encoder. Since BERT is a well-known model that achieves state-of-the-art results on a wide variety of NLP tasks, we wanted to use this opportunity to learn about and implement something BERT-based ourselves. Third, the original paper does not come with released code, so we take on greater responsibility and get more hands-on coding experience than we would by reimplementing a paper that already has code. Fourth, the original paper uses the RACE dataset, which was built from middle and high school English exams in China. All of our group members are from China and have taken these exams before, so the dataset feels very relevant to our own educational experience.
Our task is a supervised classification task (choosing among the four candidate options). We believe this task is exciting because it involves both the typical NLP workflow of cleaning and tokenizing data and the DL research workflow of proposing novel model structures that perform better on specific tasks. Many of the architectural components proposed by the authors are quite intuitive, so we wanted to see whether they actually provide meaningful performance improvements.
We use a dataset called RACE (Lai et al. 2017), which is recognized as one of the largest and most difficult datasets for multi-choice reading comprehension. It was collected from English exams for middle and high school students in China, and consists of nearly 28,000 passages and nearly 100,000 questions written by human experts (English instructors). It covers a variety of topics carefully designed to evaluate students' understanding and reasoning abilities. Each passage corresponds to multiple questions (usually 3 or 4), and each question has 4 options to choose from.
We use a subset of the RACE dataset (initially 1,000 passages with ~3,200 questions, and later 5,000 passages with ~20,000 questions) to speed up training. Although a larger dataset would likely have helped, training time scales roughly proportionally with dataset size, and our short timeframe limited us in this respect.
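For reference, each RACE file bundles one passage with its questions; a minimal loading sketch in Python (assuming the released JSON layout with article, questions, options, and answers fields, and a hypothetical directory pattern) could look like this:

```python
import glob
import json

def load_race_split(pattern="RACE/train/high/*.txt"):
    """Flatten RACE files into (passage, question, options, label) tuples.

    Assumes each file holds one JSON object with "article", "questions",
    "options" (a list of four strings per question), and "answers" ("A"-"D").
    The directory pattern above is hypothetical.
    """
    examples = []
    for path in glob.glob(pattern):
        with open(path) as f:
            item = json.load(f)
        for question, options, answer in zip(
            item["questions"], item["options"], item["answers"]
        ):
            label = "ABCD".index(answer)  # gold option as an index 0-3
            examples.append((item["article"], question, options, label))
    return examples
```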
Our model consists of the following components:

Contextualized Encoding. We use a pre-trained model (BERT) as the encoder to obtain sequence representations of the passage, the question, and the answer options. One sequence embedding is generated for each of the three, and these embeddings are used in the subsequent layers.

Bidirectional Co-Matching. We now have the sequence representation of the question from the encoding step (Q), the passage sentence selection output (P), and the answer option interaction output (O). We then build bidirectional matching to get all pairwise representations among the triplet. Bidirectional matching is a form of attention: when we match (P, Q), for example, we are essentially computing the attention of P on Q. We compute this attention in both directions and end up with 6 sets of attention value vectors.

Pooling & Gating. We first max-pool each sequence of attention vectors into a single vector, similar to BERT's max-pooled output. To build the representation of the bidirectional attention between a pair such as (P, Q), we use a sigmoid gate to decide the remember/forget ratio of each element from P and Q. This step results in 3 output vectors, one for each pair among P, Q, and O; a sketch of one such matching block is given below.

Classification. Finally, we use a linear layer with softmax activation to perform the 4-way classification.
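To make the matching, pooling, and gating steps concrete, here is a minimal TensorFlow sketch of one bidirectional matching block for a single pair, say (P, Q). The layer names and exact attention formulation are our own assumptions for illustration; they follow the description above rather than the authors' (unreleased) implementation.

```python
import tensorflow as tf

class BidirectionalMatch(tf.keras.layers.Layer):
    """Sketch: bidirectional attention between two encoded sequences,
    followed by max-pooling and a sigmoid fusion gate."""

    def __init__(self, hidden_size):
        super().__init__()
        self.w_att = tf.keras.layers.Dense(hidden_size, use_bias=False)  # bilinear-style scores
        self.gate = tf.keras.layers.Dense(hidden_size, activation="sigmoid")

    def call(self, p, q):
        # p: [batch, len_p, hidden], q: [batch, len_q, hidden]
        scores = tf.matmul(self.w_att(p), q, transpose_b=True)          # [batch, len_p, len_q]
        p_att_q = tf.matmul(tf.nn.softmax(scores, axis=-1), q)          # P attending to Q
        q_att_p = tf.matmul(tf.nn.softmax(scores, axis=1), p,
                            transpose_a=True)                           # Q attending to P

        # Max-pool each attended sequence into a single vector.
        m_p = tf.reduce_max(p_att_q, axis=1)                            # [batch, hidden]
        m_q = tf.reduce_max(q_att_p, axis=1)                            # [batch, hidden]

        # Sigmoid gate decides the remember/forget ratio between the two sides.
        g = self.gate(tf.concat([m_p, m_q], axis=-1))
        return g * m_p + (1.0 - g) * m_q                                # fused pair representation
```

In the full model, one such block would be applied to each of the three pairs (P, Q), (P, O), and (Q, O), and the three fused vectors concatenated before the final classification layer.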
The model outputs a softmax vector of length 4, and we use categorical cross-entropy as the training loss. The accuracy metric is simply categorical accuracy, i.e. acc = number correct / total number of questions.
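Concretely, this is just Keras' built-in categorical cross-entropy and categorical accuracy; a toy example with made-up numbers:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()   # expects one-hot labels over the 4 options
acc_fn = tf.keras.metrics.CategoricalAccuracy()       # no. correct / total no.

y_true = tf.constant([[0.0, 0.0, 1.0, 0.0]])          # gold answer is option C
y_pred = tf.constant([[0.1, 0.2, 0.6, 0.1]])          # model's softmax output over A-D

loss = loss_fn(y_true, y_pred)                        # -log(0.6) ~= 0.51
acc_fn.update_state(y_true, y_pred)                   # argmax matches, so accuracy is 1.0
print(float(loss), float(acc_fn.result()))
```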
There were two major challenges (and, of course, a lot of smaller ones).

First, even though BERT is on the smaller side for modern language models (110M parameters, compared to 176B for BLOOM!), it is still very resource-hungry, especially given the length of our passages. After realizing that training a full-sized BERT was out of reach, we turned to various distilled versions of BERT. We experimented with a few and ended up using one with 4 layers and an embedding size of 256, compared to the standard 12 layers and 768. This sped up training significantly.

Second, the paper we implement provides only a high-level overview of the structure of the dual co-matching network (DCMN), and a lot of details are missing. Many dimensions are unspecified: to start with, should we concatenate the four options together using separators, or should we pass them in separately? We ended up trying both (see the sketch below), and the same goes for several other implementation details.
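For illustration, the two packing strategies we debated look roughly like this (plain strings before tokenization; the helper names are hypothetical):

```python
# Strategy A: one input sequence per option, each scored independently.
def pack_per_option(passage, question, options, sep="[SEP]"):
    return [f"[CLS] {passage} {sep} {question} {sep} {opt} {sep}" for opt in options]

# Strategy B: all four options concatenated into a single sequence with separators.
def pack_all_options(passage, question, options, sep="[SEP]"):
    joined = f" {sep} ".join(options)
    return f"[CLS] {passage} {sep} {question} {sep} {joined} {sep}"
```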
In addition, one thing that must be brought up is the disastrous state of compatibility in TensorFlow. We avoided using any non-TensorFlow third-party packages and resources, but even then we frequently ran into incompatibility issues when loading models, and many of these are undocumented or inconsistent with the official documentation. Our final model runs, but complains that the BERT pre-trained by Google (the most authoritative developer!) relies on a feature that will be deprecated. The group felt a sense of relief when the project and course were done, knowing that we could start using PyTorch.
As the RACE dataset is made exclusively of multiple-choice problems with 4 choices, chance accuracy would be around 25%. We were happy to see that most model architectures we experimented with could easily train to an accuracy above 30%, demonstrating that the model is at least capable of capturing some of the relevant relationships.
In our initial 10-epoch training tests (which take around 10 minutes per model), we found the 4-layer, 256-embedding model, BERT_small, to be significantly better than the 2-layer, 128-embedding BERT_tiny (0.394 vs. 0.295 accuracy), while the 4-layer, 128-embedding model showed little improvement over the 2-layer, 128-embedding one (0.32 vs. 0.295). This is quite natural, as the 256-embedding model also supports a standard sequence length of 512, which is about enough to encode our passages, whereas the smaller models only support a maximum sequence length of 128. We ended up using the 4-layer BERT_small because our hardware does not allow larger models and its accuracy is good enough. We trained this model for 80 epochs in increments of 10-20 epochs and achieved a final test accuracy of 0.6138. Since each epoch takes minutes, we were unable to do an extensive hyperparameter search; better hyperparameters or a different classification layer could have helped, since multiple sources suggest that distilled BERTs perform quite close to their larger counterparts.
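For reference, compact BERT encoders of these sizes are published on TensorFlow Hub; one way to load and swap them looks roughly like the sketch below. The exact hub handles and version numbers are assumptions and may need updating.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops needed by the preprocessor)

# Hub handles are assumptions; check tfhub.dev for the current versions.
ENCODERS = {
    "bert_tiny":  "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2",
    "bert_small": "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/2",
}

preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(ENCODERS["bert_small"], trainable=True)

inputs = preprocessor(tf.constant(["Example passage [SEP] question [SEP] option"]))
outputs = encoder(inputs)   # dict with "pooled_output" and "sequence_output"
```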
While this result is around 6 percentage points lower than the accuracy achieved by the original authors (0.67), we are fairly happy with it because we use a BERT model that is several times smaller and train on a smaller dataset. Our above-60% accuracy already beats several earlier benchmarks on the RACE dataset from 2018 and 2019. We cannot visualize our model meaningfully since the data is pure text, but here are a few of our model's answers; please check them out! After observing a handful of sample passages, we believe our model performs better when the key information is explicitly present in the passage, which is understandable. However, this insight comes purely from printing predictions out and might not generalize to the entire dataset.
Original Paper:
@inproceedings{dcmn,
title={DCMN+: Dual Co-Matching Network for Multi-choice Reading Comprehension},
author={Shuailiang Zhang and Hai Zhao and Yuwei Wu and Zhuosheng Zhang and Xi Zhou and Xiang Zhou},
year={2020},
booktitle = "{The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)}",
}
Dataset:
@article{lai2017large,
title={RACE: Large-scale ReAding Comprehension Dataset From Examinations},
author={Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard},
journal={arXiv preprint arXiv:1704.04683},
year={2017}
}