Status:

  • Error analysis completed for multiway attention model and RoBERTa (base) model (link)

  • Please check out the AdityaGovardhan/transformer GitHub repository for modifications related to RoBERTa.

  • The random baseline model provided by Allen AI is set up, run, and tested (Google Colab link)

  • Lifu Huang et al.'s multiway attention model is set up, run, and tested (Google Colab link)

  • RoBERTa baseline model set up, run, and tested (Google Colab link)

  • RoBERTa (large) and BERT (large) models explored; efforts abandoned due to memory issues

  • Grid search performed on the RoBERTa (base) model to find optimal baseline parameters; accuracy improved from 65.92% to 68.87%. This Google Sheets doc has the grid search results.

  • SocialIQA approach: fine-tuned roberta-base on the SocialIQA dataset to incorporate social commonsense knowledge, improving accuracy from 65.92% to 67.75% (see External Knowledge Infusion below).

  • Semi-supervised approach: the hypothesis is that language understanding tasks benefit from combining unsupervised pre-training with supervised fine-tuning. (Google Colab Link)

  • Several approaches such as K-BERT and K-Adapter were explored but could not be made to work in the available timeframe.

  • Two summarization approaches implemented: extractive summary replacing the paragraph and extractive summary appended to the paragraph. The latter obtained an accuracy of 67.57%, an improvement over the baseline (see the sketch under Running Context Summarization below). Google Colab Link

Best Results:

| Model | Gradient Accumulation Steps | Epochs | Learning Rate | Max Sequence Length | Batch Size | Accuracy |
|---|---|---|---|---|---|---|
| roberta-base (phase 1 results) | 2 | 3 | 5e-5 | 80 | 16 | 65.92% |
| roberta-base (grid search) | 3 | 5 | 5e-5 | 192 | 16 | 68.87% |
| roberta-base-uncased (semi-supervised approach) | 3 | 3 | 2e-5 | 100 | 8 | 61.13% |
| roberta-base + finetuned on SocialIQA | 3 | 6 | 5e-5 | 192 | 16 | 67.75% |
| roberta-base + finetuned on SocialIQA + COSMOS finetuned | 3 | 6 | 5e-5 | 192 | 16 | 66.2% |
| roberta-base + summarization + COSMOS | 3 | 5 | 5e-5 | 192 | 16 | 63.31% |
| roberta-base + context + summarization + COSMOS | 3 | 5 | 5e-5 | 192 | 16 | 67.57% |

Conclusion:

Based on the error analysis of the baseline models, we modified the existing models so that they could learn more implicit knowledge and context. The model fine-tuned on SocialIQA performed better than the existing results because of the similarity between that dataset and the task we are targeting, whereas the semi-supervised approach did not give any performance boost.

One of our next steps is to use the model pre-trained on the SocialIQA dataset on the CosmosQA dataset with the semi-supervised approach. Since summarization gave better results than our baseline models, we will use that approach to summarize the context of each data sample and use the summary when fine-tuning the model before evaluation. At the cost of longer training time, we expect such a model to outperform the baselines.


Individual Efforts:

  • Kunal :
  1. Trying to run cosmosqa_wilburOne on the ASU Agave cluster; currently the cluster is not picking up my job.
  2. Attempting to run the RoBERTa model on Google Colab.
  3. Setting up a framework for performing error analysis on the test and validation datasets.
  4. Phase 1 report.
  5. Performed error analysis on 15 samples from the 1000-1500 range of the validation set.
  6. Fine-tuned the roberta-base model on the SocialIQA dataset to incorporate social knowledge.
  7. Fine-tuning on the SocialIQA dataset increased roberta-base performance from 65.92% to 67.75%; we call this model roberta_social_iqa.
  8. Fine-tuned roberta_social_iqa on CosmosQA to boost performance, obtaining 66.2% accuracy. Google Colab
  • Jay :
  1. Was able to run cosmosqa_wilburOne on my local CPU machine (without GPUs), but since training + evaluation on the large dataset would take too long, trying to host the process on Google Colab or Cloud Shell; facing issues installing NVIDIA Apex and a few other version conflicts.
  2. Also reading the paper at https://arxiv.org/pdf/1904.01172.pdf to come up with model tweaks based on error analysis.
  3. Now running cosmosqa_wilburOne on my Google Colab account with varying values of batch size, epochs, and train_examples (training data).
  4. The updated results with different parameters are stored in the results/ folder and will later be used for error analysis.
  5. Recorded a maximum accuracy of 60.77% using this model with the hyperparameters train_examples=5000, epochs=2, lr=3e-5, batchsize=12.
  6. Did an error analysis of 15 samples from results/b12l3e-5ep2.txt for the multiway attention model; the inferences on the produced errors are visible here as different sheets.
  7. Preliminary results and conclusions are well documented in /COSMOS_QA__Phase_1_Report_CSE_576.pdf
  8. Currently implementing the semi-supervised approach for the multiple-choice question answering task, as defined in OpenAI's paper "Improving Language Understanding by Generative Pre-Training", on the official CosmosQA dataset.
  9. The run script, results, and model details for this approach are stored in the Semi-Supervised Approach folder.
  • Aditya :
  1. Set up, ran, and tested the random baseline model provided by Allen AI on Google Colab.
  2. Tried running cosmosqa_wilburOne on a Windows + GPU setup, but couldn't proceed since NVIDIA's 'apex' library has limited CUDA support on Windows.
  3. Tried running cosmosqa_wilburOne on Windows Subsystem for Linux (WSL); couldn't proceed since WSL doesn't have GPU access.
  4. Multiway Attention model ran successfully on Google Colab, debugged issues along with Jay and Kunal
  5. RoBERTa baseline model ran successfully on Google Colab, debugged issues along with Vasishta and Suryanshu
  6. Error Analysis for multiway attention model
  7. Phase 1 Report
  8. Performed grid search on the roberta-base baseline model, improving accuracy from 65.92% to 68.87%.
  9. Modified the transformers code to disable checkpoint saving and ran RoBERTa-large successfully.
  • Vasishta :
  1. Evaluated the use of AWS GPU instances (K80, p2.xlarge) to run cosmosqa_wilburOne for the project.
  2. Set up and ran RoBERTa on Google Colab.
  3. Debugged a RoBERTa backward-compatibility issue: "KeyError: 'token_type_ids'".
  4. Tried an older RoBERTa model to bypass the token type embedding layer issue.
  5. Phase 1 Report.
  6. Performed error analysis on 19 samples from the 2000-2500 range of the validation set.
  7. Coordinated the roberta-base grid search, improving accuracy by about 3%.
  8. Set up a Kaggle notebook to mitigate Google Colab issues such as idle timeouts and the 12-hour limit.
  9. Explored use of bert-large to improve accuracy
  • Suryanshu :
  1. Went through the CosmosQA paper and the write-up at http://jalammar.github.io/illustrated-bert/ to get an idea of how standard BERT works. Also going through the GPT-2 paper, since it has been used to improve CosmosQA performance.
  2. Initially tried to run the project (cosmosqa_wilburOne) on my Windows system but didn't succeed because the apex module could not be installed correctly.
  3. Installed Ubuntu as a dual boot and tried running cosmosqa_wilburOne. Had issues with the CUDA installation initially, but later installed apex and CUDA and was able to access the GPU on my system. Currently having issues with CUDA's GPU memory consumption.
  4. Able to run the pre-trained RoBERTa base and large models on Google Colab, but only on a commonsense dataset. Still modifying the setup to evaluate against the CosmosQA dataset, which hopefully will result in better predictions.
  5. Modified the RoBERTa baseline model to output evaluation results with details such as predicted and expected labels.
  6. Error analysis for both the CosmosQA baseline model and the RoBERTa baseline model.
  7. Phase 1 report.
  8. Tried to run the RoBERTa-large model in Hugging Face, which proved very computationally expensive and couldn't be trained in time.
  9. Read several papers, such as K-BERT and K-Adapter, on how to incorporate a knowledge graph into BERT and RoBERTa models.
  10. Tried implementing the enhanced adversarial training for NLP paper, but it was difficult to train two BERT models.
  11. Explored some approaches to summarization. Successfully implemented an extractive-summarization-based approach and improved performance. Tried several approaches to abstractive summarization, but the embedding step alone takes a significant amount of time.

Running cosmosqa_baseline:
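
Assuming cosmosqa_baseline refers to the random baseline provided by Allen AI (see the status list), here is a minimal sketch of what it does: pick one of the four answer choices uniformly at random and measure dev accuracy, which is about 25% in expectation. The CSV column name `label` and the example path are assumptions about the data format, not taken from the baseline's actual code.

```python
import csv
import random

def random_baseline_accuracy(dev_csv_path, seed=42):
    """Guess uniformly among the 4 answer choices and report accuracy."""
    rng = random.Random(seed)
    correct = total = 0
    with open(dev_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prediction = rng.randrange(4)              # uniform guess over answer0..answer3
            correct += int(prediction == int(row["label"]))
            total += 1
    return correct / total

# Example (hypothetical path): print(random_baseline_accuracy("data/valid.csv"))
```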


Running cosmosqa_wilburOne:
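
cosmosqa_wilburOne is Lifu Huang et al.'s multiway attention baseline. As a rough illustration only (not the repository's actual code), the sketch below implements the four matching functions of multiway attention (concat, bilinear, dot, and minus attention, after Tan et al.'s Multiway Attention Networks); the class name, layer shapes, and hidden size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiwayAttention(nn.Module):
    """Attend a passage over a question with concat, bilinear, dot, and minus attention."""

    def __init__(self, hidden):
        super().__init__()
        self.w_concat = nn.Linear(2 * hidden, hidden)
        self.w_bilinear = nn.Linear(hidden, hidden, bias=False)
        self.w_dot = nn.Linear(hidden, hidden)
        self.w_minus = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, 1, bias=False)       # scoring vector shared across flavours

    def _aggregate(self, scores, question):
        # scores: (B, P, Q) -> attention-weighted question summary per passage token
        return torch.bmm(F.softmax(scores, dim=-1), question)

    def forward(self, passage, question):
        # passage: (B, P, H), question: (B, Q, H)
        p = passage.unsqueeze(2)                        # (B, P, 1, H)
        q = question.unsqueeze(1)                       # (B, 1, Q, H)
        concat = torch.cat([p.expand(-1, -1, q.size(2), -1),
                            q.expand(-1, p.size(1), -1, -1)], dim=-1)
        s_concat = self.v(torch.tanh(self.w_concat(concat))).squeeze(-1)
        s_bilinear = torch.matmul(self.w_bilinear(passage), question.transpose(1, 2))
        s_dot = self.v(torch.tanh(self.w_dot(p * q))).squeeze(-1)
        s_minus = self.v(torch.tanh(self.w_minus(p - q))).squeeze(-1)
        # One attended question representation per matching function, concatenated per token.
        return torch.cat([self._aggregate(s, question)
                          for s in (s_concat, s_bilinear, s_dot, s_minus)], dim=-1)

# Toy usage: 2 examples, 7 passage tokens, 5 question tokens, hidden size 64.
attn = MultiwayAttention(hidden=64)
print(attn(torch.randn(2, 7, 64), torch.randn(2, 5, 64)).shape)   # torch.Size([2, 7, 256])
```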


Running RoBERTa base:
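
The RoBERTa runs in this report are multiple-choice fine-tuning on CosmosQA: the context and question are paired with each of the four answers. A minimal sketch of scoring one example with Hugging Face's RobertaForMultipleChoice; the example text is made up and this is not our exact Colab training script.

```python
import torch
from transformers import RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMultipleChoice.from_pretrained("roberta-base")

context = "Travis grilled burgers in the backyard while everyone set the table."
question = "What will Travis likely do after grilling?"
choices = [
    "He will bring the burgers to the table.",
    "He will go back to work immediately.",
    "He will throw the burgers away.",
    "He will start grilling again from scratch.",
]

# One (context + question, answer) pair per choice; the model sees 4 sequences per example.
first_segments = [f"{context} {question}"] * len(choices)
enc = tokenizer(first_segments, choices, padding=True, truncation=True,
                max_length=192, return_tensors="pt")
# Reshape to (batch_size=1, num_choices=4, seq_len), as the multiple-choice head expects.
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits                    # shape: (1, 4), one score per choice
print("predicted answer index:", logits.argmax(dim=-1).item())
```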


Grid Search:

Results:

| Model | Gradient Accumulation Steps | Epochs | Learning Rate | Max Sequence Length | Batch Size | Accuracy |
|---|---|---|---|---|---|---|
| roberta-base | 1 | 3 | 5e-5 | 64 | 64 | 64.95% |
| roberta-base | 2 | 3 | 5e-5 | 64 | 64 | 64.05% |
| roberta-base | 1 | 3 | 5e-5 | 128 | 16 | 60.26% |
| roberta-base | 2 | 3 | 5e-5 | 128 | 16 | 66.36% |
| roberta-base | 3 | 3 | 5e-5 | 128 | 16 | 67.63% |
| roberta-base | 4 | 3 | 5e-5 | 128 | 16 | 65.72% |
| roberta-base | 5 | 3 | 5e-5 | 128 | 16 | 63.98% |
| roberta-base | 3 | 4 | 5e-5 | 128 | 16 | 68.24% |
| roberta-base | 3 | 5 | 5e-5 | 128 | 16 | 67.70% |
| roberta-base | 3 | 3 | 5e-5 | 192 | 16 | 67.20% |
| roberta-base | 4 | 3 | 5e-5 | 192 | 16 | 65.69% |
| roberta-base | 3 | 4 | 5e-5 | 192 | 16 | 68.67% |
| roberta-base | 3 | 5 | 5e-5 | 192 | 16 | 68.87% |
| roberta-base | 3 | 6 | 5e-5 | 192 | 16 | 65.92% |
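
For reference, the sweep behind this table is just a loop over the hyperparameter values listed above (in practice we ran only the combinations shown, not the full Cartesian product). A minimal sketch, with train_and_eval() as a hypothetical stand-in for the fine-tuning loop sketched in the RoBERTa section:

```python
import itertools

def train_and_eval(model_name, **config):
    """Hypothetical stand-in: fine-tune `model_name` with `config` and return dev accuracy."""
    return 0.0   # placeholder so the sweep skeleton runs end to end

grid = {
    "gradient_accumulation_steps": [1, 2, 3, 4, 5],
    "num_train_epochs": [3, 4, 5, 6],
    "learning_rate": [5e-5],
    "max_seq_length": [64, 128, 192],
    "per_gpu_batch_size": [16, 64],
}

best_config, best_acc = None, 0.0
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    acc = train_and_eval("roberta-base", **config)
    if acc > best_acc:
        best_config, best_acc = config, acc
print("best:", best_config, best_acc)
```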

Other Models:

Results:

| Model | Gradient Accumulation Steps | Epochs | Learning Rate | Max Sequence Length | Batch Size | Accuracy |
|---|---|---|---|---|---|---|
| bert-base-uncased | 1 | 3 | 5e-5 | 128 | 32 | 60.63% |
| bert-large-uncased | 2 | 3 | 5e-5 | 64 | 12 | 25.09% |

Running RoBERTa-Base Uncased (Semi-Supervised Approach):

Results:

| Model | Gradient Accumulation Steps | Epochs | Learning Rate | Max Sequence Length | Batch Size | Accuracy |
|---|---|---|---|---|---|---|
| roberta-base | 2 | 3 | 5e-5 | 80 | 16 | 62.479% |
| roberta-base | 3 | 4 | 5e-5 | 80 | 64 | 67.2% |
| roberta-base-uncased | 3 | 3 | 2e-5 | 100 | 8 | 61.13% |
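
A minimal sketch of the joint objective behind this approach: the supervised multiple-choice loss plus an auxiliary language-modelling loss, in the spirit of OpenAI's "Improving Language Understanding by Generative Pre-Training". The auxiliary objective is approximated here with RoBERTa's masked-LM head; the masking rate, lm_weight, and the toy batch are assumptions, not the exact Colab setup.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
mc_model = RobertaForMultipleChoice.from_pretrained("roberta-base")
lm_model = RobertaForMaskedLM.from_pretrained("roberta-base")
lm_model.roberta = mc_model.roberta        # share the encoder so both losses update the same weights

def masked_lm_batch(texts, mask_prob=0.15):
    """Randomly mask tokens and build MLM labels (-100 = ignored) for the auxiliary loss."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=100, return_tensors="pt")
    labels = enc["input_ids"].clone()
    mask = torch.bernoulli(torch.full(labels.shape, mask_prob)).bool() & enc["attention_mask"].bool()
    mask[:, 1] = True                       # always mask at least one position so the loss is defined
    labels[~mask] = -100
    enc["input_ids"][mask] = tokenizer.mask_token_id
    return enc, labels

# Toy supervised batch: 1 example, 4 choices, correct answer at index 0.
context_question = "Travis grilled burgers for the family. What will he probably do next?"
choices = ["Bring them to the table.", "Go straight to bed.", "Throw them away.", "Start the grill again."]
mc_enc = tokenizer([context_question] * 4, choices, padding=True, truncation=True,
                   max_length=100, return_tensors="pt")
mc_inputs = {k: v.unsqueeze(0) for k, v in mc_enc.items()}
mc_labels = torch.tensor([0])

lm_enc, lm_labels = masked_lm_batch([context_question])

mc_loss = mc_model(**mc_inputs, labels=mc_labels).loss      # supervised multiple-choice loss
lm_loss = lm_model(**lm_enc, labels=lm_labels).loss         # auxiliary (unsupervised) LM loss
lm_weight = 0.5                                             # assumed mixing coefficient
(mc_loss + lm_weight * lm_loss).backward()                  # one joint fine-tuning step
```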

Running External Knowledge Infusion (SocialIQA Dataset):

Results:

| Model | Gradient Accumulation Steps | Epochs | Learning Rate | Max Sequence Length | Batch Size | Accuracy |
|---|---|---|---|---|---|---|
| roberta-base + finetuned on SocialIQA | 3 | 6 | 5e-5 | 192 | 16 | 67.75% |
| roberta-base + finetuned on SocialIQA + COSMOS finetuned | 3 | 6 | 5e-5 | 192 | 16 | 66.2% |
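
A sketch of the two-stage recipe behind these rows: fine-tune roberta-base on SocialIQA first, save that checkpoint (roberta_social_iqa), then continue fine-tuning it on CosmosQA. The fine_tune() helper, the dataset labels, and the local path are hypothetical placeholders for the Colab training loop.

```python
from transformers import RobertaForMultipleChoice

def fine_tune(model, dataset_name, **hparams):
    """Hypothetical stand-in for one training run (see the RoBERTa multiple-choice sketch above)."""
    print(f"fine-tuning on {dataset_name} with {hparams}")   # placeholder only

# Stage 1: social commonsense from SocialIQA (3 answer choices per question).
model = RobertaForMultipleChoice.from_pretrained("roberta-base")
fine_tune(model, "social_iqa", num_train_epochs=6, learning_rate=5e-5)
model.save_pretrained("./roberta_social_iqa")                # the checkpoint referred to as roberta_social_iqa

# Stage 2: continue from that checkpoint on CosmosQA (4 answer choices per question).
# The multiple-choice head scores each choice independently, so the same model class
# handles a different number of choices without modification.
model = RobertaForMultipleChoice.from_pretrained("./roberta_social_iqa")
fine_tune(model, "cosmos_qa", num_train_epochs=6, learning_rate=5e-5,
          max_seq_length=192, gradient_accumulation_steps=3, per_gpu_batch_size=16)
```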

Running Context Summarization:

Results:

| Model | Gradient Accumulation Steps | Epochs | Learning Rate | Max Sequence Length | Batch Size | Accuracy |
|---|---|---|---|---|---|---|
| roberta-base + summarization + COSMOS | 3 | 5 | 5e-5 | 192 | 16 | 63.31% |
| roberta-base + context + summarization + COSMOS | 3 | 5 | 5e-5 | 192 | 16 | 67.57% |
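
A minimal sketch of the preprocessing behind these two rows (and the "replacing" vs. "appending" bullets in the status list): summarize the context extractively, then either substitute the summary for the context or append it before fine-tuning. The simple word-frequency scorer below, and the toy context, are stand-ins for the extractive summarizer actually used.

```python
import re
from collections import Counter

def extractive_summary(context, num_sentences=2):
    """Keep the `num_sentences` sentences with the highest content-word-frequency score."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    freq = Counter(re.findall(r"[a-z']+", context.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
                    reverse=True)
    keep = sorted(ranked[:num_sentences])                # preserve original sentence order
    return " ".join(sentences[i] for i in keep)

context = ("Travis spent the afternoon grilling burgers for the whole family. "
           "The weather was perfect and the kids played in the yard. "
           "When the food was ready, everyone gathered around the picnic table.")

summary = extractive_summary(context)
replaced_context = summary                               # "summarization + COSMOS" row: summary replaces the context
appended_context = context + " " + summary               # "context + summarization + COSMOS" row: summary is appended
```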