Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language. It's is a critical NLP problem and a longstanding artificial intelligence milestone. In this project we present the use of Splinter and SpanBERT models to showcase and try to solve the question-answering problem both in the closed domain and open domain as well, where the dataset used was (open domain) is SQuAD 2.0 and also a separate dataset (COVID dataset) has been generated by us for the splinter model (closed domain).
Splinter is a model that has been pre-trained for few-shot question answering in a self-supervised manner. This implies it was pre-trained on raw texts solely, with no human labelling (which is why it can use so much publicly available data), and then used an automatic method to build inputs and labels from those texts.
- A pretrained model for few-shot question answering.
- Can leverage recurring spans: n-grams, such as named entities, which tend to occur multiple times in each passage
- We emulate question answering by masking all but one instance of each recurring span with a special [QUESTION] token and asking the model to select the correct span for each such token.
- To select an answer span for each [QUESTION] token in parallel, we introduce a question-aware span selection (QASS) layer, which uses the [QUESTION] token’s representation to select the answer span.
- An Upgraded version of the BERT model.
- Unlike BERT in SpanBERT we mask random contiguous spans of text, rather than randomly mask tokens in a sequence.
- SpanBERT, the only thing the model is trained on is the Span Boundary Objective which later contributes to the loss function.
- SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 un-answerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.
-
Representation of the Closed Domain, preprocessed data (COVID data).
-
Training process representing the training & validation loss
In this project we tried to build a question-answering system with the acquired data using pre-processing methods and doing data acquisition for the dataset to work for these models Splinter and SpanBERT models. Our approach converts texts into a set of questions that need to be answered simultaneously.
Thank You :)