Problem Statement: https://www.kaggle.com/competitions/jargon-detection/overview
This project aims to identify domain-specific technical terminology (jargon) in scientific research papers, i.e. specific words or commonly used terms that carry a particular meaning within a particular field. The goal is to tag the main nouns of nominal phrases, and the task is modelled as sentence-level sequence labelling. The project provides manually labelled training and development datasets from three distinct scientific domains: Computer Science, Economics, and Physics. Each domain has its own train/dev/test split, comprising more than 7,000 Computer Science sentences, 6,000 Economics sentences, and 8,000 Physics sentences. Evaluation is based on precision, recall, and F1 scores on the hidden test set, computed over the term predictions in each sentence. Finally, the Kaggle score estimates model performance on the hidden private portion of the test set.
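For reference, precision, recall, and F1 are the standard quantities
\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R},
\]
where $TP$, $FP$, and $FN$ count correct, spurious, and missed term predictions respectively; the exact counting unit (token-level or term-level) follows the competition's definition.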
Jargon detection is the detection of specialised terminology used by a particular professional community; such words may not be used regularly by people in other domains. It is a sequence labelling task: we assign to each word $x_i$ in an input word sequence a label $y_i$, so that the output sequence $Y$ has the same length as the input sequence $X$. Jargon detection is important because it enables us to identify the context-sensitive terminology used in a specific domain.
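For illustration, the sketch below shows the labelling scheme on a toy sentence; the sentence and its tags are invented for exposition and are not drawn from the competition data.

\begin{verbatim}
# Illustrative only: the sentence and its gold tags are invented, not
# taken from the dataset. Each token x_i receives a label y_i
# ('TERM' for the main noun of a jargon phrase, 'O' otherwise).
tokens = ["We", "train", "a", "conditional", "random", "field", "on",
          "annotated", "sentences", "."]
labels = ["O",  "O",     "O", "O",           "O",      "TERM",  "O",
          "O",          "O",         "O"]

assert len(tokens) == len(labels)   # |Y| == |X|
for x_i, y_i in zip(tokens, labels):
    print(f"{x_i}\t{y_i}")
\end{verbatim}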
Token classification, of which part-of-speech (POS) tagging is the canonical example, is a fundamental problem in natural language processing (NLP) that involves assigning a category or label to each word in a sentence. The task has been widely studied in the literature, and various machine learning techniques have been applied to improve its accuracy, such as rule-based systems, Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and neural network-based models. The availability of large annotated corpora has also contributed to the development of more accurate and robust POS tagging models. In recent years, deep learning models, particularly those based on Recurrent Neural Networks (RNNs)\cite{b0}, have shown remarkable success in POS tagging.
The use of technical terminology in scientific research papers can be a major obstacle for non-experts: such terms have specific meanings that may not be understood by readers outside the field, and common words may be used in senses different from the everyday ones. This creates a significant entry barrier to reading scholarly writing and can lead to misunderstandings or prevent readers from following a paper altogether. Identifying and recognising technical terms in scientific research papers is therefore essential, as it can support scientific-document reading systems that help readers better comprehend scholarly writing, reducing barriers to entry and improving the reader's comprehension.
The training data consists of 574,910 token entries in total, of which 520,286 carry the label 'O' and 54,624 carry the label 'TERM'. The validation data has 37,143 entries. The test data is unlabelled and has 42,358 entries.
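As a sanity check, these counts can be recomputed directly from the training file. The sketch below assumes a two-column, CoNLL-style layout (token and label per line, blank lines between sentences) and the file name train.txt; both the layout and the file name are assumptions, not guaranteed by the competition.

\begin{verbatim}
from collections import Counter

# Assumption: CoNLL-style file with "token<TAB>label" per line and blank
# lines separating sentences; the file name "train.txt" is hypothetical.
label_counts = Counter()
num_sentences = 0
in_sentence = False
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:                      # a blank line ends a sentence
            num_sentences += in_sentence
            in_sentence = False
            continue
        token, label = line.split("\t")
        label_counts[label] += 1
        in_sentence = True
num_sentences += in_sentence              # file may not end with a blank line

print(num_sentences, label_counts)        # expect 'O' and 'TERM' counts as above
\end{verbatim}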
In this project, we explored several approaches to identifying domain-specific technical terminology in scientific research papers. We started with a simple CRF model using hand-crafted features and achieved moderate performance. Next, we combined ELMo embeddings for contextual representation with Bi-LSTMs, which significantly improved performance on all evaluation metrics. We also fine-tuned the pre-trained BERTForTokenClassification model on our dataset, which performed comparably to the ELMo and Bi-LSTM model. Our findings suggest that the ELMo-based model outperformed the simple CRF model and performed comparably to the BERT-based model. Overall, our experiments highlight the importance of contextual representation in NLP tasks and the value of advanced models such as ELMo and BERT for achieving state-of-the-art results. Finally, we built a hybrid model (ELMo + Bi-LSTM + CRF), which gave the highest performance. We continue to explore hyperparameter tuning and combinations of model architectures to improve performance further.
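To make the CRF baseline concrete, the sketch below outlines hand-crafted features and CRF training of the kind described above; the specific feature set, hyperparameters, library (sklearn_crfsuite), and variable names are illustrative choices, not necessarily those of the final implementation.

\begin{verbatim}
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word2features(sent, i):
    # Illustrative feature template for token i of a tokenised sentence.
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# train_sents/dev_sents are lists of token sequences, and train_labels/
# dev_labels the corresponding 'O'/'TERM' sequences (hypothetical names).
X_train = [sent2features(s) for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, train_labels)

# Precision/recall/F1 on the development split, reported for 'TERM' only.
y_pred = crf.predict([sent2features(s) for s in dev_sents])
print(metrics.flat_classification_report(dev_labels, y_pred, labels=["TERM"]))
\end{verbatim}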