Authorship attribution is a classification task concerned with detecting the author of a piece of writing, from a set of probable authors, by analyzing its stylometric and linguistic features. Anonymity is very common in recent times; with the widespread use of the internet, the use and misuse of anonymity has become an important factor to consider. The plethora of anonymous digital footprints makes authorship attribution indispensable in various fields. In this work, we perform transfer learning for authorship attribution based on a language modeling objective. Unsupervised training of a language model teaches the model the workings and structure of the Bangla language; this is followed by authorship-attribution-specific fine-tuning and classification. The effects of various tokenization schemes on this model are analyzed in terms of performance. The results demonstrate a clear superiority of the transfer-learning-based approach over traditional models.
- In this paper and repository, we introduce the largest and most varied dataset for AABL, with long text samples from 16 authors in an imbalanced distribution that imitates real-world scenarios more closely.
- We present an intuitively simple but computationally effective transfer learning approach that is pre-trained on a large corpus and fine-tuned on the target dataset, after which the classifier is trained with labeled data for Authorship Attribution in Bangla Literature (AABL). To the best of our knowledge, no prior work has leveraged the power of transfer learning for AABL, which can immensely reduce manual labor and greatly enhance model re-usability. Experimental results show that the proposed model considerably outperforms the existing models and achieves state-of-the-art performance, efficiently addressing the limitations of previous works.
- The various language models trained in this work can be used as pre-trained models for many downstream tasks in the Bangla language. All of these pre-trained models, along with the code and dataset of this paper, have been released for public use.
All models, along with the datasets, have been released in this repository.
Our academic paper on Authorship Attribution in the Bangla language can be found here.
Authorship Attribution is the task of creating an appropriate characterization of texts that captures the authors' writing style in order to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly crucial in various fields of security and plagiarism detection. Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field due to its complex linguistic features and sentence structure. Moreover, existing systems do not scale with an increasing number of authors, and their performance drops with a small number of samples per author. In this paper, we propose the use of the Average-Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problems of complex linguistic feature extraction and scalability for Authorship Attribution in Bangla Literature (AABL). We analyze the effect of different tokenization schemes, namely word, sub-word, and character-level tokenization, and demonstrate their effectiveness in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16), containing 17,966 sample texts and over 13.4 million words, to address the scarcity of standard datasets, and we release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used our BAAD16 dataset as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy on the BAAD16 dataset. Furthermore, we showed that the proposed system scales much better with an increasing number of authors, and its performance remains steady even with a small number of samples.
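The pre-train, fine-tune, classify pipeline described above follows the familiar ULMFiT-style recipe around AWD-LSTM. Below is a minimal sketch of the stages using fastai's AWD-LSTM implementation; the use of fastai, the DataFrame layout (`text` and `author` columns), the file paths, and all hyperparameters are illustrative assumptions, not the repository's exact training scripts.

```python
from fastai.text.all import (
    TextDataLoaders, language_model_learner, text_classifier_learner,
    AWD_LSTM, accuracy,
)
import pandas as pd

# Illustrative data: one row per text sample, labeled by author.
df = pd.read_csv('baad16.csv')  # hypothetical path; 'text'/'author' columns assumed

# Stages 1-2: train a Bangla language model and fine-tune it on the target
# corpus (pretrained=False because fastai ships English Wikipedia weights).
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm = language_model_learner(dls_lm, AWD_LSTM, pretrained=False)
lm.fit_one_cycle(10, 2e-3)          # language modeling objective
lm.save_encoder('bangla_encoder')   # keep the encoder for classification

# Stage 3: train the authorship classifier on top of the fine-tuned encoder.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='author',
                                  text_vocab=dls_lm.vocab)
clf = text_classifier_learner(dls_clf, AWD_LSTM, pretrained=False,
                              metrics=accuracy)
clf.load_encoder('bangla_encoder')
clf.fine_tune(4, 1e-2)              # freeze head first, then train the full model
```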
This repository is structured in the following way:
- Model: Contains all the model training and testing code for the proposed method of the paper, further subdivided into folders for the three tokenization types.
- Character-level starter code: character news, character wiki
- Subword-level starter code: subword news, subword wiki
- Word-level starter code: word_news, word_wiki
- Other Models: Contains tests on previously published models for comparison.
- News dataset: A corpus of Bangla newspaper articles, spanning 12 different topics, created using a custom web crawler. Paper reference
- Wikipedia corpus: The Bangla Wikipedia dump collected on 10th June, 2019. Cleaned and arranged into samples in CSV format.
- BAAD16: The published dataset with sample Bangla texts from 16 authors, imbalanced to closely mimic real-world scenarios and test model robustness. Each text is equally partitioned into documents of 750 words (see the partitioning sketch after this list).
- BAAD6: A dataset consisting of text samples from 6 authors with 350 samples per author. Paper reference.
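Regarding the 750-word partitioning of BAAD16 mentioned above, here is a minimal sketch of how equal-sized samples can be produced from a long document; the function name and the choice to drop the trailing fragment are assumptions for illustration, not necessarily the repository's exact preprocessing.

```python
def partition_words(text, size=750):
    """Split a document into consecutive chunks of `size` words.

    A trailing fragment shorter than `size` is dropped so that every
    sample has exactly the same length (an assumption for illustration).
    """
    words = text.split()
    return [' '.join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]
```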
All trained model checkpoints are provided in the folder Model Checkpoints. The language models have been trained on the Bangla news corpus and Wikipedia dumps. Further, there are three variations of each, tokenized at the word, subword, and character level, making a total of six pre-trained language models.
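To make the three tokenization levels concrete, the sketch below produces word, subword, and character tokens for a sentence, using sentencepiece for the subword case; the corpus path, vocabulary size, and model type are illustrative assumptions and may differ from the settings used in the paper.

```python
import sentencepiece as spm

# Train a subword tokenizer on the raw corpus (path/vocab size are illustrative).
spm.SentencePieceTrainer.train(
    input='bangla_corpus.txt', model_prefix='bn_subword',
    vocab_size=30000, model_type='unigram', character_coverage=0.9995,
)
sp = spm.SentencePieceProcessor(model_file='bn_subword.model')

text = 'একটি বাংলা বাক্য'                        # "a Bangla sentence"
word_tokens = text.split()                      # word-level: split on whitespace
subword_tokens = sp.encode(text, out_type=str)  # subword-level: learned pieces
char_tokens = list(text.replace(' ', ''))       # character-level: individual characters
```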
To reproduce the pre-training, fine-tuning, or classification of these models:
- Clone the repository
- Download relevant files/checkpoints from this drive link and put them inside the Model Checkpoints folder. Maintain the folder structure as in the drive folder.
- Make sure you have Python 3.x
- Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
- Run the python notebooks (extra dependencies are installed in the notebook)
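Once a classifier checkpoint is available, inference follows the usual fastai pattern. A minimal sketch, assuming the checkpoint was exported with `learn.export()`; the file name below is illustrative, not the actual name in the drive folder:

```python
from fastai.text.all import load_learner

# Path is illustrative -- point this at the checkpoint you downloaded.
learn = load_learner('Model Checkpoints/baad16_word_classifier.pkl')

# Predict the most likely author for a piece of Bangla text.
author, author_idx, probs = learn.predict('... sample Bangla text ...')
print(author, float(probs[author_idx]))
```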
- Assessing application of pure-Bangla transformer-based pre-trained models.
- CNN-architecture-based transfer learning for authorship attribution.
- Cross-lingual authorship attribution.