This project is my attempt to reproduce a research paper submitted on August 12, 2016 by a team of researchers from the Qatar Computing Research Institute: "Rapid Classification of Crisis-Related Data on Social Networks using Convolutional Neural Networks" by Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, Prasenjit Mitra.
The paper introduces neural-network-based methods for binary and multi-class tweet classification. It makes use of out-of-event data in the early hours of a disaster and reports better results with a CNN than with random forest (RF), logistic regression (LR), and SVM classifiers.
I am reproducing this paper as a learning exercise, to become more proficient with text mining techniques and to gain a deeper understanding of neural networks.
The techniques, models, and concepts used and explained in this project are:
- Preprocessing techniques for text data
- Tokenization
- Word vectors and embeddings
- word2vec
- Bag-of-words model
- Tagging
- TF-IDF
- PCA
- Convolutional Neural Network
- Recurrent Neural Network
- Clustering
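
To make the first few items in the list concrete, here is a minimal sketch of tokenization, a bag-of-words count, and TF-IDF weighting using only the Python standard library. The example tweets, the function names (`tokenize`, `bag_of_words`, `tf_idf`), and the plain `tf * log(N / df)` weighting are assumptions made for illustration; they are not the code used in this project or the preprocessing described in the paper.

```python
import math
import re
from collections import Counter

# Toy tweets invented for illustration; they are NOT taken from the CrisisNLP data.
tweets = [
    "Flood waters rising near the bridge, need help!",
    "Donate to flood relief here http://example.com",
    "Great concert last night",
]


def tokenize(text):
    """Lowercase the text, drop URLs, and split it into alphabetic tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z]+", text)


def bag_of_words(tokens):
    """Map each token to its raw count within one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        total = len(doc)
        weights.append(
            {tok: (c / total) * math.log(n_docs / df[tok]) for tok, c in counts.items()}
        )
    return weights


tokenized = [tokenize(t) for t in tweets]
weights = tf_idf(tokenized)
```

In this sketch, a word that appears in several tweets (such as "flood") gets a lower weight than a word that appears in only one, which is the basic idea behind TF-IDF. Library implementations usually add smoothing and normalization on top of this.
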
The raw dataset comes from CrisisNLP. A cleaned, concatenated version of the data for 9 major events is in the data folder.