NLP_DisasterTweets

Predict which Tweets are about real disasters and which ones are not


Natural Language Processing with Disaster Tweets



Abstract:

Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e., disaster relief organizations and news agencies). But, it’s not always clear whether a person’s words are actually announcing a disaster.
In this notebook, I've built a machine learning model that predicts which Tweets are about real disasters and which ones aren't.
I've also created two notebooks: one is fully commented, detailed, and easy to read and train; the other is a cleaner, functional implementation of the first.

Dataset:

The dataset used in this notebook is available on Kaggle. [link]

In this notebook I've performed the following steps:

1) EDA: a brief look at the dataset with some graphs

2) Clean data in two steps:

  • Remove duplicated tweets
  • Find near-duplicate tweets and drop them
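The two cleaning steps above can be sketched as follows. This is a minimal illustration assuming pandas and scikit-learn; the function name and the 0.9 cosine-similarity threshold for "near-duplicate" are illustrative, and the pairwise loop is quadratic, which is acceptable at this dataset's size:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def drop_duplicate_and_similar(df, text_col="text", threshold=0.9):
    """Drop exact duplicates, then near-duplicates by TF-IDF cosine similarity."""
    # Step 1: remove exact duplicates
    df = df.drop_duplicates(subset=text_col).reset_index(drop=True)
    # Step 2: flag pairs whose TF-IDF cosine similarity exceeds the threshold
    tfidf = TfidfVectorizer().fit_transform(df[text_col])
    sim = cosine_similarity(tfidf)
    to_drop = set()
    n = len(df)
    for i in range(n):
        if i in to_drop:
            continue
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                to_drop.add(j)  # keep the first occurrence, drop the rest
    return df.drop(index=list(to_drop)).reset_index(drop=True)
```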

3) Extract some features such as:

  • Length of tweets
  • Count of words in tweets
  • Count of numbers in tweets
  • Count of sentences in tweets
  • Count of hashtags in tweets
  • Text of hashtags
  • Count of mentions in tweets
  • Text of Mentions
  • Count of links in tweets
  • Word count per tweet length
  • Punctuation count per tweet length
  • Uppercase letters count per tweet length
  • MinMaxScaling for numeric columns
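A sketch of the feature extraction above, assuming pandas and scikit-learn. Column names and regexes are illustrative rather than the notebook's exact ones; note the ratio features are computed before the min-max scaling so they use raw lengths:

```python
import re
import string

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Numeric columns to min-max scale at the end (illustrative names)
NUMERIC_COLS = [
    "length", "word_count", "number_count", "sentence_count",
    "hashtag_count", "mention_count", "link_count",
    "words_per_length", "punct_per_length", "upper_per_length",
]

def add_features(df, text_col="text"):
    """Add count/ratio features to a copy of df, then scale the numeric ones."""
    df = df.copy()
    t = df[text_col]
    df["length"] = t.str.len()
    df["word_count"] = t.str.split().str.len()
    df["number_count"] = t.str.count(r"\d+")
    df["sentence_count"] = t.str.count(r"[.!?]+").clip(lower=1)
    df["hashtags"] = t.apply(lambda s: " ".join(re.findall(r"#(\w+)", s)))
    df["hashtag_count"] = t.str.count(r"#\w+")
    df["mentions"] = t.apply(lambda s: " ".join(re.findall(r"@(\w+)", s)))
    df["mention_count"] = t.str.count(r"@\w+")
    df["link_count"] = t.str.count(r"https?://\S+")
    # Ratio features, computed on the unscaled length
    df["words_per_length"] = df["word_count"] / df["length"]
    df["punct_per_length"] = t.apply(
        lambda s: sum(c in string.punctuation for c in s)) / df["length"]
    df["upper_per_length"] = t.apply(
        lambda s: sum(c.isupper() for c in s)) / df["length"]
    # Min-max scale all numeric columns into [0, 1]
    df[NUMERIC_COLS] = MinMaxScaler().fit_transform(df[NUMERIC_COLS])
    return df
```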

4) Process tweets:

  • Lowercase tweets
  • Remove URLs
  • Remove punctuation
  • Remove short words (≤2 characters)
  • Remove stopwords
  • Lemmatization
  • One-hot encode the keyword column (get_dummies)
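A minimal, self-contained sketch of these processing steps. A tiny illustrative stopword set stands in for NLTK's full list, and lemmatization (typically NLTK's WordNetLemmatizer, which needs a corpus download) is omitted here to keep the sketch runnable:

```python
import re
import string

# Tiny illustrative stopword set; the notebook would use NLTK's full list
STOPWORDS = {"the", "a", "an", "is", "are", "in", "on", "at", "and", "or", "to", "of"}

def clean_tweet(text):
    """Lowercase, strip URLs/punctuation, drop short words and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)              # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if len(w) > 2]                # drop words <= 2 chars
    words = [w for w in words if w not in STOPWORDS]               # drop stopwords
    return " ".join(words)

# One-hot encoding the keyword column would look like:
# keyword_dummies = pd.get_dummies(df["keyword"], prefix="kw")
```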

5) TF-IDF:

  • TF-IDF on tweets
  • TF-IDF on text of hashtags
  • TF-IDF on text of mentions
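The three TF-IDF blocks above can be built with separate vectorizers and stacked horizontally into one feature matrix; the `max_features` values below are illustrative, and the column names assume the feature-extraction step has produced cleaned-text, hashtag, and mention columns:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_features(df):
    """Vectorize tweets, hashtag text, and mention text, then stack them."""
    vec_text = TfidfVectorizer(max_features=5000)
    vec_tags = TfidfVectorizer(max_features=500)
    vec_ment = TfidfVectorizer(max_features=500)
    X_text = vec_text.fit_transform(df["clean_text"])
    X_tags = vec_tags.fit_transform(df["hashtags"])
    X_ment = vec_ment.fit_transform(df["mentions"])
    # Concatenate the three sparse blocks column-wise
    return hstack([X_text, X_tags, X_ment])
```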

6) Train and test models:

  • GradientBoostingClassifier
  • NaiveBayes
  • LogisticRegression
  • SVM
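The four models above might be trained and compared like this; a sketch assuming a non-negative feature matrix `X` (MultinomialNB requires it) and binary labels `y`, with `LinearSVC` standing in for SVM and F1 as the competition metric:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def compare_models(X, y):
    """Fit each candidate model on a holdout split and report F1 scores."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    models = {
        "GradientBoosting": GradientBoostingClassifier(),
        "NaiveBayes": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "SVM": LinearSVC(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = f1_score(y_te, model.predict(X_te))
    return scores
```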

With this approach I've achieved a score of approximately 0.8 on the main test set on the Kaggle site.