This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.
It contains the following 6 fields:
1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive).
2. ids: the ID of the tweet.
3. date: the date of the tweet.
4. flag: the query used to collect the tweet; NO_QUERY if there was none.
5. user: the user that tweeted.
6. text: the text of the tweet.
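Since the fields arrive as plain CSV columns, the file can be read with the standard library alone. The sketch below uses two made-up rows in place of the real file (the row values are illustrative assumptions, not actual dataset records); the real file would be read the same way by passing an open file object to `csv.reader`.

```python
import csv
import io

# Field names in the order listed above; the file itself has no header row.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# Two made-up rows standing in for the real file (illustrative only).
sample = io.StringIO(
    "0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,someuser,this is awful\n"
    "4,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,otheruser,this is great\n"
)

# Pair each CSV column with its field name.
rows = [dict(zip(FIELDS, row)) for row in csv.reader(sample)]
for r in rows:
    print(r["target"], r["user"], r["text"])
```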
Text preprocessing is traditionally an important step in Natural Language Processing (NLP) tasks.
It transforms raw text into a more digestible form so that machine learning algorithms can perform better.
The preprocessing steps taken are:
1. Lower Casing: each text is converted to lowercase.
2. Replacing URLs: links starting with "http", "https", or "www" are replaced by "URL".
3. Replacing Emojis: emojis are replaced using a pre-defined dictionary mapping each emoji to its meaning. (e.g. ":)" to "EMOJIsmile")
4. Replacing Usernames: "@Usernames" are replaced with the word "USER". (e.g. "@Kaggle" to "USER")
5. Removing Non-Alphanumerics: characters other than digits and letters are replaced with a space.
6. Removing Consecutive Letters: runs of 3 or more identical letters are shortened to 2. (e.g. "Heyyyy" to "Heyy")
7. Removing Short Words: words with length less than 2 are removed.
8. Removing Stopwords: stopwords are English words that add little meaning to a sentence and can be dropped without changing its meaning. (e.g. "the", "he", "have")
9. Lemmatizing: lemmatization converts a word to its base form. (e.g. "running" to "run")
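The steps above can be sketched as a single pipeline. This is a minimal illustration, not the dataset's actual preprocessing code: the emoji map, stopword set, and lemma dictionary below are tiny stand-ins (a real run would use a full emoji dictionary, NLTK's stopword list, and a proper lemmatizer such as WordNet's).

```python
import re

# Tiny illustrative stand-ins for the real resources (assumptions).
EMOJIS = {":)": "EMOJIsmile", ":(": "EMOJIsad"}
STOPWORDS = {"the", "he", "have", "is", "a", "an", "and", "to"}
LEMMAS = {"running": "run", "tweets": "tweet"}

def preprocess(text):
    text = text.lower()                                      # 1. lower casing
    text = re.sub(r"(https?://\S+|www\.\S+)", "URL", text)   # 2. replace URLs
    for emoji, meaning in EMOJIS.items():                    # 3. replace emojis
        text = text.replace(emoji, meaning)
    text = re.sub(r"@\S+", "USER", text)                     # 4. replace usernames
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)                # 5. drop non-alphanumerics
    text = re.sub(r"(.)\1\1+", r"\1\1", text)                # 6. 3+ repeats -> 2
    words = [w for w in text.split() if len(w) >= 2]         # 7. drop short words
    words = [w for w in words if w not in STOPWORDS]         # 8. drop stopwords
    words = [LEMMAS.get(w, w) for w in words]                # 9. lemmatize
    return " ".join(words)

print(preprocess("@Kaggle Heyyyy :) he is running to http://example.com"))
# -> USER heyy EMOJIsmile run URL
```

Note that the order matters: URLs and usernames must be replaced before step 5 strips punctuation, or the "http://" and "@" markers the regexes rely on would already be gone.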