NLP (Natural Language Processing)

Dataset: https://www.kaggle.com/datasets/kazanova/sentiment140?select=training.1600000.processed.noemoticon.csv

Understanding the Dataset

The dataset has 6 columns:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: the ID of the tweet (2087)

date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

flag: The query (lyx). If there is no query, then this value is NO_QUERY.

user: the user that tweeted (robotickilldozr)

text: the text of the tweet (Lyx is cool)

Rename the dataset file to: twitter16m.csv
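
A minimal sketch of loading the renamed file with the standard library. It assumes the CSV has no header row and is Latin-1 encoded; the helper name `load_tweets` is hypothetical, and the `ids` column name is taken from the Kaggle dataset description.

```python
import csv
from itertools import islice

# Column names for the six fields; the raw CSV is assumed to have no header row.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]


def load_tweets(path, limit=None, encoding="latin-1"):
    """Read up to `limit` rows (all rows if None) into dicts keyed by COLUMNS."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle, fieldnames=COLUMNS)
        return [dict(row) for row in islice(reader, limit)]
```

For a quick look at the data, something like `load_tweets("twitter16m.csv", limit=5)` should be enough; all values come back as strings, so `target` would still need converting to an integer.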

Library:

spaCy

What you will learn

General Feature Extraction

File loading

Word counts

Characters count

Average characters per word

Stop words count

Count of #hashtags and @mentions

Whether numeric digits are present in tweets

Upper case word counts
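
The features listed above can be sketched in plain Python roughly as follows. The function name `extract_features` and the tiny `STOP_WORDS` set are assumptions for illustration; a real run would likely use a fuller stop-word list, such as the one shipped with spaCy.

```python
import re

# Tiny illustrative stop-word set (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "the", "to"}


def extract_features(text):
    """Return simple surface counts for one tweet."""
    words = text.split()
    char_count = sum(len(word) for word in words)  # characters, excluding whitespace
    return {
        "word_count": len(words),
        "char_count": char_count,
        "avg_word_len": char_count / len(words) if words else 0.0,
        "stop_word_count": sum(word.lower() in STOP_WORDS for word in words),
        "hashtag_count": sum(word.startswith("#") for word in words),
        "mention_count": sum(word.startswith("@") for word in words),
        "has_digits": bool(re.search(r"\d", text)),
        # Single letters such as "I" are skipped so they do not count as shouting.
        "upper_word_count": sum(word.isupper() and len(word) > 1 for word in words),
    }
```

Each tweet yields one dictionary, so applying the function to the `text` column gives one new feature column per key.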

Processing and Cleaning

Lower Case

Contraction to Expansion

Email removal and counts

Removal of RT

Removal of special characters

Removal of multiple spaces

Removal of HTML tags

Removal of accented chars

Removal of stop words

Conversion of words into their base form

Common Occurring Words Removal

Rare Occurring Words Removal

Word Cloud

Spelling Correction

Tokenization

Lemmatization

Detecting Entities Using NER

Noun Detection

Language Detection

Sentence Translation

Using Inbuilt Sentiment Classifier
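
Several of the cleaning steps above need only the standard library. The sketch below shows one possible ordering; the function name `clean_tweet` and the small `CONTRACTIONS` map are assumptions for illustration. Steps that rely on spaCy (stop words, lemmatization, NER) are left out.

```python
import html
import re
import unicodedata

# Small illustrative contraction map (an assumption); real tables are much larger.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "i'm": "i am", "it's": "it is"}


def clean_tweet(text):
    """Apply several of the listed cleaning steps, in one possible order."""
    text = html.unescape(text).lower()                      # decode entities, lower case
    text = re.sub(r"<[^>]+>", " ", text)                    # HTML tags
    text = re.sub(r"\S+@\S+\.\S+", " ", text)               # email addresses
    text = re.sub(r"\brt\b", " ", text)                     # retweet marker
    for short, full in CONTRACTIONS.items():                # contraction to expansion
        text = text.replace(short, full)
    text = unicodedata.normalize("NFKD", text)              # split accents from letters
    text = text.encode("ascii", "ignore").decode("ascii")   # drop the accents
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # special characters
    return re.sub(r"\s+", " ", text).strip()                # multiple spaces
```

The order matters: contractions are expanded before special characters are stripped, because the apostrophe would otherwise be gone by the time the lookup runs.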

Advanced Text Processing and Feature Extraction

N-Gram, Bi-Gram, etc.

Bag of Words (BoW)

Term Frequency Calculation TF

Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency)

Word Embedding (Word2Vec) using spaCy
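
To make the n-gram, bag-of-words, and TF-IDF items concrete, here is a minimal standard-library sketch. It assumes whitespace tokenization and one common textbook variant, TF = count / document length and IDF = log(N / document frequency); libraries such as scikit-learn use smoothed versions, so their numbers will differ. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Compute per-document TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    bags = [Counter(tokens) for tokens in tokenized]           # bag of words per document
    doc_freq = Counter(term for bag in bags for term in bag)   # documents containing each term
    n_docs = len(documents)
    weights = []
    for tokens, bag in zip(tokenized, bags):
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights
```

With this unsmoothed variant, a term that appears in every document gets weight 0, which is the intended effect: such a term does not help tell documents apart.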

Machine Learning Models for Text Classification

SGDClassifier

LogisticRegression

LogisticRegressionCV

LinearSVC

RandomForestClassifier
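
A rough sketch of how three of these scikit-learn models could be trained on TF-IDF features. The toy sentences and labels are invented for illustration and are not from the dataset; the real workflow would use the tweet text and `target` columns with a proper train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only; labels use 0 = negative, 4 = positive.
texts = ["i love this", "what a great day", "so happy today", "really good movie",
         "i hate this", "what a terrible day", "so sad today", "really bad movie"]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

models = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

predictions = {}
for name, model in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)  # vectorize, then classify
    pipeline.fit(texts, labels)
    predictions[name] = pipeline.predict(["great movie", "terrible movie"]).tolist()
```

The other estimators in the list can be swapped into `models` the same way, although cross-validated ones such as LogisticRegressionCV need more samples per class than this toy set provides.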

jun-aldi/Complete-Text-Processing