NLP (Natural Language Processing)

Data set: https://www.kaggle.com/datasets/kazanova/sentiment140?select=training.1600000.processed.noemoticon.csv

Understanding Data Set


The data set has 6 columns:
  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: the id of the tweet (2087)
  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • flag: The query (lyx). If there is no query, then this value is NO_QUERY.
  • user: the user that tweeted (robotickilldozr)
  • text: the text of the tweet (Lyx is cool)

    Rename the data set to: twitter16m.csv
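
A minimal sketch of loading the renamed file, using only the Python standard library. It assumes the file has no header row and is Latin-1 encoded, and that the sixth column is called ids (taken from the public description of Sentiment140, not from the list above). The helper name load_tweets and the two in-memory sample rows are made up for illustration.

```python
import csv
import io

# Column order assumed for the header-less file; "ids" is an assumption.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

def load_tweets(handle, limit=None):
    """Read header-less CSV rows into dicts keyed by COLUMNS (hypothetical helper)."""
    rows = []
    for i, record in enumerate(csv.reader(handle)):
        if limit is not None and i >= limit:
            break
        row = dict(zip(COLUMNS, record))
        row["target"] = int(row["target"])  # polarity as an integer: 0, 2 or 4
        rows.append(row)
    return rows

# Tiny made-up stand-in for twitter16m.csv so the sketch runs without the download.
SAMPLE = (
    '"0","1","Sat May 16 23:58:44 UTC 2009","NO_QUERY","robotickilldozr","Lyx is cool"\n'
    '"4","2","Sat May 16 23:59:01 UTC 2009","NO_QUERY","someuser","I love NLP"\n'
)
tweets = load_tweets(io.StringIO(SAMPLE))

# For the real file (assumption: Latin-1 encoding, no header row):
# with open("twitter16m.csv", encoding="latin-1", newline="") as f:
#     tweets = load_tweets(f, limit=1000)
```

The limit argument is there because the full file is large; reading a slice first is usually enough to check the column layout.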

    Library:

  • SpaCy

    What will be covered

    General Feature Extraction

  • File loading
  • Word counts
  • Characters count
  • Average chars per word
  • Stop words count
  • Count #hashtags and @mentions
  • If numeric digits are present in tweets
  • Upper case word counts
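
The general features listed above could be computed roughly as follows. This is only a sketch: it splits on whitespace, the stop-word set is a tiny illustrative one (a library such as SpaCy ships a fuller list), and the function and key names are hypothetical.

```python
# Small illustrative stop-word set (assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "i", "it"}

def general_features(text):
    """Return simple count-based features for one tweet."""
    words = text.split()  # naive whitespace tokenization
    word_count = len(words)
    char_count = sum(len(w) for w in words)  # characters, whitespace excluded
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_chars_per_word": char_count / word_count if word_count else 0.0,
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        "hashtag_count": sum(w.startswith("#") for w in words),
        "mention_count": sum(w.startswith("@") for w in words),
        "numeric_token_count": sum(w.isdigit() for w in words),
        # Tokens written fully in upper case; single letters such as "I" are skipped.
        "upper_word_count": sum(w.isupper() and len(w) > 1 for w in words),
    }

features = general_features("@bob I LOVE the new #NLP course, 10 out of 10")
```

Punctuation stays attached to tokens here ("course," counts as one word), so these counts are best taken before the cleaning steps remove hashtags, mentions and casing.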

    Processing and Cleaning

  • Lower Case
  • Contraction to Expansion
  • Email removal and counts
  • Removal of RT
  • Removal of special characters
  • Removal of multiple spaces
  • Removal of HTML tags
  • Removal of accented chars
  • Removal of stop words
  • Conversion into base form of words
  • Common occurring words removal
  • Rare occurring words removal
  • Word Cloud
  • Spelling Correction
  • Tokenization
  • Lemmatization
  • Detecting Entities Using NER
  • Noun Detection
  • Language Detection
  • Sentence Translation
  • Using Inbuilt Sentiment Classifier
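
Several of the cleaning steps above (lower casing, contraction expansion, email, RT and HTML tag removal, accent stripping, special characters, multiple spaces) can be chained with the standard library alone. The sketch below is one possible ordering, not a fixed recipe; the contraction map is a tiny made-up sample and clean_tweet is a hypothetical name. Steps such as lemmatization, NER or spelling correction need a library like SpaCy and are not shown.

```python
import re
import unicodedata

# Tiny illustrative contraction map (assumption); real projects use a much larger one.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}

EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")

def clean_tweet(text):
    """Apply a basic cleaning chain to one tweet."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # HTML tags
    text = EMAIL_RE.sub(" ", text)         # email addresses
    text = re.sub(r"\brt\b", " ", text)    # retweet marker
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)   # contraction to expansion
    # Accented chars: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)   # special characters
    return re.sub(r"\s+", " ", text).strip()  # multiple spaces

cleaned = clean_tweet("RT <b>I'm</b> at the Café, mail me: joe@example.com   it's GREAT!!")
```

Order matters: emails are removed before special characters (otherwise the "@" and "." would vanish first and the address would survive as plain words), and contractions are expanded before apostrophes are stripped.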

    Advanced Text Processing and Feature Extraction

  • N-Gram, Bi-Gram etc
  • Bag of Words (BoW)
  • Term Frequency (TF) Calculation
  • Inverse Document Frequency (IDF)
  • TF-IDF: Term Frequency-Inverse Document Frequency
  • Word Embedding Word2Vec using SpaCy
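
To make the n-gram, Bag of Words, TF, IDF and TF-IDF items concrete, here is a small from-scratch sketch. It uses the plain textbook definitions, tf = count / document length and idf = log(N / df); libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ. The three documents and the function names are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples (n=2 gives bi-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Compute unsmoothed TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag of words for this document
        total = len(tokens)
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights, idf

docs = ["lyx is cool", "lyx is slow", "nlp is fun"]
bigrams = ngrams(docs[0].split(), 2)
weights, idf = tf_idf(docs)
```

A term that occurs in every document ("is" here) gets idf = log(1) = 0, so its TF-IDF weight vanishes, while a term unique to one document ("cool") gets the largest weight.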

    Machine Learning Models for Text Classification

  • SGDClassifier
  • LogisticRegression
  • LogisticRegressionCV
  • LinearSVC
  • RandomForestClassifier
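
A rough sketch of how some of these scikit-learn classifiers might be trained on TF-IDF features. It assumes scikit-learn is installed; the eight toy sentences and their labels (0 = negative, 4 = positive) are made up and far too few for meaningful accuracy, so this only shows the shape of the pipeline. LogisticRegressionCV is left out because its internal cross-validation needs more samples than this toy set provides.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data; in practice use the cleaned "text" and "target" columns.
texts = [
    "i love this", "this is great", "what a wonderful day", "so happy today",
    "i hate this", "this is awful", "what a terrible day", "so sad today",
]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

models = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=20, random_state=0),
}

predictions = {}
for name, clf in models.items():
    # Each pipeline vectorizes the raw text, then fits the classifier.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(["i love this day"])[0]
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary learned from the training texts tied to the model, so new tweets are transformed the same way at prediction time.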