
Feature engineering for text data using Hugging Face Transformers, TensorFlow, Keras, TextBlob, NLTK, scikit-learn, and other libraries.

Primary Language: Jupyter Notebook

Feature-Engineering-NLP

The full project, including the profiling report and interactive visualizations, is available on Kaggle: https://www.kaggle.com/breenda/feature-engineering/data
Textual data is plentiful, but it is unorganised and messy in its raw form. Feature engineering is therefore an important step before training a model to make predictions from that data.
In my notebook, I explore a number of preprocessing and feature engineering techniques to gain insights from a collection of fake and real news articles.


The goal of this project is to explore ways in which raw textual data can be organised to yield useful insights, using a few of the most common feature engineering techniques.

Dataset:

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

Libraries used:

  • NLTK
  • TextBlob
  • Keras
  • TensorFlow
  • Hugging Face Transformers
  • Regex
  • Pandas
  • scikit-learn

Features explored:

  • Computed sentiment subjectivity and polarity
  • Removed stopwords and punctuation marks, and lowercased the text
  • Generated lemmatized and stemmed versions of the news title and body
  • Generated a vocabulary dictionary for the news headlines and plotted a word cloud
  • Generated word count, character count and average length of phrases used in the news body
  • Generated bigrams, trigrams and a TF-IDF matrix of the news body content
  • Generated word vector embeddings from the pre-trained GloVe model
  • Generated encoded tokens with the help of BertWordPieceTokenizer
  • Ran sentiment analysis, named entity recognition and summarization on the news body
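
To give a feel for a few of the simpler features listed above, here is a minimal sketch of text cleaning, count features and n-grams. It is not the notebook's code: it uses only the Python standard library so it runs without extra installs, whereas the notebook relies on NLTK and scikit-learn for these steps. The stopword list, the function names (`clean_text`, `count_features`, `ngrams`) and the reading of "average length" as average word length are all assumptions made for illustration.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real pipeline would use a much larger list (for example NLTK's).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "was"}


def clean_text(text):
    """Lowercase, strip ASCII punctuation and drop stopwords; return tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def count_features(text):
    """Word count, character count and average word length of the raw text."""
    words = text.split()
    word_count = len(words)
    avg_word_length = (
        sum(len(w) for w in words) / word_count if word_count else 0.0
    )
    return {
        "word_count": word_count,
        "char_count": len(text),  # counts spaces and punctuation too
        "avg_word_length": avg_word_length,
    }


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples (n=2: bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    headline = "The Senate passed the bill, and the President signed it."
    tokens = clean_text(headline)
    print(tokens)
    print(count_features(headline))
    print(ngrams(tokens, 2))
    print(ngrams(tokens, 3))
```

On a pandas DataFrame, functions like these could be applied per row, for example with `df["title"].apply(clean_text)`, where the column name `title` is assumed here rather than taken from the dataset.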