Feature-Engineering-NLP

My project, including the profiling report and interactive visualizations, is available on Kaggle: https://www.kaggle.com/breenda/feature-engineering/data
Textual data is plentiful, but in its raw form it is unorganised and messy. Feature engineering is therefore an important step before training a model to make predictions from that data.
In my notebook, I explore a number of preprocessing and feature engineering techniques to gain insights from a collection of fake and real news articles.

The goal of this project is to explore how raw textual data can be organised to yield useful insights, using a few of the most common feature engineering techniques.

Dataset:

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

Libraries used:

NLTK

TextBlob

Keras

TensorFlow

Hugging Face Transformers

Regex

Pandas

scikit-learn

Features explored:

Computed sentiment subjectivity and polarity

Removed stopwords and punctuation marks, and lowercased the text

Generated lemmatized and stemmed versions of the news title and body

Generated a vocabulary dictionary for the news headlines and plotted a word cloud

Generated word count, character count and average length of phrases used in the news body

Generated bigrams, trigrams, and a TF-IDF matrix of the news body content

Generated word vector embeddings from the pre-trained GloVe model

Generated encoded tokens with the help of BertWordPieceTokenizer

Ran sentiment analysis, named entity recognition, and summarization on the news body
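
To make a few of the features above concrete, here is a minimal sketch of the cleaning step, the count features, n-grams, and a TF-IDF weighting. It is not the notebook's code: it deliberately uses only the Python standard library instead of NLTK or scikit-learn, and the stopword list, the function names (`clean`, `count_features`, `ngrams`, `tfidf`), the sample headlines, and the unsmoothed `log(N / df)` weighting are all assumptions made for illustration. The average here is taken over words, which may differ from how the notebook defines its average length.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and"}


def clean(text):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def count_features(text):
    """Word count, character count, and average word length of the raw text."""
    words = text.split()
    avg = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"word_count": len(words), "char_count": len(text), "avg_word_len": avg}


def ngrams(tokens, n):
    """All contiguous n-token tuples, in order (n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Per-document {term: tf * idf}, with idf = log(N / df) and no smoothing."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append({t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()})
    return rows


# Hypothetical sample headlines, not taken from the dataset.
headlines = [
    "The senate votes on the new budget.",
    "Budget talks stall in the senate!",
]
tokens = [clean(h) for h in headlines]
print(tokens)
print(ngrams(tokens[0], 2))
print(count_features(headlines[0]))
print(tfidf(tokens))
```

Terms that occur in every document get a weight of zero under this unsmoothed formula. scikit-learn's `TfidfVectorizer` smooths and normalizes by default, so its values will not match this sketch.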
