Clickbait-Identification

In this project we compare the properties of clickbait and non-clickbait titles. We then classify the titles using simple classifier models: SVMs, Logistic Regression and XGBoost.
The project works on Hindi text. It was created as the course project for the Spring '21 course Computational Linguistics-1.

Tasks

Preprocess the dataset
Analysis of data
Plotting graphs for each analysis
Making word clouds for the different named entities present
Comparing the list of entities present in Clickbait vs Non-Clickbait
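
The entity comparison in the last task can be sketched roughly as follows. This is a minimal illustration and not the code from the notebooks: it assumes the named entities have already been extracted (for example with polyglot) into one flat list per class, and the function name compare_entities and the sample entities are hypothetical.

```python
from collections import Counter


def compare_entities(clickbait_entities, non_clickbait_entities, top_n=10):
    """Compare entity mentions between the two classes.

    Both arguments are flat lists of entity strings (assumed to be
    extracted beforehand by an NER tool).
    """
    cb = Counter(clickbait_entities)
    ncb = Counter(non_clickbait_entities)
    return {
        # Entities that appear in one class but never in the other.
        "only_clickbait": sorted(set(cb) - set(ncb)),
        "only_non_clickbait": sorted(set(ncb) - set(cb)),
        # Entities that appear in both classes.
        "shared": sorted(set(cb) & set(ncb)),
        # Most frequent entities per class, e.g. as input for a word cloud.
        "top_clickbait": cb.most_common(top_n),
        "top_non_clickbait": ncb.most_common(top_n),
    }


# Hypothetical sample data, for illustration only.
result = compare_entities(
    ["सलमान खान", "दिल्ली", "सलमान खान"],
    ["दिल्ली", "भारत"],
)
```

The frequency counts could also be passed to a word cloud generator, since they already map each entity to how often it occurs.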

Analysis

The analysis is based on the following features:

Number of Tokens
Presence of Question Marks
Presence of Exclamation Marks
Presence of Quotations
Number of Stopwords
Presence of Numerals
The Entities Present
POS Tags

In Clickbait.ipynb, all of the above analyses are carried out for clickbait vs non-clickbait sentences.
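
As a rough illustration of the surface features listed above, the sketch below computes them for a single title using only the standard library. It is not the notebook's code, and it makes several assumptions: whitespace splitting stands in for the NLTK tokenizer, the small STOPWORDS set stands in for the contents of stopword.txt, and the function name extract_features is hypothetical. Entities and POS tags are left out because they need polyglot and stanza.

```python
import re

# Hypothetical stand-in for the words loaded from stopword.txt.
STOPWORDS = {"है", "के", "की", "का", "में", "और", "से", "को"}

# Straight and curly quotation marks.
QUOTE_CHARS = "\"'“”‘’"
# Punctuation stripped from token edges before the stopword lookup.
PUNCT = "?!।,.:;" + QUOTE_CHARS
# ASCII digits and Devanagari digits (U+0966 to U+096F).
NUMERAL_RE = re.compile(r"[0-9\u0966-\u096F]")


def extract_features(title):
    """Return the surface features of one title as a dict."""
    # Whitespace split: a simple stand-in for a proper tokenizer.
    tokens = title.split()
    # Strip surrounding punctuation so "है?" still matches the stopword "है".
    words = [t.strip(PUNCT) for t in tokens]
    return {
        "num_tokens": len(tokens),
        "has_question_mark": "?" in title,
        "has_exclamation_mark": "!" in title,
        "has_quotation": any(ch in QUOTE_CHARS for ch in title),
        "num_stopwords": sum(1 for w in words if w in STOPWORDS),
        "has_numeral": bool(NUMERAL_RE.search(title)),
    }


features = extract_features("इन 5 बातों में क्या खास है?")
```

Applying a function like this to every row (for instance with pandas' apply) gives one feature column per property, which can then be compared between the two classes.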

Classifying Data

Classifier.ipynb contains the scaled version of the generated data and the code to make predictions on the test dataset. We use more than 70% of the training data to train the models and the rest to verify their accuracy.
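
The hold-out evaluation described above can be sketched without any dependencies as shown below. This is only an illustration under stated assumptions: the 0.7 training fraction is a placeholder (the exact split is only given as more than 70%), the helper names are hypothetical, and the notebook may well use a library routine for the split instead.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a training and a held-out part."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical usage: 100 row indices, 70 for training and 30 held out.
train_rows, held_out_rows = train_validation_split(range(100), 0.7, seed=1)
```

A model would be fitted on train_rows only, and accuracy would be computed by comparing its predictions on held_out_rows with their true labels.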

Dataset Used

The dataset contains 41,800 Hindi sentences, each labelled either 0 (Non-Clickbait) or 1 (Clickbait).

Tools used

We have used

pandas to analyse data
re to clean the data
NLTK to tokenize
A custom stopword file (stopword.txt) to identify stopwords
polyglot library for NER
wordcloud Python package and matplotlib.pyplot to make word clouds
stanza to POS tag
matplotlib and seaborn to make the graphs

Results

Classifier 1: Logistic Regression

Highest accuracy achieved was: 0.6874

Classifier 2: SVMs

Highest accuracy achieved was: 0.7068

Classifier 3: XGBoost

Highest accuracy achieved was: 0.7001

report.pdf contains the analysis of the graphs obtained in the process, along with some other observations.
