siddharth271101/AA-task

Python

AA-task

Problem Statement:

Imagine there is a file full of Twitter tweets by various users and you are provided a set of words that indicates racial slurs. Write a program that can indicate the degree of profanity for each sentence in the file.

Approach:

Used TfidfVectorizer to vectorize the text before feeding into a SVM classifier to predict the results.
The SVM classifier was trained 200k labelled samples of clean and profane text.
Since SVM does not natively predict probabilities, Therefore the SVM is fit via the CalibratedClassifierCV class so that it returns a probability for each class instead of just a classification.

Assumptions:

Due to limited time instead of a set of words that indicate racial slurs, I have used a labelled dataset. The dataset contains 2 columns, the first column denotes if the particular text is offensive or not, the second column has the actual text.

Result: