Problem Statement:
- Imagine there is a file full of tweets by various users, and you are given a set of words that indicate racial slurs. Write a program that reports the degree of profanity for each sentence in the file.
Approach:
- The text is vectorized with TfidfVectorizer and then fed into an SVM classifier that predicts the result.
- The SVM classifier was trained on 200k labelled samples of clean and profane text.
- An SVM does not natively predict probabilities, so it is fit via the CalibratedClassifierCV class, which makes it return a probability for each class instead of just a classification. The probability of the profane class can then be read as the degree of profanity.
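
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the choice of LinearSVC as the SVM, the `cv=2` setting, the toy training sentences, the 1 = profane / 0 = clean label encoding, and the helper name `profanity_scores` are all assumptions made for the example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labelled dataset (hypothetical data).
# Assumed encoding: 1 = profane, 0 = clean.
texts = [
    "you are a stupid idiot",
    "shut up you dumb fool",
    "what a stupid dumb loser",
    "you idiot fool get lost",
    "have a lovely day everyone",
    "thanks for the kind help",
    "what a lovely sunny morning",
    "see you at lunch today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feed an SVM; CalibratedClassifierCV wraps the SVM so the
# pipeline exposes predict_proba. cv=2 is only there because the toy set is tiny.
model = make_pipeline(
    TfidfVectorizer(),
    CalibratedClassifierCV(LinearSVC(random_state=0), cv=2),
)
model.fit(texts, labels)


def profanity_scores(sentences):
    """Return the calibrated probability of the profane class per sentence."""
    profane_col = list(model.classes_).index(1)
    return model.predict_proba(sentences)[:, profane_col]


scores = profanity_scores(["you stupid fool", "lovely day for lunch"])
for score in scores:
    print(f"{score:.3f}")
```

With only a handful of training sentences the printed scores are not meaningful; the point is the shape of the pipeline, where each input sentence gets one value between 0 and 1.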
Assumptions:
- Due to limited time, I used a labelled dataset instead of a set of words that indicate racial slurs. The dataset contains 2 columns: the first denotes whether the text is offensive, and the second holds the actual text.
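
As an illustration of the two-column layout described above, the sketch below reads such a file into parallel lists of labels and texts. The CSV format, the absence of a header row, the 1/0 label encoding, the sample rows, and the helper name `load_labelled_rows` are all assumptions; the real dataset may differ.

```python
import csv
import io

# Hypothetical sample in the assumed layout: label first, text second.
SAMPLE = """1,you are a stupid idiot
0,have a lovely day everyone
0,"thanks, see you at lunch"
"""


def load_labelled_rows(handle):
    """Split rows of (label, text) into two parallel lists."""
    labels, texts = [], []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        labels.append(int(row[0]))  # assumed: 1 = offensive, 0 = not offensive
        texts.append(row[1])
    return labels, texts


labels, texts = load_labelled_rows(io.StringIO(SAMPLE))
print(labels)
print(texts)
```

For a real file, an open file handle (for example from `open(path, newline="")`) could be passed in place of the `io.StringIO` object.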