JSON file with 3 columns (body, title, category) Each article falls into one of three categories:
- Engineering,
- Product & Design,
- Startups and Business
- In order to extract features from the text we can use (Bag of Words, N_grams or TF-IDF). And to cluster our categories into subcategories, we can either use the title, or the article body. So, I examined combinations of the above.
- I could reach a classifier accuracy of 85% using TF-IDF on the body, and using the XGBoost Classifier.
- For the Subcategory Clusters, I got different results, but saved those created using BOW on the titles, as I believe titles are more intuitive and representative in clustering, and I think BOW works best with titles than the TF-IDF method.
- Check for NaNs
- Check for Duplicates
- Text Preprocessing
Built a Supervised Learning Model (Classifier) and achieved an accuracy of 85%, using the full article body.
Clustered each of the 3 category to SubCategories