
Persian News Classification using SVM and fastText Pretrained Word Embeddings


Persian News Classification

This project was the fourth assignment of the Computational Intelligence course at Shahid Beheshti University. The goal was to train a model to classify Persian news texts by topic.

Dataset

The dataset was obtained from the Kaggle page for this project. It contains news texts covering around 40 topics:

  1. Number of categories: ~40
  2. Number of news texts in the training set: 150,096
  3. Number of news texts in the test set: 16,678

Data Preprocessing

The most important part of an NLP task is cleaning the data! Both the training and test sets contained many redundant and unnecessary characters, which were removed. The words were then tokenized with the HAZM library. Each preprocessing step is explained in the notebook.
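As a rough illustration of this kind of cleaning, here is a minimal sketch that uses only the Python standard library. It is not the notebook's actual code: the function names `clean_text` and `tokenize`, the choice of which characters to drop, and the whitespace split (standing in for HAZM's tokenizer) are all assumptions made for this example.

```python
import re

# Assumption: anything outside the Arabic Unicode block (U+0600-U+06FF),
# the zero-width non-joiner used in Persian, and whitespace is noise.
NON_ARABIC_SCRIPT = re.compile(r"[^\u0600-\u06FF\u200c\s]")
# Punctuation and digits that live inside the Arabic block.
PUNCT_AND_DIGITS = re.compile(r"[\u060C\u061B\u061F\u06D4\u0660-\u0669\u06F0-\u06F9]")


def clean_text(text):
    """Return text with markup, Latin characters, digits and punctuation removed."""
    # Map Arabic yeh and kaf to their Persian forms so spellings match.
    text = text.replace("\u064A", "\u06CC").replace("\u0643", "\u06A9")
    text = NON_ARABIC_SCRIPT.sub(" ", text)
    text = PUNCT_AND_DIGITS.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Whitespace split; a stand-in for the HAZM tokenizer used in the notebook."""
    return clean_text(text).split()


tokens = tokenize("<p>اخبار ورزشی، 2021!</p>")
print(tokens)
```

In the real pipeline the whitespace split would be replaced by HAZM's tokenizer, which handles Persian-specific cases that a plain split does not.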

Model: SVM and Word Embeddings

  1. The first approach to this problem is to use pretrained word embeddings. For this project, the fastText word embeddings were used. These are 300-dimensional word embeddings released by Facebook researchers. Setup tools and installation instructions are available on their website.

  2. SVMs: A simple linear SVM was trained on top of the embeddings, and it gave the best result!
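One common way to feed word embeddings to an SVM is to average the vectors of a document's tokens into a single fixed-length feature vector. The sketch below shows that step in plain Python. Averaging is an assumption here, since the README does not say how the notebook combines word vectors, and the helper `document_vector` and the 3-dimensional `toy_embeddings` are hypothetical stand-ins for the 300-dimensional fastText vectors.

```python
def document_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in `embeddings`.

    Tokens without a vector are skipped (an assumption of this sketch);
    a document with no known tokens maps to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 3-dimensional vectors standing in for pretrained embeddings.
toy_embeddings = {
    "ورزش": [1.0, 0.0, 2.0],
    "فوتبال": [3.0, 2.0, 0.0],
}

# The third token has no vector and is ignored in the average.
doc = document_vector(["ورزش", "فوتبال", "ناشناخته"], toy_embeddings, dim=3)
print(doc)
```

Stacking one such vector per news text gives a feature matrix that a linear SVM implementation, for example scikit-learn's `LinearSVC`, could be trained on with the category labels.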