text_filter_nltk

Text classification with NLTK and several machine learning classifiers.


Text Spam Filter by using Natural Language Processing

This repository contains a text spam filter built with NLTK and scikit-learn.

Implementation

In this project I implemented the basics of tokenising, part-of-speech tagging, stemming, chunking, and named entity recognition. I then moved on to machine learning and text classification, training a simple support vector classifier, KNN, decision tree, random forest, logistic regression, SGD, and Naive Bayes classifiers. Finally, I used a voting classifier as an ensemble method to improve model accuracy. The dataset comes from the UCI Machine Learning Repository and contains more than 5,000 labelled SMS messages collected for mobile phone spam research.
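The ensemble step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes plain scikit-learn pipelines with bag-of-words features, picks only three of the listed classifiers, and trains on a handful of invented messages rather than the UCI data. The names `build_voting_pipeline`, `TRAIN_TEXTS`, and `TRAIN_LABELS` are hypothetical.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def build_voting_pipeline():
    """Bag-of-words features followed by a hard-voting ensemble."""
    ensemble = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("logreg", LogisticRegression(max_iter=1000)),
            ("tree", DecisionTreeClassifier(random_state=0)),
        ],
        voting="hard",  # each model casts one vote; the majority label wins
    )
    return make_pipeline(CountVectorizer(), ensemble)


# Toy messages invented for illustration; they are NOT from the UCI dataset.
TRAIN_TEXTS = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free ringtone",
    "urgent winner claim cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me later",
    "lunch at home later today",
]
TRAIN_LABELS = ["spam"] * 4 + ["ham"] * 4

model = build_voting_pipeline()
model.fit(TRAIN_TEXTS, TRAIN_LABELS)
predictions = model.predict(["free prize claim", "call me at lunch"])
```

Hard voting takes the majority label across the individual models, so one weak classifier is outvoted when the others agree. On real data the ensemble would be evaluated on a held-out test split rather than on hand-picked messages.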

The project is divided into the following sections:

  • Regular Expressions
  • Feature Engineering
  • Multiple scikit-learn Classifiers
  • Ensemble Methods
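
To give a feel for the regular-expression and feature-engineering sections, here is a small sketch of the kind of message normalisation that typically precedes tokenising. It is an assumption about the general approach, not a copy of the notebook: the placeholder tokens (`emailaddr`, `webaddr`, `moneysymb`, `numbr`) and the function name `normalise` are hypothetical, and the patterns are deliberately simple.

```python
import re

# Each pattern is replaced by a placeholder token so that, for example,
# every URL maps to the same feature. Order matters: e-mail addresses and
# URLs are handled before digits and punctuation are touched.
_SUBSTITUTIONS = [
    (re.compile(r"\S+@\S+\.\S+"), " emailaddr "),
    (re.compile(r"https?://\S+|www\.\S+"), " webaddr "),
    (re.compile(r"[£$€]"), " moneysymb "),
    (re.compile(r"\d+(?:[.,]\d+)*"), " numbr "),
    (re.compile(r"[^\w\s]"), " "),  # drop remaining punctuation
]


def normalise(message: str) -> str:
    """Lower-case a message and replace noisy spans with placeholder tokens."""
    text = message.lower()
    for pattern, replacement in _SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


cleaned = normalise("WIN £1,000 now! Visit http://example.com or mail a@b.co")
# cleaned == "win moneysymb numbr now visit webaddr or mail emailaddr"
```

Collapsing specific amounts, links, and addresses into shared tokens keeps the vocabulary small and lets a classifier learn that "contains a money amount" is informative, whatever the exact figure.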