NLP Classification and Clustering with spam SMS dataset

Natural Language Processing Classification and Clustering

Project Context

This is an SMS spam dataset: a public set of labeled SMS messages collected for mobile phone spam research. The classification goal is to predict whether a message is spam or ham (a legitimate message).

The dataset was downloaded from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection and is also available here in CSV format.
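As a rough illustration of reading the data, the sketch below parses lines of the form `label<TAB>message`. That layout is an assumption about the raw UCI file; the CSV copy may use different delimiters or column names. The helper name `parse_sms_lines` and the sample messages are hypothetical and are not taken from the notebook.

```python
import csv
import io


def parse_sms_lines(raw_text):
    """Parse 'label<TAB>message' lines into (label, message) tuples.

    Assumption: one message per line, label first, tab-separated.
    Lines that do not split into exactly two fields are skipped.
    """
    reader = csv.reader(
        io.StringIO(raw_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        if len(row) != 2:
            continue
        label, message = row
        rows.append((label.strip(), message.strip()))
    return rows


# Made-up sample lines, not real rows from the dataset.
sample = "ham\tSee you at lunch?\nspam\tWIN a FREE prize, reply NOW\n"
records = parse_sms_lines(sample)
print(records)
```

For the real file, the same function could be fed the contents of the downloaded file after reading it with `open(...).read()`.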

Project Introduction

This project applies classification and clustering techniques from Natural Language Processing (NLP). The goals are to predict the message type (ham or spam) and to group similar SMS keywords into a number of clusters.
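The classification goal can be sketched with the TfidfVectorizer and LinearSVC pairing named under Methodologies. This is a minimal sketch and not the notebook's actual code: the toy messages, the default hyperparameters, and the use of `make_pipeline` are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data; the real project would use the SMS dataset.
train_texts = [
    "win a free prize now",
    "free entry, call now to claim your prize",
    "urgent: claim your free cash reward",
    "are we still meeting for lunch today",
    "see you at home tonight",
    "can you call me after the meeting",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TF-IDF turns each message into a weighted term vector;
# LinearSVC then fits a linear decision boundary on those vectors.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

new_messages = ["free prize, call now", "see you at lunch"]
predictions = model.predict(new_messages)
for text, label in zip(new_messages, predictions):
    print(f"{label}: {text}")
```

Wrapping both steps in one pipeline keeps the vocabulary learned during `fit` tied to the classifier, so new messages are vectorized the same way at prediction time.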

Methodologies

LinearSVC and TfidfVectorizer (Classification)

K-Means-Clustering (Clustering) -> In Progress

Creator

Jerry Ng (City University of Hong Kong)