This is an SMS spam dataset: a public set of labeled SMS messages collected for mobile phone spam research. The classification goal is to predict whether a message is spam or ham (a legitimate message).
The dataset was downloaded from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection and is also available here in CSV format.
Classification and clustering techniques from Natural Language Processing (NLP) are applied. The goal is to predict the message type (ham or spam) and to group similar SMS keywords into a number of clusters.
LinearSVC and TfidfVectorizer (Classification)
K-Means-Clustering (Clustering) -> In Progress
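
The classification approach listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it assumes scikit-learn is installed, and the toy messages, the `build_classifier` helper, and the chosen parameters (`stop_words="english"`, `random_state=0`) are hypothetical and not taken from the dataset or the repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy messages for illustration only; NOT rows from the dataset.
messages = [
    "Free entry! Win a cash prize now, text WIN to claim",
    "URGENT: you have won a free prize, call now to claim",
    "Win free cash now, claim your prize",
    "Are we still meeting for lunch today?",
    "I will be home late tonight",
    "Can you pick up some milk on the way home?",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]


def build_classifier():
    """Chain TF-IDF feature extraction with a linear SVM classifier."""
    return make_pipeline(
        # Turn each message into a vector of TF-IDF weighted word counts.
        TfidfVectorizer(lowercase=True, stop_words="english"),
        # Fit a linear decision boundary between the ham and spam vectors.
        LinearSVC(random_state=0),
    )


model = build_classifier()
model.fit(messages, labels)

# Classify two unseen (also made-up) messages.
predictions = model.predict(["Claim your free prize now", "See you at home later"])
print(list(predictions))
```

On the real data, the messages and labels would be read from the CSV file instead of being typed inline, and part of the data would be held out to measure accuracy.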
Jerry Ng (City University of Hong Kong)