The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.
- Python 3.6
- Apache Spark
- PySpark: Python API for Apache Spark
- Jupyter Notebook
!pip install pyspark
- We removed unwanted expressions from the data set.
- We tokenized the messages and used a count vectorizer to build the features.
- We converted the string labels to integer format.
- We tried different algorithms for classification, such as:
  - Random Forest Classification
  - Logistic Regression
  - Naive Bayes Classification
  - K-Means clustering
  - Support Vector Machine
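
The preprocessing steps above can be illustrated with a minimal plain-Python sketch. This is not the project's PySpark code: it only mimics, on three made-up messages, what cleaning, tokenizing, count vectorizing, and label indexing produce. In PySpark these steps would typically be handled by classes such as `RegexTokenizer`, `CountVectorizer`, and `StringIndexer` from `pyspark.ml.feature`. The helper names (`clean`, `tokenize`, `build_vocabulary`, `count_vector`, `index_labels`), the cleaning regex, and the sample messages are all assumptions made for illustration.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text and replace every non-letter, non-space character with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split the cleaned text on whitespace."""
    return clean(text).split()


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, most frequent token first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ordered)}


def count_vector(tokens, vocab):
    """Return a dense vector of token counts, one slot per vocabulary entry."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


def index_labels(labels):
    """Map string labels to integers; the most frequent label gets index 0."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
    mapping = {lbl: i for i, lbl in enumerate(ordered)}
    return [mapping[lbl] for lbl in labels], mapping


# Hypothetical sample data, not taken from the SMS Spam Collection.
messages = [
    ("ham", "Are we still meeting at 5pm today?"),
    ("spam", "WIN a FREE prize!!! Call now to claim your prize"),
    ("ham", "Call me when you are free"),
]

labels = [lbl for lbl, _ in messages]
token_lists = [tokenize(txt) for _, txt in messages]
vocab = build_vocabulary(token_lists)
features = [count_vector(tokens, vocab) for tokens in token_lists]
label_ids, label_map = index_labels(labels)
```

Each row of `features` can then be paired with the matching entry of `label_ids` and passed to a classifier.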
Algorithm | Accuracy |
---|---|
Random Forest Classification | 0.94 |
Logistic Regression | 0.95 |
Naive Bayes Classification | 0.98 |
Support Vector Machine | 0.96 |
K-Means clustering | 0.81 |
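
For reference, accuracy is the fraction of messages whose predicted label matches the true label. The sketch below shows that computation in plain Python. The `accuracy` helper and the sample label lists are hypothetical, and the README does not state how the scores in the table were actually computed (in particular for the clustering model).

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 0 = ham, 1 = spam.
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]  # one spam message is missed

score = accuracy(y_true, y_pred)  # 9 of 10 predictions match
```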
There are no specific guidelines for contributing. If you see something that could be improved, send a pull request! If you think something should be done differently (or is just plain broken), please create an issue.
Anubhav Nigam
Anant Tripathi
Priyank Malviya
https://towardsdatascience.com/multi-class-text-classification-with-pyspark-7d78d022ed35
https://spark.apache.org/docs/2.1.0/ml-features.html
https://www.kaggle.com/benvozza/spam-classification/data
This project is licensed under the MIT License.