Nowadays, most people have access to the Internet, and they cannot survive without Smartphone and computers. They not only use the Internet for fun and entertainment, but they also use it for business, stock marketing, searching, sending e-mails, and so on. Hence, the usage of the Internet is growing rapidly.
One of the threats for such technology is a spam. Spam has a lot of definitions, as it is considered one of the complex problems in e-mail services. Spam is a junk mail/message, or an unsolicited mail/message. Spam e-mails are also those unwanted, unsolicited e-mails that are not intended for a specific receiver. It is basically an online communication sent to the user without permission. It takes on various forms like adult content, selling products or services, job offers, and so forth. The spam has increased tremendously in the last few years.
The good, perfect, and official mails are known as ham. It is also defined as an e-mail that is generally desired. Today, more than 85% of mails or messages received by users are spam. It costs the sender very little time to send, but most of the costs are paid by the recipient or the service providers rather than by the sender.
The cost of spam can also be measured in lost human time, lost server time and loss of valuable mail/messages. The sent mail reserves a quota in the server, and the receiver may have a limited space, causing the server to reject another ham mail because it is out of space. Moreover, the reader may lose a lot of time in reading unuseful messages.
The Machine Learning field has a robust, ready-made and alternative way for solving this type of the problem. We have used the power of Machine Learning and built a classifier, which helping us to classify which message is/are spam or ham depending upon their content.
App link 1:
https://spamhamprediction.herokuapp.com/
App link 2:
https://spamhampredictioncicd.herokuapp.com/
Note: In link 2 CI-CD pipeline has been integrated using CircleCI
Spam_Ham_Demo_.mp4
- Python 3.7 and more
- Important Libraries: sklearn, pandas, numpy, matplotlib & seaborn
- Front-end: HTML, CSS
- Back-end: Flask framework
- IDE: Jupyter Notebook, Pycharm & VSCode
- Database: MongoDB
- Deployment: Heroku
- CI-CD Pipeline: CircleCI
- UnitTest Case: Pytest
- Version Control Sysytem: Git
- Remote Repository: Github
- Containerization: Docker
Code is written in Python 3.7 and more. If you don't have python installed on your system, click here https://www.python.org/downloads/ to install.
Create virtual environment - conda create -n myenv python=3.7 Activate the environment - conda activate myenv Install the packages - pip install -r requirements.txt Run the app - python app.py
A very simple UI where text area is provided. In this text area, User need to write their message into this text area and then they can classify their message by clicking on predict button.
Spam Ham Data Set from UCI Machine Learning Repository.
Link:https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
- Drop unnecessary columns.
- Drop duplicates rows from dataset.
- Rename required columns.
- Encode target column using LabelEncoder.
- LowerCase
- Tokenization
- Removing Special Characters
- Removing stop words and punctuation
- Stemming
- Various classification model created.
- Algorithms used are Naive Bayes, Logistic Regression, DecisionTreeClassifier,RandomForestClassifier, AdaBoostClassifier, Bagging Classifier etc.
- Multinomial naive_bayes classiifer has given good accuracy and precision.
- Model performance evaluated based on accuracy, precision.
- Final Model selected is deployed over Heroku cloud environment using WebApp created from Flask framework.
Upendra Kumar: https://www.linkedin.com/in/imupendra/
Hello Reader if you find any bug please consider raising issue I will address them asap.