Primary language: Jupyter Notebook. License: MIT.

KNN-and-Naive-Bayes-practice

Week 9 IP

Titanic dataset and Spambase dataset analysis

Moringa Data Science Core W8 Independent Project

About The Project

This week we learnt about machine learning, specifically KNN models and Naive Bayes. I will put what I learnt to the test with a dataset about the Titanic disaster and a spam email dataset.

I intend to find out whether the independent variables, such as age, gender and ticket, can be used to accurately predict the likelihood of a passenger dying.
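As an illustration of the idea behind KNN, here is a minimal, dependency-free sketch. It is not the code from the notebook: the function name `knn_predict`, the three features, and all rows and labels are hypothetical and made up for the example. It predicts a label by majority vote among the `k` closest training rows.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every training row
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest rows and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up rows: (age scaled to 0-1, sex as 0 = female / 1 = male, class scaled to 0-1)
train_X = [
    (0.30, 0, 0.0),
    (0.45, 0, 0.5),
    (0.25, 1, 1.0),
    (0.60, 1, 1.0),
    (0.35, 1, 0.5),
]
train_y = [1, 1, 0, 0, 0]  # 1 = survived, 0 = died (made-up labels)

prediction = knn_predict(train_X, train_y, query=(0.40, 1, 1.0), k=3)
print(prediction)
```

Because KNN relies on distances, features are scaled to a similar range here so that no single column dominates the vote.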

I also intend to predict the likelihood of a message being spam or not using the second dataset.
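The sketch below shows how a Gaussian Naive Bayes classifier could work on numeric features such as word frequencies. It is an assumption-laden illustration, not the notebook's code: the helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the two features, and the toy rows are all hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature means and variances for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(row, means, variances):
            # Log of the Gaussian density for this feature
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up rows: (frequency of the word "free", frequency of the character "!")
X = [(0.8, 0.9), (0.6, 0.7), (0.9, 0.5), (0.0, 0.1), (0.1, 0.0), (0.05, 0.2)]
y = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam (made-up labels)

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (0.7, 0.8)))
print(predict_gaussian_nb(model, (0.02, 0.1)))
```

Working with log probabilities avoids numerical underflow when many features are multiplied together.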

I think I successfully created models and tested their accuracy. However, there is always room for improvement.

My best model on the spam message classification achieved an accuracy score of 90%. My best model on the Titanic dataset achieved an accuracy score of 77%.

Suggested points of improvement are to achieve an even better accuracy score and to add more visualizations for each model.

Built With

Here are the major tools that we used for the data analysis

Usage

I did this analysis with the intention of improving my skills in machine learning. However, the models used in this analysis could be used by engineers to improve the safety standards of their ships, and by passengers to select the safest way to travel.

At the moment, the model is tuned to predict the likelihood of a passenger perishing on a ship. However, it can easily be edited to work with other modes of transport such as rail or air.

Regarding the analysis of spam messages, I did this analysis with the intention of improving my skills in supervised machine learning. However, the models here could be used by mobile phone operators or independent software developers like Truecaller to identify spam messages or even emails.

At the moment, the model is tuned to predict the likelihood of a message being spam. However, it can easily be edited to work with other lines of communication such as emails and calls.

Roadmap

Following my analysis, I identified some gaps in the data and would like to continue improving the dashboard and analysis in order to come up with more accurate predictions.

Some of the data that would have been nice to have are:

  1. Better structured datasets. I had a particularly hard time setting column names on the Spambase dataset.
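One possible way to ease the column-name problem is sketched below. It assumes the dataset ships as a headerless CSV alongside a `.names` file whose attribute lines look like `name: type.` with `|` starting a comment; that layout, the helper `parse_names`, the label column name `is_spam`, and the sample text are assumptions made for illustration.

```python
import csv
import io

def parse_names(names_text):
    """Collect attribute names from lines shaped like 'name: type.'."""
    columns = []
    for line in names_text.splitlines():
        line = line.split("|", 1)[0].strip()  # drop anything after a '|' comment
        if ":" not in line:
            continue  # skip blank lines and the class-label line
        name, _, _ = line.rpartition(":")
        columns.append(name.strip())
    return columns

# Tiny made-up stand-ins for the real files
names_sample = """| comment line
1, 0.    | spam, non-spam classes

word_freq_make:         continuous.
word_freq_free:         continuous.
char_freq_!:            continuous.
"""
data_sample = "0.0,0.32,0.778,1\n0.21,0.0,0.0,0\n"

# The label is the last CSV field and is not listed as an attribute line
columns = parse_names(names_sample) + ["is_spam"]
rows = [
    dict(zip(columns, map(float, record)))
    for record in csv.reader(io.StringIO(data_sample))
]
print(columns)
print(rows[0])
```

With pandas, the same `columns` list could be passed as `names=columns` together with `header=None` to `pandas.read_csv`.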

On a side note

Similar datasets for different modes of transport and lines of communication would be great for predicting different scenarios.

Contributing

We would love to continue improving this analysis. Please contribute. 😃 😃

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch
  3. Commit your Changes
  4. Push to the Branch
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Ian Tirok - Ian - ian.tirok@gmail.com