This is something I've wanted to do for a long time: ML in basic Python.
No fancy frameworks, just basic Python data structures to code up simple ML algos like Linear Regression.
I didn't even use numpy, since... let's do it the hard way 😃.
Also, we'll appreciate numpy only once we know what happens in its absence.
The code is well commented, and I'm providing links to understand the theory.
Many might feel that the code is inefficient and verbose, but that is not the point here. The ideal audience is experienced developers who don't touch ML, thinking that it requires a lot of framework meta-learning. You may start here, understand what happens at ground level, appreciate the work already done, and then start using frameworks.
Most importantly, here you'll understand how an ML algorithm works exactly and what its basic building blocks are. This helps a lot in applied ML, since we get to know where to tweak and how to tweak.
Is this reinventing the wheel? Yes. But sometimes we learn a lot when we start from scratch.
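To see what working without numpy means in practice, here is a minimal sketch of the kind of vector helpers these implementations rely on (illustrative names, not the exact helpers used in this repo):

```python
# Plain-Python stand-ins for operations numpy would provide.

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

print(dot([1, 2, 3], [4, 5, 6]))  # 32
print(mean([1, 2, 3, 4]))         # 2.5
```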
The original dataset can be found here. I only used a part of the entire dataset; find the data in the `weatherData` folder.
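For reference, here is a minimal sketch of simple linear regression trained with gradient descent in plain Python (an illustrative sketch, not necessarily this repo's exact implementation):

```python
def train_linear_regression(xs, ys, lr=0.01, epochs=1000):
    """Fit y = m*x + b by gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE with respect to slope m and intercept b
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

m, b = train_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)  # approaches m = 2, b = 1
```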
[1] https://www.geeksforgeeks.org/linear-regression-python-implementation/
[2] https://realpython.com/linear-regression-in-python/
[3] https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
[4] https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
The original dataset can be found here. I used a part of the entire dataset; find the data in the `irisData` folder.
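Here is a minimal sketch of binary logistic regression with batch gradient descent in plain Python (illustrative names, not necessarily the repo's exact code):

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the cross-entropy loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        # Predictions and errors with the current parameters
        preds = [sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) for row in X]
        errors = [p - yi for p, yi in zip(preds, y)]
        # Gradient step for each weight and the bias
        for j in range(len(w)):
            w[j] -= lr * sum(e * row[j] for e, row in zip(errors, X)) / n
        b -= lr * sum(errors) / n
    return w, b

w, b = train_logistic_regression([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
print(w, b)  # decision boundary ends up near x = 1.5
```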
[1] https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f
[2] https://en.wikipedia.org/wiki/Logistic_function
[3] https://github.com/leventbass/logistic_regression/blob/master/Logistic_Regression.ipynb
The original dataset can be found here. Find the data in the `golfData` folder.
NB: This works only for categorical features, not continuous ones. We need algorithms like Gaussian Naive Bayes to handle continuous features.
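The core of categorical Naive Bayes is just counting. Here's a minimal sketch in plain Python (the toy weather data below is made up for illustration; the repo's code may differ):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """rows: list of feature tuples; labels: the matching class labels."""
    priors = Counter(labels)     # class counts
    counts = defaultdict(int)    # (feature index, value, label) counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, value, label)] += 1
    return priors, counts

def predict(row, priors, counts, total):
    """Pick the label maximizing prior * product of per-feature likelihoods."""
    best_label, best_score = None, -1.0
    for label, prior_count in priors.items():
        score = prior_count / total
        for i, value in enumerate(row):
            score *= counts[(i, value, label)] / prior_count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [("sunny", "hot"), ("rainy", "mild"), ("sunny", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
priors, counts = train_naive_bayes(rows, labels)
print(predict(("sunny", "hot"), priors, counts, len(rows)))  # "no"
```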
[1] https://www.hackerearth.com/blog/developers/introduction-naive-bayes-algorithm-codes-python-r
[2] https://www.geeksforgeeks.org/naive-bayes-classifiers/
[3] https://towardsdatascience.com/naive-bayes-explained-9d2b96f4a9c0
The original dataset can be found here. I used a part of the entire dataset; find the data in the `irisData` folder.
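Gaussian Naive Bayes replaces count-based likelihoods with a normal distribution fitted per feature and per class. A minimal sketch of those two pieces (illustrative, not the repo's exact code):

```python
import math

def fit_feature(values):
    """Mean and variance of one continuous feature within one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_pdf(x, mu, var):
    """Likelihood of x under the normal distribution N(mu, var)."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = fit_feature([4.9, 5.1, 5.0, 4.8])  # e.g. sepal lengths of one class
print(gaussian_pdf(5.0, mu, var))
```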
[1] https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
[2] http://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
[3] https://hackernoon.com/implementation-of-gaussian-naive-bayes-in-python-from-scratch-c4ea64e3944d
The Text Classifier includes:
- Converting text to numerical features - Done
  - Using Term Frequency (TF)
  - Using Term Frequency - Inverse Document Frequency (TF-IDF); see the sketch below
- Attaching a classifier algorithm like Naive Bayes - the Gaussian Naive Bayes classifier
- Training and testing the model
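Here is a minimal TF-IDF sketch in plain Python (the toy documents are made up for illustration, and the repo's tokenization and weighting details may differ):

```python
import math
from collections import Counter

docs = [
    "free prize claim now",
    "meeting schedule for monday",
    "claim your free offer",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

def tf_idf(doc):
    """TF-IDF vector of a tokenized document over the shared vocabulary."""
    counts = Counter(doc)
    vec = []
    for word in vocab:
        tf = counts[word] / len(doc)                    # term frequency
        df = sum(1 for d in tokenized if word in d)     # document frequency
        vec.append(tf * math.log(len(tokenized) / df))  # weight by rarity
    return vec

print(tf_idf(tokenized[0]))  # mostly zeros: a sparse representation
```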
Zero Variance Problem
This issue can arise when:
- There is an unobserved label or feature value in the training data that is present in the test data
- There are few or no examples in the training data to calculate the likelihood for a feature
...
The ways to mitigate the issue are:
- Do not multiply in the likelihoods of zero-variance features
- Use an epsilon value to handle zero variance during the calculation of the Gaussian function, i.e. the likelihood
...
I chose the second option, but it is not very effective as the vocabulary size, i.e. the number of features, grows. Currently I chose an epsilon of 0.1 for 50 training examples.
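Concretely, the workaround amounts to padding the variance before evaluating the Gaussian, roughly like this (a sketch of the idea; 0.1 is the value mentioned above, not a universally good choice):

```python
import math

EPSILON = 0.1  # the value used here for 50 training examples

def gaussian_likelihood(x, mu, var):
    """Gaussian PDF with variance smoothed so it can never be zero."""
    var = var + EPSILON
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# With raw var = 0.0 the plain formula divides by zero; this stays finite.
print(gaussian_likelihood(0.0, 0.0, 0.0))  # ≈ 1.26
```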
I observed that my Gaussian NB classifier works well on the Iris dataset, which is balanced and dense (few zero values). When it comes to a sparse representation like TF-IDF on a biased dataset like `spamData`, things go wrong.
Feel free to come up with a better solution for this issue.
The original dataset can be found here. Find the data in the `spamData` folder.
[1] https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/
[2] https://www.geeksforgeeks.org/tf-idf-model-for-page-ranking/
[3] https://gist.github.com/sebleier/554280
The original dataset can be found here. Find the data in the `airportData` folder.
Plots can be found in the `plots/` folder once the script is run. The root mean square error is also calculated and printed for evaluation.
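For reference, root mean square error in plain Python is just a few lines (an illustrative helper, not necessarily the repo's exact function):

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length series."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

print(rmse([100, 120, 140], [110, 115, 150]))  # ≈ 8.66
```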
[1] https://www.analyticsvidhya.com/blog/2018/02/time-series-forecasting-methods/
The original dataset can be found here. I used a part of the entire dataset; find the data in the `clusterData` folder.
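K-Means alternates between assigning points to their nearest centroid and moving each centroid to the mean of its cluster. A minimal sketch on 2-D points (illustrative; `kMeansClustering.py` may differ in its details):

```python
import math
import random

def k_means(points, k, iterations=100):
    """Lloyd's algorithm on a list of 2-D tuples."""
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = k_means(points, k=2)
print(centroids)  # two centers, one near each blob
```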
I've plotted the clusters after training the K-Means algorithm; the plot is generated in the `plots` folder when you run `kMeansClustering.py`.
This plot is compared against the original labels of the data to understand whether the clustering is doing a good job. Check out `plots/clusterGraph.png` and `plots/clusterGraphFromLabels.png` for this comparison.
[1] https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1