This is a small Python project that uses scikit-learn to classify penguins in the Palmer penguins dataset by species, given their bill features. It was a personal project that I used to learn about support vector machines in scikit-learn.
To run the code you will need a Python 3 installation with the packages listed in environment.yml. To create an environment with these packages using the Anaconda distribution, run the following conda command in the repo directory:
conda env create -f environment.yml
This will create an environment called penguin-models. You can activate the environment with:
conda activate penguin-models
And deactivate it with:
conda deactivate
See the conda documentation for further information on environments.
To run the analysis, start an IPython shell:
ipython
Then import the analysis module and call its run function:
import analysis
analysis.run()
This will load the data, train the models, and create the plots in the plots directory. There is an index.html file in the plots directory that shows all of the plots in an annotated webpage.
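To give a feel for the training step without opening the code, here is a minimal sketch of a support vector classifier on two features and three classes. It is not the code in the analysis module: the data are synthetic blobs from make_blobs standing in for two bill measurements and three species, and the function name train_svm_sketch and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_svm_sketch(random_state=0):
    """Fit a scaled RBF-kernel SVM on synthetic two-feature, three-class data."""
    # Synthetic stand-in for the real data: 2 features, 3 classes.
    X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                      cluster_std=1.0, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state)
    # SVMs are sensitive to feature scale, so standardize before fitting.
    model = make_pipeline(StandardScaler(),
                          SVC(kernel="rbf", C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    # Return the fitted pipeline and its accuracy on the held-out split.
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    model, accuracy = train_svm_sketch()
    print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Putting the scaler and the classifier in one pipeline means the scaling is fitted on the training split only and is reapplied automatically at prediction time.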
The plots use a custom matplotlib theme called eda. In the plots module this is loaded from the file style/eda.mplstyle.
If you want to use this style in other projects, you can copy the file into your matplotlib style library, which is normally located at ~/.matplotlib/stylelib (on Linux it is typically ~/.config/matplotlib/stylelib instead). You can then load it with:
import matplotlib.pyplot as plt
plt.style.use(['eda'])
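If you are not sure where the style library lives on your machine, the copy step can be sketched in Python. This is an illustrative helper and not part of the repo: the name install_style and its config_dir parameter are assumptions, while matplotlib.get_configdir() is the matplotlib call that reports the configuration directory.

```python
import shutil
from pathlib import Path


def install_style(style_file, config_dir=None):
    """Copy a .mplstyle file into <config_dir>/stylelib and return the new path."""
    if config_dir is None:
        # Ask matplotlib where its configuration directory is on this machine.
        import matplotlib
        config_dir = matplotlib.get_configdir()
    stylelib = Path(config_dir) / "stylelib"
    # The stylelib directory may not exist yet on a fresh installation.
    stylelib.mkdir(parents=True, exist_ok=True)
    destination = stylelib / Path(style_file).name
    shutil.copyfile(style_file, destination)
    return destination


if __name__ == "__main__":
    # Run from the repo directory so the relative path resolves.
    print(install_style("style/eda.mplstyle"))
```

matplotlib scans the style library when it is imported, so you will probably need to start a new Python session (or call plt.style.reload_library()) before the eda style becomes available by name.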
I've been learning how to use scikit-learn with Aurélien Géron's book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. It's really good. This project applies what Géron teaches in the book to a new dataset.