These are ML pipelines created as an exercise in Udacity's Machine Learning DevOps Engineer course.
In this project, participants were asked to build a pipeline that trains and tests a random forest model. The data consists of a number of song features, with the genre of the song as the target.
The training pipeline used is illustrated in the following figure:
The imported data is in Parquet format and is read using Arrow. A data check validates the structure of the data and the standards set for it (see the run.py file in the check_data folder). Preprocessing fills in the missing values and standardizes the numeric columns. The data is then split into training and test sets, and a random forest model is trained on the training data. Evaluated on the test data, the model reaches an AUC of 0.95.
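The preprocessing and splitting steps described above can be sketched in plain Python. This is not the code from the repository. The median fill strategy, the 30% test fraction, the fixed seed, the helper names, and the toy `tempo` and `genre` values are all assumptions made for illustration, and model training is left out.

```python
import random
import statistics


def fill_missing(values):
    """Replace None entries with the median of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: one numeric feature and a genre label per song.
tempo = [120.0, None, 90.0, 150.0, None, 100.0, 130.0, 110.0, 95.0, 140.0]
genre = ["rock", "jazz", "jazz", "rock", "rock", "jazz", "rock", "jazz", "jazz", "rock"]

tempo_clean = standardize(fill_missing(tempo))
rows = list(zip(tempo_clean, genre))
train_rows, test_rows = train_test_split(rows)
```

In a real pipeline the fill value and the standardization statistics would normally be computed on the training split only and then applied to the test split, so that nothing about the test data leaks into training. The sketch processes the whole column first only to stay short.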
- Make sure conda is installed on your computer.
- Install the wandb library (`pip install wandb`).
- Create a wandb account (www.wandb.ai).
- Get your wandb API key: run `wandb login`, open the URL that appears in your terminal, and paste the key into the terminal.
- Run the project (`mlflow run -v 1.0.0 https://github.com/mohrosidi/genre_classification.git`).
- When the run is finished, open your wandb account to see the report for your project.