This project involves the development of four different predictive models for the MNIST dataset and provides an interactive program for users to evaluate them.
- `notebook.ipynb`: Jupyter Notebook containing the data exploration, preprocessing, model training, and validation.
- `README.md`: Explanation of the project structure and decision-making process.
- Logistic Regression
- Random Forest
- Neural Network
- Bayesian Logistic Regression
- The MNIST dataset was normalized and split into training, validation, and test sets.
- Standardization was applied to improve model performance (a preprocessing sketch follows this list).
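A minimal sketch of these preprocessing steps, assuming scikit-learn and `fetch_openml` as the data source; the split ratios and variable names are illustrative, not the notebook's exact code:

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load MNIST and normalize pixel values to [0, 1].
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

# Split into training, validation, and test sets (60/20/20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Standardize using statistics estimated on the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```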
- Logistic Regression: Chosen for its simplicity and interpretability.
- Random Forest: Selected for its ability to handle non-linear relationships.
- Neural Network: Implemented to capture complex patterns in the data.
- Bayesian Logistic Regression: Utilized to incorporate prior information and provide probabilistic predictions. (A sketch of possible instantiations follows this list.)
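A sketch of how the first three models might be instantiated with scikit-learn; the hyperparameters are illustrative, not the notebook's actual settings. The Bayesian logistic regression typically requires a dedicated library (e.g. PyMC) or a Laplace approximation, so it is omitted here:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Illustrative configurations; the notebook's actual choices may differ.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(128,), max_iter=50, random_state=0),
}
```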
- The code is organized into functions and scripts to maintain modularity and readability.
- Standard coding practices such as PEP 8 were followed.
- Open and run `notebook.ipynb` to see the data exploration, preprocessing, and model training steps.
- Follow the interactive prompts to choose and evaluate models on random samples from the MNIST dataset (a hypothetical sketch of this loop follows).
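The interactive evaluation might look like the following sketch; the prompt text and function name are hypothetical, not the notebook's actual code:

```python
import numpy as np

def interactive_evaluation(models, X_test, y_test, n_samples=5):
    """Hypothetical prompt loop: pick a fitted model, score it on random test digits."""
    name = input(f"Choose a model {list(models)}: ")
    model = models[name]
    idx = np.random.choice(len(X_test), size=n_samples, replace=False)
    for pred, true in zip(model.predict(X_test[idx]), y_test[idx]):
        print(f"predicted {pred}, true {true}")
```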
The project demonstrates the application of data science techniques and best coding practices to develop and evaluate predictive models.
Model selection implies uncertainty about the true model: each elaborated model is only a potential model. To remain coherent with probability theory, which is what guides the final selection, we adopt a Bayesian approach. We assume that a priori all models are equally probable and, based on how well each model predicts each digit, we compute each model's posterior probability. With $p$ candidate predictors there is a total of $2^p$ possible models.
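Concretely, with a uniform prior $P(M_k) = 2^{-p}$ over the candidate models, the posterior probability of model $M_k$ given data $y$ follows from Bayes' theorem (a standard formula, stated here for completeness):

$$
P(M_k \mid y) = \frac{p(y \mid M_k)\, P(M_k)}{\sum_{j=1}^{2^p} p(y \mid M_j)\, P(M_j)} = \frac{p(y \mid M_k)}{\sum_{j=1}^{2^p} p(y \mid M_j)},
$$

where $p(y \mid M_k)$ is the marginal likelihood of model $M_k$.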
- If the goal is selection, we can choose, and justify choosing, the most probable model: the one with the largest posterior model probability. This concludes the model selection.
- If the goal is prediction, selecting a single model makes little sense; it is better to combine many models to predict the response. Using the posterior model probabilities, we compute the prediction from each of the $2^p$ models and average these predictions weighted by their posterior probabilities. This leads to better predictions (see the sketch after this list).
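A minimal sketch of this model-averaging step, assuming each candidate model exposes `predict_proba` and that `posteriors` holds the posterior model probabilities (all names are illustrative):

```python
import numpy as np

def bma_predict(models, posteriors, X):
    """Bayesian model averaging: posterior-weighted average of class probabilities."""
    # Each model contributes its predictive distribution, weighted by its
    # posterior probability; the weights are assumed to sum to one.
    avg = sum(w * m.predict_proba(X) for m, w in zip(models, posteriors))
    return np.argmax(avg, axis=1)  # most probable digit under the averaged distribution
```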
The Bayesian analysis automatically accounts for:
- model goodness of fit: the better a model fits the data, the more probable it is among models of the same complexity;
- model complexity: the more complex a model is, the less probable it is among models with the same goodness of fit.
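This trade-off arises automatically from the marginal likelihood, which integrates the likelihood over the prior and thereby penalizes models that spread prior mass over many parameter configurations (the Occam factor):

$$
p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k .
$$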
In the end, using the Bayesian approach we obtain the most probable prediction, and we also know how probable it is. Moreover, following Hartigan (1966), the Bayesian approach is superior in terms of frequentist performance: it requires a smaller sample size to reach a model that performs better on the test set than non-Bayesian models of the same complexity.