MacarocoFonseca/Loan_Default_Prediction
1. Business Understanding

In applying Machine Learning techniques to solve business problems, it is necessary to follow a project lifecycle to ensure that the implemented model is aligned with the objective and takes into consideration all aspects of the business issue at hand. This lifecycle is typically the following:

- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

However, this lifecycle is not a linear process: depending on the outcomes of each stage, analysts may need to return to a previous stage and adapt their analysis until satisfactory results are obtained. For this reason, the business understanding part of the analysis is crucial. It sets the objectives of the study and makes it possible to understand, from a business perspective, how each available variable can influence these objectives, how real-life events and phenomena might affect the outcome, and how to account for the variability and unpredictability of financial and business data.

In this study, we address the problem financial institutions face in assessing the likelihood of default of potential borrowers, so that they can take the right decision when approving or rejecting loan applications. Our business understanding objective is to determine which clients will default on a loan and which will not.

This should be done through the analysis of data provided by clients or collected from historical performance. Variables such as the borrower's income, the number of years they have been employed and their home ownership status inform firms about the financial strength of borrowers and about whether they hold collateral as insurance against default; they are important determinants of whether a borrower will default. Historical credit information is also crucial in determining the current likelihood of default, as past patterns are likely to be repeated.
Hence, banks can use previous credit ratings and the number of loans already outstanding to determine whether the borrower has been reliable in the past or is currently overleveraged, and therefore likely to default on new loans. Finally, the nature, amount and term of the loan a borrower is applying for can also affect their likelihood of default, as loans with a longer maturity, a higher principal or certain purposes can lead to higher default probabilities. All these variables should be taken into consideration, as each is expected to be related to a borrower's default outcome.

Applying a supervised learning algorithm will allow us to use existing datasets containing this information and observe whether and how it relates to the status of the associated loans over the same timeframe that we would consider for future predictions. We could therefore train a model on existing historical data containing the features described above. This data would need to be collected at least one period before our prediction date, with that period being at least as long as the time needed for the variables to affect our target. The variables selected should always be available to us at the time predictions have to be made, that is, at the time a loan is requested. For example, the outstanding amounts on a loan or the number of loans already paid could not be used as predictors of default, because they would not be known at the time of the application.

Finally, the trained model would be evaluated on a test dataset for which the outcome is already known, in order to verify the accuracy of our predictions before it is deployed to predict default on future loans. Our data mining objectives are to predict, with high accuracy and F1-score, the binary outcome of the model and the probability of default. We consider accuracy and F1-score the most appropriate metrics because both false positives and false negatives should be minimized as much as possible.
Indeed, a bank would lose money either by approving a loan to a customer who then defaults, or by losing the business of a non-defaulting customer who is falsely flagged as a defaulter. We will also attempt to maximize our ROC-AUC score, an important metric that represents the ability of a model to distinguish between classes.

The results of the evaluation and deployment stages might lead us to reassess the data understanding, preparation and modeling stages in order to fine-tune our results. Our analysis will start with the description and analysis of our dataset, as developed in the next part.
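
As a small illustration of the evaluation metrics named above, the sketch below computes accuracy, F1-score and ROC-AUC in plain Python. The labels and predicted probabilities are made up for the example and are not taken from the project's dataset. The helper names (`confusion_counts`, `accuracy`, `f1`, `roc_auc`) and the 0.5 decision threshold are assumptions introduced here, not the project's actual code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 (default) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaulters correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good customers wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaulters that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good customers correctly approved
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Probability that a random defaulter gets a higher score than a random
    non-defaulter, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.05]

# Assumed decision threshold of 0.5 to turn probabilities into a binary outcome.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.750
print(f"F1-score = {f1(y_true, y_pred):.3f}")        # F1-score = 0.667
print(f"ROC-AUC  = {roc_auc(y_true, y_prob):.3f}")   # ROC-AUC  = 0.867
```

Accuracy and F1-score depend on the chosen threshold, whereas ROC-AUC is computed from the predicted probabilities directly. In a notebook, scikit-learn's `accuracy_score`, `f1_score` and `roc_auc_score` would normally replace these hand-written helpers.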