rishabhpahuja/Credit-Score-

With the increasing demand for credit facilitated by technologies such as the credit card, modern credit evaluation systems cannot rely on judgmental approaches that subjectively evaluate every individual. It is important to develop objective, unbiased methods that can reliably and quickly give an overview of someone’s creditworthiness without actually pulling their credit report. Furthermore, low-income and young individuals often do not have a credit score because of factors such as a lack of income history or improper documentation. In the US alone, over 25 million Americans were credit invisible in 2015 according to the CFPB [1]. A system of this kind would allow financial institutions to quickly judge the risk associated with lending to an individual. It would also give individuals an opportunity to monitor their credit score and take steps to improve it. Hence, we plan to build a credit score predictor to tackle this challenge.

We will use the data provided in [2] to build and test our models. The biggest difference between our methodology and that of typical credit score calculators, such as FICO, is that we plan to consider the parameters that any person can furnish irrespective of their income and credit history. In this framework, our input data X will be parameters describing the person’s financial status, such as current income, marital status, number of dependents, and real estate ownership, among others. The output data Y will be the predicted credit score of the individual. As our data set does not have predefined labels, we will try to correlate our feature vectors to identify a pseudo-label vector that can effectively distinguish between individuals with good and bad credit. We also plan to test skewed subsets of our data, in which the large majority of samples carry a single label, to see which method works best for such datasets.
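As a rough illustration of the skewed-subset experiment, the sketch below draws a subset in which a chosen fraction of samples carries one label. It is only a minimal example on toy data: the helper name `make_skewed_subset`, its parameters, and the 90/10 split are assumptions made for illustration and are not part of the project code.

```python
import random
from collections import Counter


def make_skewed_subset(rows, labels, majority_label, majority_fraction, size, seed=0):
    """Sample `size` rows so that `majority_fraction` of them carry `majority_label`.

    Hypothetical helper: rows and labels are parallel lists.
    """
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == majority_label]
    minor = [i for i, y in enumerate(labels) if y != majority_label]

    n_major = round(size * majority_fraction)
    n_minor = size - n_major
    if n_major > len(major) or n_minor > len(minor):
        raise ValueError("not enough rows to build the requested subset")

    # Sample without replacement from each group, then shuffle the order.
    chosen = rng.sample(major, n_major) + rng.sample(minor, n_minor)
    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data (assumed): 200 rows with balanced pseudo-labels 1 (good) and 0 (bad).
rows = [[i, i % 3] for i in range(200)]
labels = [1 if i % 2 == 0 else 0 for i in range(200)]

subset_rows, subset_labels = make_skewed_subset(
    rows, labels, majority_label=1, majority_fraction=0.9, size=50
)
print(Counter(subset_labels))
```

Varying `majority_fraction` would let each model be compared across several degrees of imbalance on the same underlying data.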
We also plan to try different feature selection methods, such as the F-score, which ranks features by how well they separate the data, and, if time permits, more sophisticated methods such as genetic algorithms. Several works show that correct feature selection greatly improves the efficiency of the model [3-7]. Hence, we believe that devoting resources to finding the optimum set of features will help increase the efficiency of the model we create. We shall also try to tackle the issue of dataset imbalance, since the literature shows that certain models perform poorly on imbalanced datasets [8].

We shall use several models to predict the credit score: linear regression, logistic regression, Naïve Bayes, k-Nearest Neighbors (k-NN), Decision Trees (DTs), and Support Vector Machines (SVMs). We expect a regression plot for the regression models and probability curves for the classification models. To evaluate our models, we plan to use metrics such as Percentage Correctly Classified (PCC), Sensitivity/Recall, Type I Error, Type II Error, and the Receiver Operating Characteristic (ROC) curve, among others [9].

Finally, we hope to build a kit that can identify the optimum algorithm and its parameters based on the input data provided. This will help automate the entire process for new sets of data and allow such an approach to be used on real problems.

The members of our team are Aditya Rathi, Rishabh Pahuja and Vatsal Joshi.
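The F-score ranking and the classification metrics named above can be sketched in a few lines. This is a minimal example under stated assumptions, not the project's implementation. The text does not say which F-score definition is intended, so the code assumes the common two-class form (squared distance of each class mean from the overall mean, divided by the sum of the within-class sample variances). It also assumes that Type I error means the false positive rate and Type II error the false negative rate, with label 1 as the positive class. The function names and the toy features `income` and `dependents` are hypothetical.

```python
from statistics import mean, variance


def f_score(feature, labels, positive=1):
    """Two-class F-score of one feature (assumed definition, see lead-in)."""
    pos = [x for x, y in zip(feature, labels) if y == positive]
    neg = [x for x, y in zip(feature, labels) if y != positive]
    overall = mean(feature)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within if within > 0 else float("inf")


def _rate(numerator, denominator):
    # Avoid ZeroDivisionError when a class is absent from the data.
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred, positive=1):
    """PCC, sensitivity, and Type I / Type II error rates from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "pcc": _rate(tp + tn, len(pairs)),
        "sensitivity": _rate(tp, tp + fn),
        "type_i_error": _rate(fp, fp + tn),   # false positive rate
        "type_ii_error": _rate(fn, fn + tp),  # false negative rate
    }


# Toy data (assumed): income separates the two classes, dependents does not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
income = [10, 12, 11, 13, 30, 32, 31, 33]
dependents = [1, 2, 1, 2, 1, 2, 1, 2]

scores = {
    "income": f_score(income, labels),
    "dependents": f_score(dependents, labels),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)

y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = classification_metrics(labels, y_pred)
print(metrics)
```

A higher F-score suggests a more discriminative feature, but the score treats each feature independently and so cannot capture interactions between features, which is one reason to also consider search-based methods such as genetic algorithms.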

Python