Contains the notebooks for the first of two projects I undertook during my summer internship at Andrew Davidson & Co. Working under the supervision of AD-Co's Behavior Modeling team, I analyzed Freddie Mac's single-family loan-level datasets, focusing on the years following the Subprime Mortgage Crisis. I implemented a random decision forest in scikit-learn to classify month-to-month delinquency and turnover risk in mortgages.
Some Financial Background:
Companies that buy and sell mortgage-backed securities have a considerable interest (get it?) in evaluating the risk of these assets, each of which may be made up of a myriad of individual home loans. The two main categories of mortgage credit risk are (i) delinquency, when the borrower stops paying, and (ii) early termination, when the borrower pays off the mortgage earlier than expected (thus closing off the lender's source of fixed income, viz. interest payments). The latter typically takes one of two forms: refinance (where the borrower takes out a new mortgage on more favorable terms to pay off the first) and turnover (where the borrower simply moves and sells the home).
In general, refinance is the easiest form of credit risk to predict: assuming people behave at least semi-rationally, when interest rates drop, they'll refinance. [Hence the seemingly backwards terminology of mortgages: from the mortgage holder's perspective, 'premium' is a lower-interest, lower-risk loan while 'discount' is a higher-interest, riskier loan.] Delinquency is somewhat harder to predict, but credit score, which borrowers are required to disclose to the government loan agencies, serves as a rough estimator of the risk of borrower default. Turnover, however, is rather difficult: the decision to move to a new house is a complicated one that can arise from any number of factors, few of which are immediately apparent in the data.
The Approach:
I chose to work with a Random Decision Forest classifier (RF) for several reasons. First, AD-Co's proprietary loan risk model is built on a logistic regression, and an RF algorithm is different enough to offer an alternative perspective on the problem. Second, I wanted to experiment with adding new, unconventional predictors to the model, and RFs are generally resilient to extra variables (e.g., multicollinearity is not a major hazard). An RF makes few statistical assumptions about the dataset, if any, and can handle 'raw' variables. Logistic regressions, on the other hand, assume a linear response from their predictors: the log-odds of the outcome are modeled as a linear function of the features, which in classification terms means a linear decision boundary in the feature space. Realizing a logistic regression's true utility therefore requires careful feature engineering. By contrast, for a novice in the domain of mortgages like me, the RF model makes for a gentler introduction. Third, the RF as implemented in scikit-learn comes with a feature importance metric, which is useful for gaining insight into which predictors matter. Lastly, on the level of intuition, I suspect that the behavior that goes into default, turnover, and refi resembles a decision tree on some cognitive level. Perhaps this means that a decision forest is well suited to this type of problem.
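To make the feature-importance point concrete, here is a minimal sketch of fitting a scikit-learn random forest and reading off its importances. It is not the code from the notebooks: the data are synthetic, the column names (`credit_score`, `rate_incentive`, `loan_age_months`, `noise`) are hypothetical stand-ins for loan-level predictors, and the "delinquent" label is manufactured from a simple credit-score threshold purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical, synthetic predictors (NOT Freddie Mac data).
feature_names = ["credit_score", "rate_incentive", "loan_age_months", "noise"]
X = np.column_stack([
    rng.normal(720, 50, n),    # credit score
    rng.normal(0.0, 1.0, n),   # note rate minus market rate, in percentage points
    rng.integers(1, 120, n),   # loan age in months
    rng.normal(0.0, 1.0, n),   # pure noise, to mimic an uninformative extra variable
])

# Synthetic label (an assumption for illustration only): "delinquent" when the
# credit score is below 680, with 5% of labels flipped at random.
y = (X[:, 0] < 680).astype(int)
flip = rng.random(n) < 0.05
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Raw, unscaled features go straight into the forest.
clf = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=5, random_state=0, n_jobs=-1
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
# Impurity-based importances; they are non-negative and sum to 1.
importances = dict(zip(feature_names, clf.feature_importances_))

print(f"held-out accuracy: {accuracy:.3f}")
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: {value:.3f}")
```

On this toy data the forest should rank `credit_score` well above the other columns, since it is the only one that drives the label. Note that `feature_importances_` is impurity-based and can be biased toward high-cardinality features, so on real loan data it is better read as a rough guide than as a definitive ranking.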