Classification – modelling loan defaults

Executive summary

In the US in 2023, about 1% of mortgages were considered sub-prime (down from 2% in 2022), according to the Mortgage Metrics Report of the Office of the Comptroller of the Currency (OCC). Total residential mortgage debt was about 12 trillion USD, so “bad loans” represent about 120 billion USD. According to Global Credit Data, a private organisation that holds the key “Loss Given Default” (LGD) metric of a wide range of banks, the recovery rate in North America in 2023 was 84% (https://globalcreditdata.org/wp-content/uploads/2023/06/GCD-CRE-RR-Report-2023.pdf), i.e. 84% of the value of a loan could be recovered after a default. The remaining 16% is lost, so the total cost to the industry of defaulting loans is in the range of 20 billion USD per year. In addition to the direct loss of profit, banks (and non-bank lenders) are affected by default rates because their loss given default is a key determinant of how much capital they must hold in reserve on their balance sheet, and immobilising capital has a cost. Finally, a bank with a lower default rate will earn better profits from home loans and can therefore potentially lend at a lower interest rate, increasing its share of the home loan market.
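The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic using the rounded figures quoted in the text; like the text, it treats every sub-prime loan as a bad loan, and the variable names are illustrative.

```python
# Rounded inputs quoted in the text (not precise market data)
total_mortgage_debt_usd = 12e12   # about 12 trillion USD
subprime_share = 0.01             # about 1% of mortgages in 2023
recovery_rate = 0.84              # share of loan value recovered after a default

# Value of "bad loans"
bad_loans_usd = total_mortgage_debt_usd * subprime_share

# The loss is the share of the loan value that is NOT recovered
loss_share = 1 - recovery_rate
estimated_loss_usd = bad_loans_usd * loss_share

print(f"Bad loans:      {bad_loans_usd / 1e9:.0f} billion USD")
print(f"Estimated loss: {estimated_loss_usd / 1e9:.1f} billion USD")
```

The result is a little over 19 billion USD, which the text rounds to "in the range of 20 billion USD".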

Taking the direct loss of profits, the need for reserves and the opportunity cost together, defaulting loans represent a very significant cost to banks.

Banks therefore have a threefold interest in reducing the number of defaulting loans without reducing the number of loans granted. This calls for a precise approval process that eliminates human bias and erroneous judgment. Note that the approval process is constrained by the Equal Credit Opportunity Act, which makes it unlawful to base an approval decision on factors such as race, color, religion, national origin or sex (see e.g. https://www.law.cornell.edu/uscode/text/15/1691).

This study builds and evaluates prototype models to predict and explain defaulting loans in an automated fashion. Its concrete aims are to
(i) Identify reasons for loan defaulting in terms of personae and factors
(ii) Build a classification model for accurate loan default prediction, to be considered as unbiased input into the loan approval process
Given the high cost of defaulting loans, we pay particular attention to defaulting loans that were predicted to be repaid. (We are therefore interested in minimising “false negatives”, which corresponds to maximising the recall metric.)
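As a reminder of what the recall metric measures, the sketch below computes it from raw counts, treating "default" as the positive class. The counts in the example are hypothetical and are not results from this study.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual defaults that the model flagged as defaults."""
    actual_defaults = true_positives + false_negatives
    if actual_defaults == 0:
        raise ValueError("recall is undefined when there are no actual defaults")
    return true_positives / actual_defaults


# Hypothetical test set: 240 defaults, 12 of them wrongly predicted "will repay"
example_recall = recall(true_positives=228, false_negatives=12)
print(f"Recall: {example_recall:.2f}")
```

Every false negative lowers recall, so maximising recall and minimising missed defaults are the same objective.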

It is important to note that this model only uses data permitted under the Equal Credit Opportunity Act of the United States. It takes into account the amount of the loan approved, the amount due on the existing mortgage, the current value of the property, the reason for the loan request, the type of job, years at the present job, the number of major derogatory reports, the number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due), the age of the oldest credit line in months, the number of recent credit inquiries, the number of existing credit lines and the debt-to-income ratio. The data set has almost 6000 entries, of which 20% are defaulted loans (significantly overrepresented compared with actual US loan default rates). It is not known from which period or region these records were extracted.

The quality of the data set is mixed, with up to 21% of values missing for the debt-to-income ratio. We removed the 5% of records with more than 3 missing features and imputed the remaining missing values. After experimenting with a manual strategy, we finally imputed them with a k-nearest-neighbour method, which produced distributions quite similar to those of the untreated data. Almost all features were strongly skewed with significant outliers. Most features were clipped at the “whisker values” (Q1 − 1.5·IQR and Q3 + 1.5·IQR), but for some features the bounds were determined manually after close analysis of the data. Features were encoded according to their type and scaled where necessary.
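The cleaning steps described above could look roughly like the following sketch, which uses NumPy and scikit-learn's KNNImputer on synthetic data. The number of neighbours, the order of the steps and the helper names drop_sparse_rows and clip_to_whiskers are assumptions made for illustration; the study's actual settings are not stated here.

```python
import numpy as np
from sklearn.impute import KNNImputer


def drop_sparse_rows(X: np.ndarray, max_missing: int = 3) -> np.ndarray:
    """Keep only the rows with at most `max_missing` missing values."""
    missing_per_row = np.isnan(X).sum(axis=1)
    return X[missing_per_row <= max_missing]


def clip_to_whiskers(X: np.ndarray, factor: float = 1.5) -> np.ndarray:
    """Clip each column to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    return np.clip(X, q1 - factor * iqr, q3 + factor * iqr)


# Synthetic stand-in for the numeric loan features (values are made up)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 5))  # skewed columns
X[rng.random(X.shape) < 0.1] = np.nan                   # roughly 10% missing cells

X_kept = drop_sparse_rows(X, max_missing=3)
# n_neighbors=5 is the library default, not a value taken from the study
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_kept)
X_clean = clip_to_whiskers(X_imputed)

print(X.shape, "->", X_clean.shape, "| missing left:", int(np.isnan(X_clean).sum()))
```

Comparing histograms of each column before and after imputation is one simple way to check that the imputed values did not distort the distributions.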

The study was able to identify 7 personae of loan defaulters (via PCA). The seven personae are:

professionals with high value property, high outstanding mortgage, a number of credit lines and history of credits
salespersons with major derogatory reports and a somewhat higher number of credit inquiries
managers with a lot of derogatory reports and a higher number of credit inquiries
office workers, managers or self-employed persons
long-standing self-employed persons or managers with old credit lines and a low debt-to-income ratio, using the credit for home improvements; likely contractors who suddenly ran out of business
self-employed persons, those with other occupations, or managers new to the job; probably the job did not work out
long-employed office workers or managers with a large, old loan; possibly older people who were suddenly laid off
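How the principal components were turned into personae is not detailed in this summary. One plausible reading, sketched below on synthetic data, is to fit a PCA on the standardised features of the defaulted loans and inspect the largest loadings of each component. The feature names are placeholders and the data are random, so the printed loadings carry no meaning.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder names; the real data set has 12 features
feature_names = [f"feature_{i}" for i in range(12)]

# Synthetic stand-in for the encoded features of the defaulted loans
rng = np.random.default_rng(42)
X_defaulters = rng.normal(size=(300, 12))

# PCA is scale-sensitive, so standardise first
X_scaled = StandardScaler().fit_transform(X_defaulters)
pca = PCA(n_components=7).fit(X_scaled)

# Describe each component by its three largest absolute loadings
for idx, component in enumerate(pca.components_, start=1):
    top = np.argsort(np.abs(component))[::-1][:3]
    described = ", ".join(f"{feature_names[j]} ({component[j]:+.2f})" for j in top)
    print(f"Component {idx}: {described}")

print("Explained variance:", round(float(pca.explained_variance_ratio_.sum()), 2))
```

Each component's dominant features and their signs would then be read as a description of one type of defaulter.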

For prediction, the following models were tried on an 80:20 split between training and test data:

Logistic regression
Decision Tree (with default values, optimised hyperparameters and using over- and undersampling)
Random Forest (with default values and optimised hyperparameters)
XGBoost (with optimised hyperparameters, undersampling and feature reduction)
LightGBM (with optimised hyperparameters)

Optimisation of hyperparameters was done using GridSearchCV, which includes cross-validation.
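The split and search procedure can be sketched as follows with scikit-learn. The synthetic data, the decision tree estimator, the parameter grid and the choice of recall as the scoring metric are assumptions for illustration, not the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 12 features, about 20% positives (defaults)
X, y = make_classification(
    n_samples=1000, n_features=12, weights=[0.8, 0.2], random_state=0
)

# 80:20 split, stratified so both parts keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid; every combination is scored with 5-fold cross-validation
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 10, 25]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",  # optimise for few missed defaults
    cv=5,
)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print(f"Cross-validated recall: {search.best_score_:.2f}, test recall: {test_recall:.2f}")
```

The test set is used only once, after the search, so the reported test recall is not influenced by the hyperparameter choice.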

Both optimised boosting models overfitted the training data in spite of our best efforts to choose appropriate hyperparameters. Even a reduction to only 5 of the 12 original features produced an overfitting XGBoost model (with 4 features we managed to avoid complete overfitting, but at the expense of significantly worse performance on the test set). The modelling capacity of these algorithms seems so high that overfitting is almost unavoidable on our small data set.
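A simple way to make this kind of overfitting visible is to compare the same metric on the training and test sets. The sketch below does this on synthetic data and, to stay free of extra dependencies, swaps in a scikit-learn decision tree for the boosting models; the helper name train_test_recall is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (flip_y adds label noise)
X, y = make_classification(
    n_samples=1000, n_features=12, weights=[0.8, 0.2], flip_y=0.05, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)


def train_test_recall(model):
    """Fit the model and return its recall on the training and test sets."""
    model.fit(X_train, y_train)
    train = recall_score(y_train, model.predict(X_train))
    test = recall_score(y_test, model.predict(X_test))
    return train, test


# A fully grown tree memorises the training data; a shallow tree cannot
deep_train, deep_test = train_test_recall(DecisionTreeClassifier(random_state=1))
shallow_train, shallow_test = train_test_recall(
    DecisionTreeClassifier(max_depth=3, random_state=1)
)

print(f"Unconstrained: train {deep_train:.2f}, test {deep_test:.2f}")
print(f"max_depth=3:   train {shallow_train:.2f}, test {shallow_test:.2f}")
```

A perfect training score combined with a clearly lower test score is the signature of the overfitting described above.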

The performance of the XGBoost model, however, proved very stable under hyperparameter optimisation, feature reduction and undersampling. Its ROC AUC score, a “quality score” of the model, was over 97%, and it produced 0 false “will repay” predictions for defaulting loans on the test data set. Although it overfits, we propose that this is still the best model for prediction. We suggest trialling it as secondary decision support in loan approvals and testing it on more data before attempting to fully automate the loan approval process.
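The two figures quoted above, the ROC AUC score and the number of false “will repay” predictions, can be computed with scikit-learn as in the sketch below. The four labels and predicted probabilities are a toy example, not output from the study's model, and the helper errors_at is hypothetical.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy example: 1 = default, 0 = repaid; scores are predicted default probabilities
y_true = [0, 0, 1, 1]
default_probability = [0.10, 0.40, 0.35, 0.80]

# ROC AUC is threshold-independent: it measures how well defaults are ranked
auc = roc_auc_score(y_true, default_probability)


def errors_at(threshold):
    """Return (false negatives, false positives) at a decision threshold."""
    y_pred = [int(p >= threshold) for p in default_probability]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fn), int(fp)


print(f"ROC AUC: {auc:.2f}")
for threshold in (0.5, 0.3):
    fn, fp = errors_at(threshold)
    print(f"threshold {threshold}: missed defaults = {fn}, wrongly refused = {fp}")
```

In this toy case, lowering the threshold removes the missed default at the price of one wrongly refused loan, which is the trade-off between false negatives and false positives.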

For feature importance analysis we used a standard method based on Shapley values. The top 5 features were:

New credits
High delinquency rate (and high derogatory reports)
High debt income ratio
A low loan value and a low property value
Either a very low or a very high number of credit lines
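To illustrate what a Shapley value is, the sketch below computes exact Shapley values for a single prediction by enumerating all feature coalitions. It is a pure-Python illustration on a hypothetical three-feature scoring function (toy_risk_score); the study presumably relied on a dedicated library, and this is not its implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, x, baseline):
    """Exact Shapley values of value_fn at x, relative to a baseline input."""
    n = len(x)

    def evaluate(coalition):
        # Features in the coalition take their real value, the rest the baseline
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return value_fn(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Weighted marginal contribution of feature i to coalition s
                phi[i] += weight * (evaluate(s | {i}) - evaluate(s))
    return phi


def toy_risk_score(features):
    """Hypothetical risk score with one interaction term (not the study's model)."""
    recent_inquiries, delinquent_lines, debt_to_income = features
    return (0.1 * recent_inquiries + 0.3 * delinquent_lines
            + 0.02 * debt_to_income + 0.05 * recent_inquiries * delinquent_lines)


applicant = [4, 2, 45]   # made-up applicant
reference = [1, 0, 30]   # made-up "typical" applicant
phi = shapley_values(toy_risk_score, applicant, reference)
print([round(v, 3) for v in phi])
```

The contributions add up to the difference between the applicant's score and the reference score, which is what makes Shapley values convenient for explaining individual predictions.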

Finally, we analysed the set of loans that did not default in spite of our contrary prediction. These loans represent 4% of the loans that could have been issued and are therefore a significant business opportunity, although they were not the primary focus of the study. These wrong predictions seem to correspond to a small group of persons who are newer in their office jobs, have limited financial resources and have had many difficulties repaying loans in the past, but who nevertheless made ends meet and repaid the loan. It will not be easy to distinguish them from similar persons whose loans do default. Perhaps an additional interview for persons with those characteristics could convert some of these leads into successful loan holders.

The code is available on GitHub.