EDA – Exploratory Data Analysis

Introduction

In the US in 2023, about 1% of mortgages were considered sub-prime (down from 2% in 2022), according to the OCC Mortgage Metrics Report published by the Office of the Comptroller of the Currency. Total residential mortgage debt was about 12 trillion USD, so these “bad loans” represent about 120 billion USD. According to Global Credit Data, a private organisation that collects Loss Given Default (LGD) data from a wide range of banks, the recovery rate in North America in 2023 was 84%, i.e. 84% of the value of a loan could be recovered after a default. The unrecovered 16% of 120 billion USD puts the total cost of defaulting loans to the industry in the range of 20 billion USD per year. In addition to the direct loss of profit, banks are also affected by loan default rates because their loss given default (LGD = 1 - recovery rate) is a key measure of how much capital a bank must hold in reserve on its balance sheet. Defaulting loans therefore represent a very significant cost to banks. Finally, a bank with a lower default rate will earn better profits from home loans and can therefore potentially lend money at a lower interest rate, increasing its share of the home loan market.

It is therefore of triple interest to reduce the number of defaulting loans without diminishing the number of loans granted. This calls for a more precise approval process and the elimination of human bias and erroneous judgment. It is important to note that the approval process is constrained by the Equal Credit Opportunity Act, which makes it unlawful to base such a process on factors such as race, color, religion, national origin or sex.

Objective

The objective of this project is to use the Home Equity Dataset (HMEQ) to build a model that is able to predict a potentially defaulting loan application with precision. More weight will be given to incorrect predictions that a loan will be repaid than to incorrect predictions that a loan will default, i.e. we will value loss reduction over revenue increase. Predictions need to be interpretable in order to (i) prove compliance with the Equal Credit Opportunity Act and (ii) be able to justify a rejection.

The key questions for this study are:

What are the main factors that contribute to loan defaulting?
Are we able to predict with good precision whether a particular loan application should be rejected?

The data set

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.

BAD (target) 1 = Client defaulted on loan, 0 = loan repaid
LOAN (numerical) Amount of loan approved.
MORTDUE (numerical) Amount due on the existing mortgage.
VALUE (numerical) Current value of the property.
REASON (nominal) Reason for the loan request (HomeImp = home improvement, DebtCon = debt consolidation, which means taking out a new loan to pay off other liabilities and consumer debts).
JOB (nominal) The type of job that the loan applicant has, such as manager, self-employed, etc.
YOJ (numerical) Years at present job.
DEROG (ordinal) Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ (ordinal) Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE (numerical) Age of the oldest credit line in months.
NINQ (ordinal) Number of recent credit inquiries.
CLNO (ordinal) Number of existing credit lines.
DEBTINC (numerical) Debt-to-income ratio (all monthly debt payments divided by gross monthly income). This number is one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow.

Of the 13 columns (12 input features plus the target), 11 have missing values. The debt-to-income ratio has the most missing values, at about 20%. Although this is not a very high share, dropping every row with a missing value is not the best strategy.

Imputation seems a much better strategy. Spoiler alert: we’ll use KNN for imputation, but we’ll store the information that a value was missing in a new feature per column, so that the classification algorithms can take that into consideration.
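The imputation idea can be sketched as follows. This is a minimal illustration and not the project's actual code: the helper name impute_with_flags, the "_missing" suffix, the neighbour count and the toy values are all assumptions, and only numeric columns are handled (nominal columns such as JOB would need to be encoded first).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_with_flags(df: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """KNN-impute numeric columns and add a <col>_missing flag per incomplete column."""
    out = df.copy()
    cols_with_na = [c for c in df.columns if df[c].isna().any()]
    # Record where values were missing before they are filled in.
    for col in cols_with_na:
        out[f"{col}_missing"] = df[col].isna().astype(int)
    # Fill the gaps from the nearest rows (NaN-aware Euclidean distance).
    imputer = KNNImputer(n_neighbors=n_neighbors)
    out[df.columns] = imputer.fit_transform(df)
    return out


# Toy stand-in for a few numeric HMEQ columns (values are made up).
toy = pd.DataFrame({
    "LOAN": [1100.0, 1300.0, 1500.0, 1700.0, 1800.0, 2000.0],
    "VALUE": [39025.0, 68400.0, np.nan, 112000.0, 40320.0, 62250.0],
    "DEBTINC": [34.8, np.nan, 36.1, np.nan, 33.2, 37.1],
})
imputed = impute_with_flags(toy, n_neighbors=2)
print(imputed)
```

KNNImputer measures distances on the raw values, so in practice it is usually worth scaling the features before imputing; otherwise large-valued columns such as VALUE dominate the neighbour search.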

The data set is also somewhat imbalanced, with 20% defaulting loans.

Univariate analysis

As one can see above:

Most loans are used for debt consolidation; however, the defaulting percentage seems relatively similar for both reasons
Most applicants have “other” as job description; it is not obvious from the visualisation whether any of the other job categories is linked to higher default rates
Most applicants have no derogatory reports or delinquent credits. However, even with no derogatory reports, some credits default.
NINQ (number of recent credit inquiries) and DELINQ (number of delinquent credit lines) seem to be exponentially distributed
CLNO (number of credit lines) seems to be normally distributed with long tails

The mortgage due has a median of 65,000 USD and an average of 74,000 USD. Clearly, the feature is not normally distributed: it is right-skewed (the mean lies above the median) and has a significant number of outliers.

The debt-to-income ratio has an average and median of 34% and outliers at both the low and the high end. There are only very few records above 100%. The distribution seems bimodal; however, that requires a more in-depth investigation.

When plotting DEBTINC for the two classes, we can clearly see that non-defaulting loans have a debt-to-income ratio of less than 45% (with a median of 35%). All values of the debt-to-income ratio beyond 45% correspond to defaulting loans. Defaulting loans are roughly normally distributed around 38%, with a long tail towards high values.

Applicants had a median of 7 years and an average of 9 years on the job. There are some outliers at the high end (with up to 40 years on the job). The highest peak, however, is for applicants who have recently started a new job.

The loan amounts are quite small, with an average of 16,000-19,000 USD. Even the upper whisker value of just over 40,000 USD is very small. The distribution might be normal with a very long tail.

Regarding the age of the oldest credit line, the distribution is potentially bimodal, with one peak at around 100 months and another at the mean and median of 175 months. Interestingly, defaulting loans peak at 100 months; risk seems to decrease with longer-standing credit, likely because once the loan repayment is established, it is easier to continue until the end. Please note that some credit lines have extreme ages of 50 years; records with credit lines of around 100 years are most likely faulty and should be removed from the data set (or grouped together in a “very long credit lines” category).

The current value of the property is again unreasonably small; most likely the data set is very outdated. A median of 90,000 USD and an average of 100,000 USD are surely not realistic anymore. The distribution is probably close to normal with a long tail towards high (but not very high) values. The maximum is 855,000 USD. Defaulting and non-defaulting loans have roughly the same distribution, which suggests that the property value is not a good indicator of loan risk.

In order to test whether the years on the job and the number of credit inquiries are exponentially distributed, we plot the frequencies on a logarithmic axis and fit a regression line through the data. As can be seen above, both distributions are compatible with the assumption that they are exponential in nature. At the high end of both distributions, there are some deviations.
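A log-linear check of this kind could look like the sketch below. It rests on the fact that an exponential distribution has a density proportional to exp(-x/scale), so the logarithm of the bin counts should fall on a straight line with negative slope. The helper name log_linear_fit, the bin count and the synthetic sample are assumptions for illustration; the project's own plots may have been produced differently.

```python
import numpy as np


def log_linear_fit(values, bins=15):
    """Fit a straight line to log(bin count) versus bin centre.

    Returns (slope, intercept, r_squared). For exponentially distributed
    data the points lie close to a line and the slope is negative.
    """
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0  # the logarithm is undefined for empty bins
    x, y = centers[mask], np.log(counts[mask])
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r_squared = 1.0 - residuals.var() / y.var()
    return slope, intercept, r_squared


# Synthetic stand-in for a feature such as YOJ (not the real data).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=9.0, size=5000)
slope, intercept, r2 = log_linear_fit(sample)
print(f"slope={slope:.3f}, r^2={r2:.3f}")
```

Sparse bins in the tail carry few observations and are noisy on a log scale, which is one plausible reason for deviations at the high end.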

Similarly, one might think that the number of delinquent credit lines and the number of derogatory reports are distributed according to a power law, but the log-log plot suggests a more complex relationship.

Clearly, the number of delinquent credit lines and, to a lesser extent, the number of derogatory reports are correlated with loan defaulting. Both have “threshold values” beyond which only one class is represented. In this data set, every applicant with more than 5 delinquent credit lines defaulted (again), and so did every applicant with more than 7 derogatory reports.
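Such thresholds can be read off a contingency table of the feature against the target. The sketch below is illustrative only: the helper names and the toy records are made up, and BAD is assumed to be coded as 0/1 as in the data dictionary.

```python
import pandas as pd


def default_rate_by_level(df, feature, target="BAD"):
    """Counts of repaid (0) and defaulted (1) loans per feature level."""
    table = pd.crosstab(df[feature], df[target])
    table = table.reindex(columns=[0, 1], fill_value=0)
    table["default_rate"] = table[1] / (table[0] + table[1])
    return table


def pure_default_threshold(table):
    """Smallest level from which every level upwards contains only defaulters.

    Returns None when the highest level still contains repaid loans.
    """
    threshold = None
    for level in sorted(table.index, reverse=True):
        if table.loc[level, 0] == 0:
            threshold = level
        else:
            break
    return threshold


# Toy records, not the HMEQ data.
toy = pd.DataFrame({
    "DELINQ": [0, 0, 0, 1, 1, 2, 2, 6, 7],
    "BAD":    [0, 0, 1, 0, 1, 0, 1, 1, 1],
})
table = default_rate_by_level(toy, "DELINQ")
print(table)
print("only defaulters from level:", pure_default_threshold(table))
```

Levels above such a threshold usually hold very few records, so the observed 100% default rate is a property of the sample rather than a guarantee.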

Bi-variate analysis

Nominal features

The proportion of defaulting loans is slightly higher for home improvement loans than for debt consolidation loans. Salespeople have the highest proportion of defaulting loans, followed by professionals and managers. Office workers (with relatively safe jobs) have the lowest proportion of defaulting loans.

Ordinal features

DEROG, DELINQ and NINQ were discussed already. For the number of credit lines, we have threshold values both for very small and very large values. Loan holders with very few credit lines are almost sure to default; this probably corresponds to persons with weaker finances. At the other end of the spectrum, all loans in this data set for persons with a very large number of credit lines (>57) defaulted. This probably corresponds more to “gamblers”, i.e. persons who are living beyond their means and try to finance this with lots of credit.

Numerical features

Loan defaulters have on median:

smaller, younger loans
less mortgage due
a lower debt-to-income ratio
lower-value property
fewer years on their current job

Cramér’s V between JOB and REASON is 0.15. Cramér’s V ranges from 0 (no association) to 1 (perfect association), so a value this low indicates little association between the two features, and they can be treated as approximately independent.
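Cramér’s V is derived from the chi-squared statistic of the contingency table: V = sqrt(chi2 / (n * (min(rows, columns) - 1))). A minimal sketch using SciPy follows; the helper name cramers_v and the toy values are assumptions, and no bias correction is applied.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    # correction=False: Yates' continuity correction would distort V for 2x2 tables.
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:  # one of the variables has a single level
        return float("nan")
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy values, not the HMEQ data.
toy = pd.DataFrame({
    "JOB": ["Office", "Office", "Sales", "Sales", "Mgr", "Mgr", "Office", "Sales"],
    "REASON": ["DebtCon", "HomeImp", "DebtCon", "HomeImp",
               "DebtCon", "HomeImp", "DebtCon", "DebtCon"],
})
v = cramers_v(toy["JOB"], toy["REASON"])
print(f"Cramér's V = {v:.2f}")
```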

The Spearman correlation matrix between the ordinal features is depicted on the right.

As can be seen, none of the ordinal features are closely correlated.

From the Pearson correlation matrix, we learn:

There is a strong correlation between the mortgage due and the value of the property. Keeping both features makes the coefficients of a regression model unstable. If we wanted to use the regression coefficients to understand the importance of features, we should remove one of the columns. However, decision trees are not affected by multicollinearity, and therefore there is no need to remove the feature.
There are no strong correlations between the other numerical variables
The strongest of these remaining correlations (none exceeds 0.23) are between:
- LOAN and VALUE (indicating that there are properties which are largely financed by loans)
- LOAN and MORTDUE (so a higher willingness to take on debt is correlated with a high loan amount)
- CLAGE and YOJ (so keeping the same job is correlated with the age of the credit line – possibly indicating that people stay in their jobs to pay back the loans)
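Ranking the pairs of a correlation matrix can be automated, as in the sketch below. The helper strongest_pairs and the synthetic columns are assumptions for illustration; passing method="spearman" gives the rank-based variant used for the ordinal features.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, method: str = "pearson", top: int = 3) -> pd.Series:
    """Feature pairs sorted by absolute correlation, strongest first."""
    corr = df.corr(method=method)
    # Keep each pair once: the upper triangle without the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Synthetic stand-in: VALUE is built to track MORTDUE, LOAN is independent.
rng = np.random.default_rng(0)
mortdue = rng.normal(70000, 20000, 200)
toy = pd.DataFrame({
    "LOAN": rng.normal(18000, 5000, 200),
    "MORTDUE": mortdue,
    "VALUE": 1.3 * mortdue + rng.normal(0, 5000, 200),
})
top = strongest_pairs(toy, top=2)
print(top)
```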

Observations of the pairplot

Mortgage due and property value are the only features that are obviously correlated
We can confirm the other correlations from the Pearson correlation, but clearly the weak correlations we observed before are difficult to identify visually
Given that these correlations are not particularly strong, there is no need to reduce the number of features at this stage.

Testing statistical relevance

Categorical features

For both nominal and ordinal features we use the chi-squared test. For this, we calculate the contingency table of each feature against the target and then conduct the test.

The result of the test is that all categorical variables ([‘REASON’, ‘JOB’, ‘DEROG’, ‘DELINQ’, ‘NINQ’, ‘CLNO’]) are statistically significant. However, there is a caveat: as seen above, there are values of the ordinal features for which only one class is represented. In those cases, the assumptions of the chi-squared test are not met, because the expected frequency in those cells is smaller than 5.
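The test and the check of its expected-frequency assumption can be sketched as follows. The helper name chi2_test, the threshold parameter and the toy counts are assumptions for illustration, not the project's code.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_test(df, feature, target="BAD", min_expected=5.0):
    """Chi-squared test of independence between a categorical feature and the target."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        # Rule of thumb: every expected cell frequency should be at least 5.
        "assumption_ok": bool((expected >= min_expected).all()),
    }


# Toy counts, not the HMEQ data.
toy = pd.DataFrame({
    "JOB": ["Office"] * 50 + ["Sales"] * 50,
    "BAD": [0] * 40 + [1] * 10 + [0] * 20 + [1] * 30,
})
result = chi2_test(toy, "JOB")

# Adding a rare level with only defaulters breaks the assumption.
rare = pd.DataFrame({"JOB": ["Self"] * 2, "BAD": [1, 1]})
sparse_result = chi2_test(pd.concat([toy, rare], ignore_index=True), "JOB")
print(result)
print(sparse_result)
```

When the assumption fails, one common remedy is to merge sparse levels (for example, grouping all DELINQ values above a cut-off) before testing.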

Numerical features

As we have seen above, none of the features is really normally distributed, so conducting a t-test is not appropriate. Therefore we conduct a Mann-Whitney U test, which compares the two classes without assuming normality. According to the Mann-Whitney U test, the significant numerical features are: [‘LOAN’]. The non-significant numerical features are: [‘MORTDUE’, ‘DEBTINC’, ‘YOJ’, ‘CLAGE’, ‘VALUE’].
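A per-feature Mann-Whitney U test could be run as in the sketch below. The helper name, the two-sided alternative and the synthetic data are assumptions; the source does not state which options were used.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def mann_whitney_by_class(df, features, target="BAD"):
    """Two-sided Mann-Whitney U p-value per feature, repaid vs. defaulted loans."""
    p_values = {}
    for col in features:
        repaid = df.loc[df[target] == 0, col].dropna()
        defaulted = df.loc[df[target] == 1, col].dropna()
        _, p = mannwhitneyu(repaid, defaulted, alternative="two-sided")
        p_values[col] = float(p)
    return p_values


# Synthetic stand-in: LOAN differs between the classes, YOJ does not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BAD": [0] * 200 + [1] * 200,
    "LOAN": np.concatenate([rng.normal(20000, 3000, 200),
                            rng.normal(15000, 3000, 200)]),
    "YOJ": rng.exponential(9.0, 400),
})
p_values = mann_whitney_by_class(toy, ["LOAN", "YOJ"])
significant = [col for col, p in p_values.items() if p < 0.05]
print(p_values, significant)
```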

Multivariate analysis

First characterisation of defaulting loans using PCA

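We will try to find some characteristics of the loan defaulting class. PCA requires the categorical features to be encoded and all features to be scaled; a StandardScaler is used for the scaling. Each principal component is then read as a “persona” based on its largest loadings: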

PC1 is a professional with a high-value property, a high outstanding mortgage, a number of credit lines and a history of credits
PC2 is a salesperson with major derogatory reports and a somewhat higher number of credit inquiries
PC3 is (probably) a manager with a lot of derogatory reports and a higher number of credit inquiries
PC4 is (probably) an office worker, manager or self-employed
PC5 is self-employed or a manager for years, with old credits and a low debt-to-income ratio, using the credit for home improvements. Likely a contractor that suddenly goes out of business
PC6 is self-employed, “other” or a manager who is new on the job; probably the job didn’t work out
PC7 is a long-employed office worker or manager with a high, old loan. Possibly older people suddenly getting laid off.
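The loadings behind such personae could be obtained along the following lines. This is a sketch under stated assumptions: one-hot encoding via pandas.get_dummies, fitting on the defaulting records only, and random toy data; the source confirms only that a StandardScaler was used.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_loadings(df: pd.DataFrame, n_components: int):
    """One-hot encode, standardise, fit PCA; return loadings and variance ratios."""
    encoded = pd.get_dummies(df, dtype=float)
    scaled = StandardScaler().fit_transform(encoded)
    pca = PCA(n_components=n_components).fit(scaled)
    loadings = pd.DataFrame(
        pca.components_.T,
        index=encoded.columns,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    return loadings, pca.explained_variance_ratio_


# Random toy data, not the HMEQ records.
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "BAD": rng.integers(0, 2, n),
    "VALUE": rng.normal(100000, 20000, n),
    "MORTDUE": rng.normal(70000, 15000, n),
    "DEROG": rng.integers(0, 4, n),
    "JOB": rng.choice(["Office", "Sales", "Mgr"], n),
})
defaulters = toy[toy["BAD"] == 1].drop(columns="BAD")
loadings, ratio = pca_loadings(defaulters, n_components=2)

# The features with the largest absolute loadings characterise each component.
print(loadings["PC1"].abs().sort_values(ascending=False).head(3))
```

The sign of a component is arbitrary, so a persona should be read from which features load together, not from whether the loadings are positive or negative.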

Conclusions

1) This analysis is based on the Home Equity dataset (HMEQ) with about 6,000 records of (unclean) data with 12 features. There is a significant amount of missing data (up to 21% for the debt-to-income ratio). We discarded about 5% of records because more than 3 features were missing for the same record. The remaining missing values were imputed using a k-nearest-neighbour imputer. Outliers were clipped to the whisker limits (1.5 IQR beyond the quartiles) and marked in separate features to allow models to take that into consideration.
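The row filter and the outlier treatment can be sketched as below. The helper names, the "_outlier" suffix and the toy values are assumptions, and the clipping limits are taken to be Q1 - 1.5 IQR and Q3 + 1.5 IQR.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df: pd.DataFrame, max_missing: int = 3) -> pd.DataFrame:
    """Drop records that have more than max_missing missing features."""
    return df[df.isna().sum(axis=1) <= max_missing].copy()


def clip_outliers_iqr(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] and flag the clipped records."""
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        # Flag first, so the information survives the clipping.
        out[f"{col}_outlier"] = ((out[col] < lower) | (out[col] > upper)).astype(int)
        out[col] = out[col].clip(lower, upper)
    return out


# Toy values: the last row misses 4 features, one LOAN value is extreme.
toy = pd.DataFrame({
    "LOAN":    [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0, 12.0],
    "MORTDUE": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, np.nan],
    "VALUE":   [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan],
    "YOJ":     [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, np.nan],
    "CLAGE":   [90.0, 95.0, 100.0, 105.0, 110.0, 115.0, 120.0, np.nan],
})
kept = drop_sparse_rows(toy, max_missing=3)
clipped = clip_outliers_iqr(kept, ["LOAN"])
print(clipped[["LOAN", "LOAN_outlier"]])
```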

The data set is imbalanced with an 80:20 split between repaying and loan defaulting records.

2) Features are generally skewed, some of them heavily. Only some features have a distribution similar to a (skewed) normal distribution (VALUE, LOAN, CLAGE, MORTDUE, CLNO). DEBTINC (and possibly VALUE and CLAGE) might be distributed according to a bi-modal normal distribution. YOJ and NINQ seem to be exponentially distributed.

3) In features like derogatory reports, delinquent credit lines and number of credit inquiries, there are threshold values beyond which we only have records corresponding to loan defaulters. For the number of credit lines, both very low and very high numbers have a loan default majority class, indicating two different clusters of clients.

4) There is a strong correlation between mortgage due and property value. We decided to keep both features, sacrificing the interpretability of the coefficients of the LogisticRegression (but not its power of prediction). Tree models are not affected by multicollinearity.

5) A statistical test was conducted on features; all categorical features were found to be relevant for the target feature. For the numerical features, only the loan amount was deemed statistically significant.

6) On a macro scale, loan defaulters have smaller, younger loans, less mortgage due, a lower debt-to-income ratio and fewer years on the job. This seems to point to a group of loans corresponding to less well-off persons with less job stability.

7) Using PCA, we were able to create 7 “personae” to describe defaulting loans in more granularity.

The code is available on GitHub.