Thesis (Selection of subject)

Binning numerical variables in credit risk models

Thesis title in Czech:	Diskretizace numerických proměnných v modelech kreditního rizika
Thesis title in English:	Binning numerical variables in credit risk models
Key words:	Kreditní riziko, diskretizace, strojové učení, výkonnost
English key words:	Credit risk, binning, machine learning, performance
Academic year of topic announcement:	2021/2022
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Economic Studies (23-IES)
Supervisor:	doc. PhDr. Jozef Baruník, Ph.D.
Author:	hidden - assigned by the advisor
Date of registration:	15.06.2022
Date of assignment:	15.06.2022
Date and time of defence:	21.09.2023 09:00
Venue of defence:	Opletalova, O105, room no. 105
Date of electronic submission:	30.07.2023
Date of proceeded defence:	21.09.2023
Opponents:	prof. PhDr. Petr Teplý, Ph.D.

References

de la Bourdonnaye, F. & F. Daniel (2021): “Evaluating categorical encoding methods on a real credit card fraud detection database.” arXiv abs/2112.12024.

Navas-Palencia, G. (2020): “Optimal binning: mathematical programming formulation.” arXiv abs/2001.08025. Available at http://arxiv.org/abs/2001.08025.

Potdar, K., T. Pardawala, & C. Pai (2017): “A comparative study of categorical variable encoding techniques for neural network classifiers.” International Journal of Computer Applications 175: pp. 7–9.

Putri, N. H., M. Fatekurohman, & I. M. Tirta (2021): “Credit risk analysis using support vector machines algorithm.” Journal of Physics: Conference Series 1836(1): p. 012039.

Sharma, D. (2011): “Evidence in favor of weight of evidence and binning transformations for predictive modeling.” SSRN Electronic Journal.

Weed, D. L. (2005): “Weight of evidence: A review of concept and methods.” Risk Analysis 25(6): pp. 1545–1557.

Preliminary scope of work in English

Modeling the probability of default (PD) is an essential part of banks’ risk management. Consequently, there have been many attempts to devise better strategies for accurately capturing the likelihood that a borrower will be unable to meet their loan obligations. This thesis aims to investigate whether categorizing numerical variables improves the performance of models estimating PD.

Apart from good performance, a desirable trait of credit risk models is interpretability. As a result, the use of machine learning models is limited, since most of them are difficult to explain. The categorization of numerical variables has therefore been gaining popularity, as it allows customers to be segregated into homogeneous bins. In addition, categorical variables are often handled using the Weight of Evidence (WoE) transformation, which conveniently treats missing values and outliers. Binning enables the use of this technique for numerical variables as well. Nevertheless, the effect of binning on a model’s ability to accurately estimate PD has yet to be examined.
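
As a minimal sketch of the WoE transformation mentioned above, the following Python snippet computes WoE per category as the natural logarithm of the share of non-defaulters over the share of defaulters falling into that category. This sign convention is one of two in common use and is an assumption here, as are the function name `weight_of_evidence`, the `MISSING` label, the additive `smoothing` constant, and the toy data. Missing values are simply treated as a category of their own.

```python
import math
from collections import Counter


def weight_of_evidence(categories, defaults, smoothing=0.5):
    """Return {category: WoE} for a binary flag (1 = default, 0 = no default).

    WoE = ln(share of non-defaulters in category / share of defaulters in category).
    `smoothing` is an assumed additive adjustment that avoids log(0) when a
    category contains no defaulters or no non-defaulters.
    """
    goods, bads = Counter(), Counter()
    for category, flag in zip(categories, defaults):
        key = "MISSING" if category is None else category  # missing gets its own category
        if flag == 1:
            bads[key] += 1
        else:
            goods[key] += 1

    keys = set(goods) | set(bads)
    total_good = sum(goods.values()) + smoothing * len(keys)
    total_bad = sum(bads.values()) + smoothing * len(keys)

    woe = {}
    for key in keys:
        good_share = (goods[key] + smoothing) / total_good
        bad_share = (bads[key] + smoothing) / total_bad
        woe[key] = math.log(good_share / bad_share)
    return woe


# Hypothetical toy data: employment status and observed default flag.
employment = ["employed", "employed", "employed", "student", "student", None, None, "employed"]
default = [0, 0, 0, 1, 0, 1, 1, 0]

woe_table = weight_of_evidence(employment, default)
# Replace each category by its WoE value to obtain a single numerical feature.
encoded = [woe_table["MISSING" if c is None else c] for c in employment]
```

Under this convention, a positive WoE marks a category with relatively fewer defaulters than the portfolio as a whole, and a negative WoE marks a riskier category.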

Given the rapid advancements in computer technology in recent years, machine learning (ML) methods have been extensively utilized across all fields, including credit risk modeling. Combined with the increasing availability of large amounts of data, this has made the development of PD models more quantitative than qualitative in nature. Therefore, apart from the essential need to transform categorical variables to ensure the feasibility of model estimation, the binning and subsequent encoding of numerical features has become common practice. This thesis intends to evaluate whether this process improves the performance of ML models.

To the best of our knowledge, the literature in this regard is relatively scarce. Nevertheless, Sharma (2011) finds evidence supporting the binning of numerical variables and subsequent WoE transformation for logistic regression and random forest models. This thesis will attempt to extend the analysis by employing additional ML methods and various types of categorical encoding.

Methodology

This thesis will employ the following machine learning methods to test the specified hypotheses: logistic regression, random forests, support vector machines, and artificial neural networks. In the credit risk industry, logistic regression is by far the most popular method due to its interpretability. However, as stated above, the advancements in computational power allow for more complex and demanding methods.

To measure the models’ performance, this thesis will utilize standard evaluation metrics, including the Area Under the Curve (AUC), the Kolmogorov-Smirnov statistic, and the F-score. For each machine learning method, three models will be estimated: one with raw numerical variables, one with binned numerical variables transformed using WoE, and one with binned numerical variables transformed using one-hot encoding. For all models, the optimal hyperparameters will be found using grid search and cross-validation.
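
To make the three evaluation metrics concrete, the sketch below implements them in plain Python for a vector of observed default flags and predicted PDs. It assumes that AUC refers to the area under the ROC curve and that the F-score is the F1 variant evaluated at a classification threshold of 0.5; the function names and the toy data are illustrative only, and in practice a library implementation would likely be used instead.

```python
def auc(y_true, scores):
    """Area under the ROC curve: probability that a randomly chosen defaulter
    receives a higher score than a randomly chosen non-defaulter (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """Largest gap between the empirical score distributions of the two classes."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def f1_score(y_true, scores, threshold=0.5):
    """Harmonic mean of precision and recall at an assumed threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: observed defaults and predicted PDs.
y_true = [0, 0, 1, 0, 1, 1]
pd_hat = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc_value = auc(y_true, pd_hat)
ks_value = ks_statistic(y_true, pd_hat)
f1_value = f1_score(y_true, pd_hat)
```

AUC and the Kolmogorov-Smirnov statistic depend only on how the model ranks borrowers, whereas the F-score also depends on the chosen threshold, which is why the threshold is exposed as a parameter.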

Regarding the binning of numerical variables, this thesis will utilize the optimal binning algorithm devised by Navas-Palencia (2020), which is implemented in the Python library OptBinning. Subsequently, the categorized variables will be transformed using the weight of evidence transformation as well as the dummy variable transformation, and the results will be compared.
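
The sketch below illustrates the general bin-then-encode pipeline on a single numerical variable. Note that it uses simple equal-frequency binning as a stand-in: it is not the optimal binning algorithm of Navas-Palencia (2020) and does not use the OptBinning API. The function names, the number of bins, and the toy data are assumptions made purely for illustration. Each observation is assigned to a bin (with a separate bin for missing values) and the bin index is then expanded into dummy variables.

```python
import bisect


def equal_frequency_edges(values, n_bins):
    """Interior cut points placing roughly the same number of observations in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[(k * n) // n_bins]
        if not edges or edge > edges[-1]:  # skip duplicate cut points
            edges.append(edge)
    return edges


def assign_bin(value, edges):
    """Bin index for a value; missing values (None) go to a dedicated last bin."""
    if value is None:
        return len(edges) + 1
    return bisect.bisect_right(edges, value)


def one_hot(bin_index, n_columns):
    """Dummy-variable encoding: a single 1 in the column of the assigned bin."""
    return [1 if j == bin_index else 0 for j in range(n_columns)]


# Hypothetical toy data: one numerical feature with a missing value.
income = [12, 18, 25, 31, 40, 55, 70, 95, None]

observed = [v for v in income if v is not None]
edges = equal_frequency_edges(observed, n_bins=4)
bins = [assign_bin(v, edges) for v in income]

n_columns = len(edges) + 2  # regular bins plus one bin for missing values
dummies = [one_hot(b, n_columns) for b in bins]
```

The WoE variant of the pipeline would replace the final step by mapping each bin index to a single number, so the feature stays one-dimensional, while the dummy variant creates one column per bin.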

Lastly, the models will be estimated and evaluated on the publicly available data set published by the Home Credit Group, which contains loan-level application data. The data is split into several tables, which will need to be combined to form the final data set. The final data set will then be split into training, validation, and testing parts. Model development will be performed using the training and validation parts, while the final performance measures will be calculated on the testing part.
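
A minimal sketch of such a three-way split is given below. The 60/20/20 proportions, the random seed, the function name `three_way_split`, and the use of integer loan identifiers are all assumptions for illustration; the proposal does not specify how the split will be made.

```python
import random


def three_way_split(rows, train_share=0.6, validation_share=0.2, seed=42):
    """Shuffle the rows reproducibly and cut them into train, validation, and test parts.

    The shares and the seed are assumed values; the remainder goes to the test part.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_share)
    n_validation = round(len(shuffled) * validation_share)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical loan identifiers standing in for rows of the final data set.
loan_ids = list(range(100))
train, validation, test = three_way_split(loan_ids)
```

Any data-dependent step, such as choosing bin boundaries or computing WoE values, would normally be fitted on the training part only and then applied unchanged to the validation and testing parts, so that the final performance measures are not contaminated by information from the test data.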

Expected contribution

This thesis aims to enrich the extant credit modeling literature by inspecting the effects of binning and subsequently encoding numerical variables on model performance. More specifically, it will evaluate the effect of utilizing the optimal binning procedure devised by Navas-Palencia (2020). To the best of our knowledge, the evaluation of this procedure with respect to multiple machine learning methods in the context of credit risk modeling is a missing piece in the existing literature. Therefore, the results of this thesis may have important implications for the development of models predicting the probability of default.

Outline

1. Introduction – Motivation, overview of the thesis structure
2. Literature review – Careful review of the existing literature
3. Methodology – Description of utilized models, transformations, optimal binning procedure, evaluating metrics
4. Data description – Description of the utilized data set
5. Results – Discussion of the results
6. Conclusion – Summary of the results, limitations, and opportunities for further research