Company Default prediction - DLMM Internal Rating Model in R
- Steps followed to implement the DLMM Model in R language
- Step 1 – Converting SPSS formatted data
- Step 2 - One by one empirical analysis of variables
- Step 3 - Cross-tabulation 01STATUS versus Industry Sector Code
- Step 4 - Exploring graphically the probability distribution of a variable
- Step 5 - Testing the normality of the probability distribution of a variable
- Step 6 - Evaluating the good/bad discriminant power of a variable
- Step 7 - Empirical monotonicity of ROE relative to good-bad progression
- Step 8 - Correlation between variable couples
- Step 9 - Analysis of outliers
- Step 10 - Data encoding
- Step 11 - Synoptic table of variable properties
- Step 12 - Linear Discriminant Analysis - Initial approach
- Step 13 - Experimenting with Stepwise Linear Discriminant Analysis
- Step 14 - Gaussian Copula encoding scheme
Step 13 – Experimenting with Stepwise Linear Discriminant Analysis
Purpose
When LDA uses a large number of predictor variables (here, ratios), the stepwise method can help by automatically selecting the “best” variables to use in the model. It starts with the variable that separates the groups the most, then adds further variables one at a time, keeping a candidate only if it significantly improves a global separability criterion.
Wilks' Lambda is the separability criterion usually adopted (see -> https://en.wikipedia.org/wiki/Wilks%27s_lambda_distribution).
Wilks' Lambda ranges from 0 to 1, where 0 means total discrimination and 1 means no discrimination. Each new independent variable is tested by putting it into the model and then taking it out, which generates a Λ (Lambda) statistic measuring what the new variable adds to the discrimination achieved by the variables already present in the model.
The significance of the change in Λ when a new variable is added to the existing group of variables is measured with an F-test (i.e., the ANOVA Fisher test on the equality of means). If the F-value is greater than the critical value (by default, 3.84), the variable is added to the model.
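The selection loop described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the klaR implementation: the helper names `wilks_lambda` and `forward_wilks` are hypothetical, the built-in `iris` data stands in for the ratio data, the sketch only adds variables (it never removes one), and it uses the fixed F-to-enter threshold of 3.84 mentioned in the text.

```r
# Wilks' Lambda for a set of variables: det(within SSCP) / det(total SSCP)
wilks_lambda <- function(vars, data, group) {
  X <- as.matrix(data[, vars, drop = FALSE])
  total <- crossprod(scale(X, center = TRUE, scale = FALSE))
  within <- Reduce(`+`, lapply(split(seq_len(nrow(X)), group), function(idx) {
    crossprod(scale(X[idx, , drop = FALSE], center = TRUE, scale = FALSE))
  }))
  det(within) / det(total)
}

# Greedy forward selection (hypothetical helper, forward steps only)
forward_wilks <- function(data, group, f_to_enter = 3.84) {
  n <- nrow(data)
  g <- nlevels(factor(group))
  selected <- character(0)
  lambda_old <- 1                      # Lambda of the empty model
  candidates <- names(data)
  while (length(candidates) > 0) {
    # Lambda obtained by adding each remaining candidate in turn
    lambdas <- sapply(candidates, function(v) {
      wilks_lambda(c(selected, v), data, group)
    })
    best <- names(which.min(lambdas))
    ratio <- lambdas[[best]] / lambda_old
    p <- length(selected)              # variables already in the model
    # Partial F statistic for the entering variable
    f_enter <- (n - g - p) / (g - 1) * (1 - ratio) / ratio
    if (f_enter < f_to_enter) break    # 3.84 is roughly qf(0.95, 1, Inf)
    selected <- c(selected, best)
    lambda_old <- lambdas[[best]]
    candidates <- setdiff(candidates, best)
  }
  selected
}

selected <- forward_wilks(iris[, 1:4], iris$Species)
selected
```

The returned vector lists the variables in their order of entry, which is the same kind of information that the first column of a stepwise statistics table conveys.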
IMPORTANT NOTES:
- The ANOVA Fisher test assumes that the predictor variables are normally distributed (i.e., display a Gauss-shaped probability density).
- The stepwise procedure assumes that the predictor variables are independent. This requires a preprocessing phase, which in LDA usually consists of determining the best discriminant feature space. The axes of this feature space are usually the predictor variables used by the stepwise LDA method (e.g., LD1, LD2 in the illustration presented at -> https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step12).
Method
- Stepwise LDA applied to the original data -> https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step13/stepwise
- Stepwise LDA applied to the authors' recommended preprocessed data -> https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step13/authstepw
Stepwise LDA on original data
In the klaR R package, the function greedy.wilks performs stepwise LDA using the Wilks' Lambda criterion, the method used by the authors (page 190).
The klaR package is downloaded from https://cran.r-project.org/web/packages/klaR/index.html and installed locally. It needs the following packages as dependencies: MASS, combinat, questionr and miniUI.
- MASS, which provides the lda() function (cbind() is part of base R) -> https://cran.r-project.org/web/packages/MASS/index.html
- combinat -> https://cran.r-project.org/web/packages/combinat/index.html
- questionr -> https://cran.r-project.org/web/packages/questionr/index.html
- miniUI -> https://pbil.univ-lyon1.fr/CRAN/bin/windows/contrib/3.4/miniUI_0.1.1.1.zip
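    library(klaR)
    # Put the BADGOOD grouping variable in front of the ratio variables
    wcs2trainL <- cbind(wcs2train$BADGOOD, wcs2train.ratios)
    names(wcs2trainL)[1] <- "BADGOOD"
    # Keep complete cases only
    w <- na.omit(wcs2trainL)
    # niveau is the significance level of the F-test that decides entry
    gw_obj <- greedy.wilks(BADGOOD ~ ., data = w, niveau = 0.1)
    gw_obj

The printed output is:

    Formula containing included variables:
    BADGOOD ~ ROETR + IEONLIAB + EQUITYON + V110A + TRADEPA. + ASSETSTU
    <environment: 0x0000000004640b30>
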
Values calculated in each step of the selection procedure:
| Ratios | Wilks.lambda | F.statistics.overall | p.value.overall | F.statistics.diff | p.value.diff |
|---|---|---|---|---|---|
| ROETR | 0.9813240 | 24.07480 | 1.047297e-06 | 24.074799 | 1.047297e-06 |
| IEONLIAB | 0.9628733 | 24.36883 | 4.127506e-11 | 24.220942 | 9.721454e-07 |
| EQUITYON | 0.9554423 | 19.63360 | 1.921645e-12 | 9.822947 | 1.763140e-03 |
| V110A | 0.9521049 | 15.87107 | 1.108036e-12 | 4.423778 | 3.563887e-02 |
| TRADEPA. | 0.9489687 | 13.56220 | 6.535577e-13 | 4.167383 | 4.141703e-02 |
| ASSETSTU | 0.9453482 | 12.14038 | 2.639645e-13 | 4.825591 | 2.822170e-02 |
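The two F columns of the table above can be reproduced from the Wilks.lambda column alone. The following is a minimal sketch, assuming the standard F transformations of Wilks' Lambda for two groups and a sample size of n = 1267. That sample size is an assumption, taken from the total of the cross-classification table reported later in this step (6 + 45 + 118 + 1098).

```r
# Wilks' Lambda after each step, copied from the table above
lambda <- c(ROETR = 0.9813240, IEONLIAB = 0.9628733, EQUITYON = 0.9554423,
            V110A = 0.9521049, TRADEPA. = 0.9489687, ASSETSTU = 0.9453482)

n <- 1267                 # assumed number of complete cases
g <- 2                    # number of groups (Bad, Good)
p <- seq_along(lambda)    # number of variables in the model after each step

# Overall F for two groups: ((n - p - 1) / p) * (1 - Lambda) / Lambda
f_overall <- (n - p - 1) / p * (1 - lambda) / lambda

# Partial F for the entering variable, from the ratio of successive Lambdas;
# p - 1 variables are already in the model when the p-th one is tested
ratio  <- lambda / c(1, head(lambda, -1))
f_diff <- (n - g - (p - 1)) / (g - 1) * (1 - ratio) / ratio

round(cbind(f_overall, f_diff), 3)
```

Under these assumptions the computed values should agree with the F.statistics.overall and F.statistics.diff columns to about two decimal places, the remaining gap coming from the rounding of the printed Lambda values.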
We then fit an LDA on the set of ratio variables selected by the stepwise procedure:

    z <- lda(BADGOOD ~ ROETR + IEONLIAB + EQUITYON + V110A + TRADEPA. + ASSETSTU,
             data = w, prior = c(1, 1) / 2, CV = TRUE)
    # Rows: observed class, columns: class predicted by leave-one-out cross-validation
    tab <- table(w$BADGOOD, z$class)
    tab

The printed output is:

| | "Bad" | "Good" |
|---|---|---|
| "Bad" | 6 | 45 |
| "Good" | 118 | 1098 |
NOTE: The result is almost identical to that of the first run of full LDA on the original data, with a slight improvement on the GOOD class.
Although the list of variables ordered by declining Wilks.lambda looks similar to the authors' Table 4.28, Stepwise statistics (page 190), it is not fully identical. The authors' list is:
| Ratios | Wilks.lambda | F.statistics.overall |
|---|---|---|
| ROETr | .981 | 24.075 |
| InterestExpenses/Liabilities | .963 | 24.369 |
| INVENTORY._PERIOD3 | .955 | 19.854 |
| Equity/Permanent Capital | .947 | 17.520 |
| EQUITYon PERMANENT_CAPITAL3 | .935 | 17.622 |
| SALESBin2cll | .931 | 15.573 |
| ROA-(InterestExpenses/ TotalLiabilities) | .928 | 13.991 |
NOTE: One can see that the authors have added three ad hoc “category” (or “binned”) variables: INVENTORY._PERIOD3, EQUITYon PERMANENT_CAPITAL3 and SALESBin2cll.
Applying the authors' recommended stepwise LDA on datatable wcs8train
To compare the results of the methods on objective grounds, one needs to perform stepwise LDA on the same preprocessed datatable as the authors (as was already done in step 12 for regular LDA). On page 188, the authors state that they are using the datatable W_CS_1_AnalysisSampleDataSet_8MDA.sav.
    library(haven)
    # Read the SPSS file and export it as CSV
    wcs2train8MDA <- read_sav("C:/Projets_En_Cours/AI_MTPL/UCI_Internal_Ratings/SPSS-PASW/W_CS_1_AnalysisSampleDataSet_8MDA.sav")
    write.csv(wcs2train8MDA, file = "C:/Projets_En_Cours/AI_MTPL/UCI_Internal_Ratings/SPSS-PASW/W_CS_1_AnalysisSampleDataSet_8MDA.csv")

The variable names are then shortened following our conventions and the file is reloaded:

    wcs8trainMDA <- read.csv("C:/Projets_En_Cours/AI_MTPL/UCI_MTPL_Internal_Ratings/SPSS-PASW/W_CS_1_AnalysisSampleDataSet_8MDA.csv", header = TRUE, sep = ",")

To conduct stepwise LDA, we again use the greedy.wilks function of the klaR R package, which performs stepwise LDA using the Wilks' Lambda criterion, the method used by the authors (page 190). The klaR package is downloaded from https://cran.r-project.org/web/packages/klaR/index.html and installed locally.

    library(klaR)
    w <- na.omit(wcs8train.vars[c(1:54, 56:74)])
    gw_obj <- greedy.wilks(BADGOOD ~ ., data = w, niveau = 0.1)

The call crashed with:

    Error in summary.manova(e2, test = "Wilks") : residuals have rank 7 < 8

This error points to linear dependence among the candidate variables, which is consistent with the collinearity warning issued by lda() below.

NOTE: If we do not specify a priori probabilities, the lda function displays the coefficients of the regression formula.

    w <- na.omit(wcs8train.vars[c(1:54, 56:74)])
    z <- lda(BADGOOD ~ ROETR + IEONLIAB + V95A3A + EQUITYON + EQUITYON3A + SALESBin2cl1 + ROAMINUS,
             data = wcs8train.vars, na.action = "na.omit")
    # Without CV = TRUE the fit has no $class component, so we print the model itself
    z

R issues the following warning:

    Warning message:
    In lda.default(x, grouping, ...) : variables are collinear

The printed output is:
Prior probabilities of groups:
| "Bad" | "Good" |
|---|---|
| 0.04009434 | 0.95990566 |
Coefficients of linear discriminants:
| | LD1 |
|---|---|
| ROETR | 9.964643e-05 |
| IEONLIAB | -7.999371e-02 |
| V95A3A | -3.570714e-03 |
| EQUITYON | -2.099401e-03 |
| EQUITYON3A | 8.076022e-03 |
| SALESBin2cl1 | 5.146604e-01 |
| ROAMINUS | -1.457695e-03 |
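The `variables are collinear` warning, like the rank error raised by greedy.wilks, suggests that some predictors are linear combinations of others. As a minimal sketch on simulated data (the matrix `X` and its columns `x1` to `x4` are hypothetical and not taken from wcs8train.vars), base R's qr() can be used to detect the rank deficiency and to point at a redundant column:

```r
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
x4 <- rnorm(100)
# x3 is an exact linear combination of x1 and x2
X <- cbind(x1 = x1, x2 = x2, x3 = x1 + x2, x4 = x4)

q <- qr(X)
q$rank                 # numerical rank, smaller than ncol(X) when columns are dependent

# The default QR routine moves near-dependent columns to the end of the pivot order
redundant <- colnames(X)[q$pivot[seq_len(ncol(X)) > q$rank]]
redundant
```

Applied to the real predictors, the same check would indicate which variables to drop before running the stepwise procedure again.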
Comparison with the coefficients presented by the authors on page 194 as “Canonical discriminant function coefficients”:
| Ratio | Our results | Authors’ results |
|---|---|---|
| ROETR | 0 | 0 |
| IEONLIAB | -0.08 | 0.08 |
| V95A3A | -0.0036 | 0.004 |
| EQUITYON | -0.0021 | 0.002 |
| EQUITYON3A | 0.0081 | -0.008 |
| SALESBin2cl1 | 0.515 | -0.512 |
| ROAMINUS | -0.0015 | 0.001 |
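NOTE: The results are identical in absolute value, up to rounding. Only the signs are reversed, which corresponds to the opposite orientation of the same discriminant axis.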
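A reversal of all coefficient signs leaves the classification unchanged, since each case is still assigned to the group whose mean score is nearest. The following is a minimal sketch on simulated data: the matrix `x`, the labels `grp`, the coefficient vector `b` and the helper `classify` are all hypothetical, and the nearest-centroid rule on the score axis is a simplification of what lda() does with equal priors.

```r
set.seed(1)
n <- 50
# Two simulated groups with different means on two predictors
x <- rbind(matrix(rnorm(2 * n, mean = 0),   ncol = 2),
           matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
grp <- rep(c("Bad", "Good"), each = n)

b <- c(0.8, 0.6)   # hypothetical discriminant coefficients

classify <- function(coefs) {
  score <- drop(x %*% coefs)              # discriminant score of each case
  centroid <- tapply(score, grp, mean)    # mean score of each group
  # Assign each case to the nearest group centroid on the score axis
  ifelse(abs(score - centroid["Bad"]) < abs(score - centroid["Good"]),
         "Bad", "Good")
}

same_assignment <- identical(classify(b), classify(-b))
same_assignment
```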
We compute the final comparative cross-classification table:

    z <- lda(BADGOOD ~ ROETR + IEONLIAB + V95A3A + EQUITYON + EQUITYON3A + SALESBin2cl1 + ROAMINUS,
             data = wcs8train.vars, na.action = "na.omit", prior = c(1, 1) / 2, CV = TRUE)
    # Rows: observed class, columns: class predicted by leave-one-out cross-validation
    tab <- table(wcs8train.vars$BADGOOD, z$class)
    tab

The printed output is:

| | "Bad" | "Good" |
|---|---|---|
| "Bad" | 14 | 37 |
| "Good" | 142 | 1079 |
NOTE: This table is very similar to the one obtained by the authors (page 201).
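The two cross-classification tables of this step can be compared through their per-class hit rates. The sketch below re-enters the counts printed above by hand (rows are observed classes, columns are predicted classes); the object names `orig`, `prep` and `hit_rates` are only illustrative.

```r
# Counts copied from the two cross-classification tables of this step
orig <- matrix(c(6, 118, 45, 1098), nrow = 2,
               dimnames = list(observed = c("Bad", "Good"),
                               predicted = c("Bad", "Good")))
prep <- matrix(c(14, 142, 37, 1079), nrow = 2,
               dimnames = list(observed = c("Bad", "Good"),
                               predicted = c("Bad", "Good")))

# Share of correctly classified cases per observed class, plus overall accuracy
hit_rates <- function(tab) {
  c(diag(tab) / rowSums(tab), overall = sum(diag(tab)) / sum(tab))
}

round(rbind(original = hit_rates(orig), preprocessed = hit_rates(prep)), 3)
```

On these counts the share of correctly identified "Bad" cases rises from 6/51 on the original data to 14/51 on the preprocessed data, while the "Good" hit rate and the overall accuracy decrease slightly.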