Step 9 – Analysis of outliers

Posted March 26, 2025

Updated April 6, 2025

By wpusername7953

Purpose

This implementation follows, step by step, the contents of Chap. 4, section 4.5.8: Analysis of outliers, pp. 162-164.

Method

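The authors base their approach on the “interquartile range” (IQR = Q3 - Q1). More precisely, they call “outliers” those samples whose values lie well above the 3rd quartile (Q3) or well below the 1st quartile (Q1). The implementation in this step flags “extreme” outliers, i.e. values above Q3 + 3 x IQR or below Q1 - 3 x IQR.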
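As a minimal sketch of this rule on made-up numbers: the helper name IqrFences and the toy vector are illustrative assumptions, not taken from the book or the repository. The multiplier k = 3 corresponds to the “extreme” fences used in this step; k = 1.5 is the more common “mild” setting.

```r
# Illustrative helper (hypothetical name): lower and upper IQR fences of a numeric vector
IqrFences <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy data: nine regular values and one very large one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
fences <- IqrFences(x, k = 3)
print(fences)

# Values outside the fences are the ones flagged as extreme outliers
flagged <- x[x < fences["lower"] | x > fences["upper"]]
print(flagged)
```

Here only the value 100 falls outside the fences, so it is the only one flagged.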

Computing the number of outliers for the ratio ROE (page 163) -> (https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step9/roeoutlr)
Computing outliers percentage for all the ratios -> (https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step9/alloutlr)
Masking outlier values using the R NA notation -> (https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step9/naoutlr)

Computing the number of outliers for the ratio ROE (page 163)

Following the authors' approach, we implement the FindOutliers() function in order to detect extreme outliers:

FindOutliers <- function(data) {
  lowerq = quantile(data, na.rm=TRUE)[2]
  upperq = quantile(data, na.rm=TRUE)[4]
  iqr = upperq - lowerq
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  # return the indices of the extreme outliers
  which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}

For more information on the detection of outliers in R, see for example: “Outliers detection in R” -> https://statsandr.com/blog/outliers-detection-in-r/

The FindOutliers() function is then used to evaluate the percentage of extreme outliers for the ROE variable:

extreme.outl <- FindOutliers(wcs2train$ROE)
# compute percentage of extreme outliers
outlprc = length(extreme.outl)/length(wcs2train$ROE)
outlprc

The printed output is:

[1] 0.1210692

NOTE : This value (about 12.1%) compares closely with the 12% indicated by the authors on page 164

Computing outliers percentage for all the ratios

We start by sub-setting the ratios from the W_CS_1_AnalysisSampleDataSet_2B.xls table

ratiovars <- c(86:119)
wcs2train.ratios <- wcs2train[ratiovars]
sapply(wcs2train.ratios, class)

The percentage of outliers is then computed for all the Ratio variables

for(i in 1:length(wcs2train.ratios)){
  # use the function to identify extreme outliers
  extreme.outl <- FindOutliers(wcs2train.ratios[,i])
  # compute percentage of extreme outliers
  outlprc = length(extreme.outl)/length(wcs2train.ratios[,i])
  cat(sprintf("%s, %.2f\n", colnames(wcs2train.ratios)[i], outlprc))
}
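The same computation can also be sketched with vapply(), which returns a named vector that is easy to sort. The data frame toy.ratios below is a made-up stand-in for wcs2train.ratios, and FindOutliers() is repeated in compact form only so that the sketch runs on its own.

```r
# Same extreme-outlier rule as FindOutliers(), repeated so the sketch is self-contained
FindOutliers <- function(data) {
  q <- quantile(data, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  which(data > q[2] + 3 * iqr | data < q[1] - 3 * iqr)
}

# Toy stand-in for wcs2train.ratios (made-up values)
toy.ratios <- data.frame(
  A = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
  B = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# One share of extreme outliers per column, as a named numeric vector
outl.share <- vapply(toy.ratios,
                     function(col) length(FindOutliers(col)) / length(col),
                     numeric(1))

# Sort so that the most affected ratios come first
print(round(sort(outl.share, decreasing = TRUE), 2))
```

Sorting the vector gives the ranking directly, without reading through the printed lines one by one.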

Highest outliers percentage:

Ratio	Percentage	Ratio	Percentage	Ratio	Percentage	Ratio	Percentage
EXTRIC	0.24	SALESONV	0.13	ROE	0.12	INVENTOR	0.1

NOTE : The results show that ROE, INVENTOR, EXTRIC and SALESONV each have an “outliers” percentage >= 10%

Lower outliers percentage:

Ratio	Percentage	Ratio	Percentage	Ratio	Percentage	Ratio	Percentage
DEBTEQU	0.09	EBITDAIE	0.09	RECEIVAB	0.09	ROETR	0.09
DEBTEQUTR	0.08	SALESMIN	0.07	ROAMINUS	0.06	V95A	0.06
EQUILIABL	0.05	EQUITYON	0.05	IEONEBIT	0.05	IEONFINA.	0.05
NIEONEBI	0.05	ROA	0.05	COMMERCI	0.04	CURRENT	0.04
PAYABLES	0.04	ROI	0.04	ROS	0.04	V89A	0.03
EBITDAON	0.03	QUICKRA	0.02	TAXESONG	0.02	ASSETSTU	0.01
V94A	0.01

NOTE : It is worth pointing out that a few NA (Not Available) values are present, principally in the TAXESONG ratio
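The percentages above use length() as the denominator, which also counts NA entries. As a small sketch on made-up data, the share can be computed both ways to see how much the missing values matter; the variable names are illustrative only.

```r
# Toy ratio with two missing values and one extreme value (made-up data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100, NA, NA)

q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr <- q[2] - q[1]
is.extreme <- x > q[2] + 3 * iqr | x < q[1] - 3 * iqr  # NA where x is NA

n.extreme <- sum(is.extreme, na.rm = TRUE)
share.all   <- n.extreme / length(x)        # NA rows counted in the denominator
share.valid <- n.extreme / sum(!is.na(x))   # only observed values counted
print(c(all = share.all, valid = share.valid))
```

With only a few NA values per ratio, the two shares should stay close; the difference grows with the proportion of missing data.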

Anomalous outliers percentage:

Ratio	Percentage	Ratio	Percentage	Ratio	Percentage	Ratio	Percentage
IEONLIAB	0	INTANGIB	0	TRADERE.	0	V110A	0
TRADEPA.	0

NOTE : The results show anomalies in the computation of the quantiles for IEONLIAB, INTANGIB, TRADERE., V110A and TRADEPA.:

for IEONLIAB, it is probably due to the presence of a single extreme outlier (value 107.47) in a population otherwise ranging from 0 to 23
for all the other ratios, the result is harder to explain
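One way to look into these zero percentages is to print, for a given ratio, the quartiles, the extreme fences, the observed range and the raw count of flagged values. The sketch below does this on a made-up column built from the figures quoted for IEONLIAB (values spread over 0-23 plus one value of 107.47); the shape of the distribution, the number of rows and the helper name DescribeFences are assumptions. It also illustrates that a share below 0.005 is printed as 0.00 by the "%.2f" format, which may be part of the explanation and would need to be checked against the real data.

```r
# Hypothetical helper: summary of the fences and of the number of flagged values
DescribeFences <- function(x, k = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  c(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE),
    q1 = q[1], q3 = q[2], lower = lower, upper = upper,
    n.extreme = sum(x < lower | x > upper, na.rm = TRUE))
}

# Made-up column: 300 values spread over 0-23 and one value of 107.47
toy <- c(seq(0, 23, length.out = 300), 107.47)
d <- DescribeFences(toy)
print(round(d, 2))

# The same share printed with two and with four decimals
share <- d[["n.extreme"]] / length(toy)
cat(sprintf("%.2f vs %.4f\n", share, share))
```

Printing the count next to the share makes it possible to tell a ratio with no flagged value from a ratio with very few.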

Masking outlier values using the R NA notation

It is common practice to “mask out” outlier values for each of the Ratio variables in order to work with more manageable statistical distributions.
This follows the recommendations of the authors in section 4.5.9.1, Treatment of outliers, page 164.
Here we mask out the outliers by replacing their values with NA (Not Available), the R notation for missing data.

Encoding outlier values as NA

wcs2train.ratios.NA <- wcs2train.ratios
for(i in 1:length(wcs2train.ratios.NA)){
  # use the function to identify extreme outliers
  extreme.outl <- FindOutliers(wcs2train.ratios.NA[,i])
  # Replacing extreme outliers values by NA
  wcs2train.ratios.NA[,i][extreme.outl] <- NA
  cat(sprintf("%s\n", colnames(wcs2train.ratios)[i]))
}
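A quick check after masking is to count how many values were turned into NA in each column. The sketch below applies the same masking with lapply() to a made-up data frame (toy.ratios is a stand-in for wcs2train.ratios, and FindOutliers() is repeated in compact form so the sketch runs on its own).

```r
# Same extreme-outlier rule as FindOutliers(), repeated so the sketch is self-contained
FindOutliers <- function(data) {
  q <- quantile(data, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  which(data > q[2] + 3 * iqr | data < q[1] - 3 * iqr)
}

# Toy stand-in for wcs2train.ratios (made-up values)
toy.ratios <- data.frame(
  A = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
  B = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Mask the extreme outliers column by column, keeping the data frame structure
toy.ratios.NA <- toy.ratios
toy.ratios.NA[] <- lapply(toy.ratios, function(col) {
  col[FindOutliers(col)] <- NA
  col
})

# Number of values newly masked in each column
masked <- colSums(is.na(toy.ratios.NA)) - colSums(is.na(toy.ratios))
print(masked)
```

On the real data, these counts should match the outlier counts obtained in the previous section.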

In order to graphically evaluate the effects of this encoding, we use the chart.Correlation() function from the PerformanceAnalytics R package -> https://cran.r-project.org/web/packages/PerformanceAnalytics/index.html

library("PerformanceAnalytics")
chart.Correlation(wcs2train.ratios.NA, histogram=TRUE, pch=19)

The result is illustrated in Table_4_21b_Page164_AllvariableswithNA_CoorDiag.pdf

NOTE : In this chart, the scatter plots provide far more information than those obtained from the unprocessed data table with outliers present (-> https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step8/alternr)

Recomputing the Pearson correlation matrix with outlier values removed

In order to compute the matrix of p-values, we use the custom cor.pvalue() R function already introduced in chapter “step8”: -> https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step8/selectvar

# Function computing the matrix of p-values
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.pvalue <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
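Once outliers are masked, a pair of columns may share fewer than three complete observations, in which case cor.test() stops with an error. The variant below is a hedged sketch, assuming it is acceptable to leave such cells as NA; cor.pvalue.safe is a hypothetical name and the data are made up, neither is part of the original code.

```r
# Hypothetical variant of cor.pvalue(): cells where cor.test() fails are left as NA
cor.pvalue.safe <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA_real_, n, n, dimnames = list(colnames(mat), colnames(mat)))
  diag(p.mat) <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # cor.test() errors when too few complete pairs remain; keep NA in that case
      p <- tryCatch(cor.test(mat[, i], mat[, j], ...)$p.value,
                    error = function(e) NA_real_)
      p.mat[i, j] <- p.mat[j, i] <- p
    }
  }
  p.mat
}

# Made-up data: column c has only two observed values
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  c = c(NA, NA, NA, NA, 1, 2)
)
p <- cor.pvalue.safe(toy)
print(round(p, 4))
```

Whether NA is the right value for such cells depends on how the matrix is used afterwards, so this is a design choice to review rather than a drop-in replacement.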

In order to produce a graphical representation of the Pearson correlation between all NA-masked Ratio variables, we use the corrplot R package: -> https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

library(corrplot)
corrprs <- cor(wcs2train.ratios.NA, use="pairwise", method="pearson")
p.mat <- cor.pvalue(wcs2train.ratios.NA)
col <- colorRampPalette(c("#BB4444", "#fcc3b8", "#FFFFFF", "#add2f7", "#4fc69d"))
corrplot(corrprs, method="color", col=col(200),
         type="upper",
         addCoef.col = "black", # Add coefficient of correlation
         addCoefasPercent = TRUE,
         tl.col="black", tl.srt=45, # Text label color and rotation
         # Combine with significance
         p.mat = p.mat, sig.level = 0.01, insig = "blank",
         # hide correlation coefficient on the principal diagonal
         diag=FALSE
)

The graphical representation of the Pearson correlation between all NA-masked Ratio variables is presented in Table_4_21c_Page 164_RatioswithNA_Correlation.pdf
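With use="pairwise" (an abbreviation of "pairwise.complete.obs"), each coefficient is computed on the rows where both variables are observed, so after masking, different cells of the matrix can rest on different sample sizes. The sketch below, on made-up data, counts the rows behind each coefficient; the variable names are illustrative.

```r
# Made-up data with NA values in different rows
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, NA),
  y = c(2, 4, 5, 4, 5, 7),
  z = c(NA, NA, 3, 5, 4, 6)
)

# Pearson correlation on pairwise complete observations
r <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# 1 where a value is observed, 0 where it is NA
observed <- 1 * (!is.na(as.matrix(toy)))

# Number of rows where both variables of each pair are observed
n.pairs <- crossprod(observed)

print(round(r, 2))
print(n.pairs)
```

Coefficients based on very few rows deserve extra caution when reading the diagram.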

NOTE : When comparing with the same diagram obtained in step 8 -> https://github.com/MoiraCorp/DLMM-IRating-in-R/tree/main/steps/step8/allvar
it appears that, although groups of correlated variables do appear again in the new display, there are some remarkable differences

Updating groups of correlated variables with outlier values removed

GR1: the ROE, ROETR, DEBTEQUTR group no longer appears

GR2: EBITDAON, V89A (97% with EBITDAON), ROS (92% with EBITDAON)

Column in R table	Code in text	Description
EBITDAON-87	EBITDAonSALES	Ratio EBITDA/Sales [%]
V89A-90	EBITDAonVP	Ratio EBITDA/Value of Production
ROS-91	ROS	Ratio EBIT/Sales [%]

GR3: ROI, ROA (83% with ROI) with no correlation with ASSETSTU or IEONLIAB

Column in R table	Code in text	Description
ROI-88	ROI	Ratio EBIT/Operating Assets [%]
ROA-89	ROA	Ratio Current Income/Total Assets [%]

GR4A: V94A, TRADERE. (70% with V94A) with no correlation with V95A or COMMERCI

Column in R table	Code in text	Description
V94A-95	RECEIVABLES_PERIOD	Ratio Trade Receivables/Daily Sales
TRADERE.-106	TRADE_RECEIVABLESonTA	Ratio Trade Receivables/Total Assets [%]

GR4B: V95A, V110A (75% with V95A) with no correlation with V94A or COMMERCI

Column in R table	Code in text	Description
V94A-95	RECEIVABLES_PERIOD	Ratio Trade Receivables/Daily Sales
V110A-107	INVENTORIESonTA	Ratio Inventories/Total Assets [%]

GR5: IEONEBIT, NIEONEBI (95%)

Column in R table	Code in text	Description
IEONEBIT-99	IEonEBITDA	Ratio Interest Expenses/EBITDA [%]
NIEONEBI-100	NIEonEBITDA	Ratio Net Interest Expenses/EBITDA [%]
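Groups such as these can also be read off the correlation matrix programmatically. The following is a minimal sketch on made-up data, assuming a cut-off of 0.7 on the absolute Pearson coefficient (the groups above quote correlations of 70% and more); the threshold and the variable names are illustrative choices.

```r
# Made-up data with one strongly related pair (a, b) and one unrelated column
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8),
  b = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9),
  c = c(5, 1, 4, 2, 6, 3, 1, 5)
)
r <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Hypothetical cut-off on the absolute correlation
threshold <- 0.7

# Keep each pair once by looking only at the upper triangle
idx <- which(abs(r) >= threshold & upper.tri(r), arr.ind = TRUE)
strong.pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 2)
)

# Strongest pairs first
print(strong.pairs[order(-abs(strong.pairs$r)), ])
```

Applied to corrprs, such a listing could be compared against the groups identified visually in the diagram.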

Company Default prediction - DLMM Internal Rating Model in R