Compliance Testing - Fairness Assessment using R
- Compliance Testing - Fairness Assessment using R
- Preprocessing - Semantic tagging using the LSEG-PermID (Open Calais) service
- Step1 - Load data and libraries
- Step 2 – Perform Principal Component Analysis (PCA) and evaluate clustering potential
- Step 3 - Perform K-means on a factor scores sub-space at evaluate performance with a minimum number of clusters
- Step 4 - Display retained clusters statistics
- Step5 - Evaluate fairness of Basing Hall-BIC selection process
In the present example, the data tables is located in directory: C:\Projets_En_Cours\AI_MTPL__BasingHall\25-May-2021\Results
write.csv(grp,”C:/Projets_En_Cours/AI_MTPL/__BasingHall/25-May-2021/Results/grp.csv”, sep=”,”)
NOTE -> The group vector grp is exported and then modified for transfer of a few odd companies into of a group 6 for outliers
Comparing the Retained and Dismissed groups of companies for testing the “fairness” of the selection process
This experiment is meant to test the usability of the company classification based on their PermID “Intelligent Tagging” scores. It is using the section of the sections of the questionnaire named: “Status”. It is aiming at providing an answer to the question:
“Does it exist a bias in the way Basinghall Partners retains or dismisses a particular startup after it has submitted its statement of interest?”
This is what we have called “fairness analysis”. In other words, based on the classification presented above, is it possible to demonstrate that the selection of dossiers processed is an unbiased (i.d., fair) process?
Determining the amount of similiraties between the Retained and Dismissed groups through their PermID characterisation
In order to test this hypothesis, we have separated the population of startups in two sub-groups:
- Dismissed: for Status, 21. Discontinued or 22. Dismissed
- Retained: for all the other values of Status
We compare each of these classes through the statistical properties of their corresponding K-means sub-groups. This is done in the factor space (2,3)
# Composite plot of ellipses with status
grp <- as.factor(grp_modified$GrwithStat)
# Here we want to have the ellipses of Status=Dismissed in grey
# See deriving color palette for 9 classes : http://www.sthda.com/english/wiki/colors-in-r
fviz_pca_biplot(occ.pca, axes = c(2, 3),
habillage = grp,
palette = c(“#FF0099”, “#993FFF”, “#0066CC”, “#33CCCC”, “#009966”, “#AAAAAA”,”#AAAAAA”, “#AAAAAA”, “#AAAAAA”, “#AAAAAA”, “#AAAAAA”, “#AAAAAA”),
addEllipses = TRUE)

This diagram shows that for 5 groups, the “dispersion ellipses” of the Dismissed sub-groups (showed in grey) overlap almost perfectly with the “dispersion ellipses” of that of original population of companies.
From these results it is concluded that the Basing Hall Partners process is reasonably “fair”
It is also common to represent each groups by their average vector of category scores.
# Statitics for each group of companies
grp_stat <- describeBy(OCC_wStatus[,c(4:14,15:17)],grp_modified$Group, mat=TRUE)
# Statitics for each group of companies broke down by Dismissed-Retained categories
gpr_statDR <- describeBy(OCC_wStatus[,c(4:14,15:17)],grp_modified$GrwithStat, mat=TRUE)

In the above table the highest values category scores are presented in yellow. It shows that all the companies’ activities involve Internet with added specificities:
- GR1: “Business-Finance”
- GR2: “Hospitality-Recreation with financing”
- GR3: “Helth_Pharmacy”
- GR4: “Environment Agriculture”
- GR5: “Sports”