Compliance Testing - Fairness Assessment using R

Posted June 22, 2025

Updated June 27, 2025

By wpusername7953

Purpose

This exercise tests whether, in a venture capital (VC) firm, the data-driven process of discovering and selecting startup dossiers is unbiased and compliant with a declared principle of “fairness”. Apart from the usual financial assessment, the data-driven selection relies on the descriptions each startup provides, such as its value proposition, its customers’ pain points and a list of top benefits for customers.

We propose replacing the traditional, cumbersome manual process of startup sourcing and screening with a Machine Learning (ML) process in three steps:

activity characterisation using a Natural Language Processing (NLP) tagging system
followed by a K-means clustering algorithm that groups the startups by their activity
and a test of whether the selection or dismissal of their dossiers is a “fair” process
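
As an illustration of the third step, the sketch below shows one possible way to test fairness, assuming "fair" means that the selection outcome is statistically independent of the activity cluster. This is an assumption: the text above does not say which test is used. The data frame `dossiers`, its columns and the simulated values are hypothetical placeholders, and the chi-squared test of independence (`chisq.test` from base R) is only one candidate.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical data: 300 dossiers, each with an activity cluster label
# and a selection outcome drawn independently of the cluster.
dossiers <- data.frame(
  cluster  = factor(sample(c("A", "B", "C"), 300, replace = TRUE)),
  selected = factor(sample(c("selected", "dismissed"), 300,
                           replace = TRUE, prob = c(0.3, 0.7)))
)

# Contingency table: clusters in rows, outcomes in columns.
tab <- table(dossiers$cluster, dossiers$selected)

# Selection rate within each cluster (row-wise proportions).
rates <- prop.table(tab, margin = 1)[, "selected"]

# Chi-squared test of independence between cluster and outcome.
# A small p-value would suggest that selection depends on the cluster.
test <- chisq.test(tab)

print(tab)
print(round(rates, 3))
print(test$p.value)
```

A large p-value here only means that the data give no evidence of dependence between cluster and outcome; it does not prove that the process is fair.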

Method

Preprocessing – Semantic tagging using the LSEG-PermID (Open Calais) service -> (https://github.com/MoiraCorp/Innovkg-exercise-km/tree/main/permid-preprocess)
Step1 – Load data and libraries in R -> (https://github.com/MoiraCorp/Innovkg-exercise-km/tree/main/step1)
Step2 – Perform Principal Component Analysis (PCA) and evaluate clustering potential in various factor scores subspaces -> (https://github.com/MoiraCorp/Innovkg-exercise-km/tree/main/step2)
Step3 – Perform K-means on a factor scores subspace and evaluate performance with a minimum number of clusters -> (https://github.com/MoiraCorp/Innovkg-exercise-km/tree/main/step3)
Step4 – Display retained clusters statistics -> (https://github.com/MoiraCorp/Innovkg-exercise-km/tree/main/step4)
Step5 – Evaluate fairness of the startup dossier selection process -> (https://github.com/MoiraCorp/Innovkg-exercise-km/tree/main/step5)
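
Steps 2 to 4 can be sketched in base R as follows. This is a minimal sketch on synthetic data, not the code from the linked repository: the matrix `tags`, its dimensions, the choice of the first two factor scores and the value of 3 clusters are all assumptions made for illustration.

```r
set.seed(42)  # reproducible synthetic data

# Hypothetical stand-in for the tag matrix: 60 startups x 6 tag scores,
# built from three groups with different mean profiles.
n_per_group <- 20
centers <- rbind(c(4, 4, 0, 0, 0, 0),
                 c(0, 0, 4, 4, 0, 0),
                 c(0, 0, 0, 0, 4, 4))
tags <- do.call(rbind, lapply(1:3, function(g) {
  matrix(rnorm(n_per_group * 6, sd = 0.5), ncol = 6) +
    matrix(centers[g, ], nrow = n_per_group, ncol = 6, byrow = TRUE)
}))
colnames(tags) <- paste0("tag", 1:6)

# Step 2: PCA on centred and scaled variables; pca$x holds the factor scores.
pca <- prcomp(tags, center = TRUE, scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)  # variance share per component

# Step 3: K-means restricted to the subspace of the first two factor scores.
scores <- pca$x[, 1:2]
km <- kmeans(scores, centers = 3, nstart = 25)

# Between-cluster share of the total sum of squares, a rough quality measure.
quality <- km$betweenss / km$totss

# Step 4: basic statistics of the retained clusters.
cluster_sizes <- table(km$cluster)

print(round(explained[1:2], 3))
print(quality)
print(cluster_sizes)
```

Clustering in a low-dimensional factor scores subspace instead of the full tag space reduces noise from minor components. Repeating the `kmeans` call for several values of `centers` and comparing `quality` is one simple way to look for a small number of clusters that still separates the data well.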