Compliance Testing - Fairness Assessment using R
- Compliance Testing - Fairness Assessment using R
- Preprocessing - Semantic tagging using the LSEG-PermID (Open Calais) service
- Step 1 - Load data and libraries
- Step 2 - Perform Principal Component Analysis (PCA) and evaluate clustering potential
- Step 3 - Perform K-means on a factor scores sub-space and evaluate performance with a minimum number of clusters
- Step 4 - Display retained clusters statistics
- Step 5 - Evaluate fairness of Basing Hall-BIC selection process
What is LSEG-PermID?
LSEG provides several services that assign Permanent Identifiers (PermIDs) to financial information relating to any product or company. These services are grouped under the PermID “banner” ( https://permid.org/ ) and are delivered in “Open” mode. They are intended to facilitate and standardize the exchange of information between the various stakeholders of the economic sector. This service derives from an older product named “Open Calais”, which was acquired from Reuters about 10 years ago ( https://en.wikipedia.org/wiki/Calais_(Reuters_product) ).
This technology was originally developed by a US-Israeli company named ClearForest, which was later acquired by Reuters. The online service, its dictionary and its data structures are built around a concept known as the “Semantic Web” ( https://en.wikipedia.org/wiki/Semantic_Web ). This technology and its standards were introduced in 2001 by Berners-Lee and his colleagues as a way of describing, on the Web, the relations between the “meanings” of published data or information.
What is the semantic tagging service offered by PermID?
In its present PermID version, the “Open Calais” engine extracts semantic information from any text provided in .txt, .pdf, .xml or .html format. It uses natural language processing to compare the submitted text against a large semantic database and returns a list of the most significant keywords found to relate to the submitted document. These keywords are organized in several sets: Category, Industry, Social Tags and Individuals.
Using PermID “Intelligent Tagging” on the VC “Statement of Interest” Questionnaire
The preprocessing phase uses the Register of Interest database generated from the online questionnaire submitted to each startup.
For each company in the database, semantic tags are identified by the LSEG-PermID Intelligent Tagging (Open Calais) service from the concatenation of 3 descriptive fields present in the database.
These fields are:
- 2.2 Value Proposition
- 2.3 Customer’s pain points
- 2.4 TOP benefits for customer
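The concatenation of the three fields can be sketched in R as follows. This is a minimal illustration, not the repository's actual preprocessing code: the column names (`value_proposition`, `pain_points`, `top_benefits`), the helper `combine_fields()` and the two toy rows are assumptions made for the example.

```r
# Toy stand-in for the Register of Interest database.
# Column names and contents are hypothetical.
startups <- data.frame(
  id = c(1, 2),
  value_proposition = c("We optimise logistics with data analytics.", NA),
  pain_points = c("Manual planning is slow.", "Reporting is error-prone."),
  top_benefits = c("Faster, better-informed decisions.", "Automated audit trail."),
  stringsAsFactors = FALSE
)

# Join the selected text fields of each row into one string,
# skipping missing or empty answers.
combine_fields <- function(df, fields) {
  vapply(seq_len(nrow(df)), function(i) {
    parts <- trimws(unlist(df[i, fields], use.names = FALSE))
    parts <- parts[!is.na(parts) & nzchar(parts)]
    paste(parts, collapse = " ")
  }, character(1))
}

tagging_fields <- c("value_proposition", "pain_points", "top_benefits")
startups$tagging_text <- combine_fields(startups, tagging_fields)
```

The resulting `tagging_text` column holds one text per startup, which is the kind of input the tagging service would receive.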
| Actions | Example of combined fields: value proposition and TOP benefits for customers |
|---|---|
| Input | Our company is technology-based with a unique and particular profile, which is distinguished by its solid analytical background acquired in multiple contexts, from academia to industry. It combines mathematical optimization with advanced data analytics to enhance the performance of companies and the strategic development of their business. Our solutions allow our customers to make more informed decisions in logistics and production operations. Through optimization and data analysis we are bringing efficiency and visualization to business processes. |
| Output | ==== Categories identified by PermID along with score ==== Business_Finance – > 1.000 Technology_Internet – > 0.985 |
| Table 1 – Example of “Category” determination by PermID “Intelligent Tagging” |
The activity “tags” determined by PermID from the concatenation of these 3 questionnaire fields are:
- A1 – Religion_Belief
- A2 – War_Conflict
- A3 – Business_Finance
- A4 – Health_Medical_Pharma
- A5 – Labor
- A6 – Entertainment_Culture
- A7 – Social Issues
- A8 – Education
- A9 – Technology_Internet
- A10 – Environment_Agriculture
- A11 – Human Interest
- A12 – Hospitality_Recreation
- A13 – Disaster_Accident
- A14 – Politics
- A15 – Sports
- A16 – Law_Crime
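As an illustration of how the codes A1 to A16 relate to the category names, the list above can be held in a named vector and used to label a score vector. The sketch below reuses the two scores shown in Table 1; setting every other category to 0 is an assumption made only for the example, and the variable names are hypothetical.

```r
# Tag codes and category names, as listed above.
tag_names <- c(
  A1 = "Religion_Belief",     A2 = "War_Conflict",
  A3 = "Business_Finance",    A4 = "Health_Medical_Pharma",
  A5 = "Labor",               A6 = "Entertainment_Culture",
  A7 = "Social Issues",       A8 = "Education",
  A9 = "Technology_Internet", A10 = "Environment_Agriculture",
  A11 = "Human Interest",     A12 = "Hospitality_Recreation",
  A13 = "Disaster_Accident",  A14 = "Politics",
  A15 = "Sports",             A16 = "Law_Crime"
)

# Scores for the Table 1 example; categories not returned by the
# service are set to 0 here (an assumption for illustration).
scores <- setNames(numeric(length(tag_names)), names(tag_names))
scores[c("A3", "A9")] <- c(1.000, 0.985)

# Codes of the two highest-scoring categories and their names.
top_codes <- names(sort(scores, decreasing = TRUE))[1:2]
top_categories <- tag_names[top_codes]
```

Here `top_categories` contains Business_Finance (A3) and Technology_Internet (A9), matching the output shown in Table 1.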
The final PermID encoded Questionnaire
For the 529 startup companies registered in the questionnaire, the final PermID-encoded questionnaire is presented in the table BH_OCC_wStatus-IDSorted_24-Mar-2021.csv ( https://github.com/MoiraCorp/Compliance-Testing-Fairness-Assessment-using-R/blob/main/permid-preprocess/BH_OCC_wStatus-IDSorted_24-Mar-2021.csv ).
Regarding this data table, it is important to note:
- Each of the semantic tags A1 to A16 is represented by a probability of being associated with the descriptive text submitted by each startup.
- The “Status” column records the actions taken by the VC firm staff regarding a particular startup submission dossier.
The possible “Status” codes are:
- 1. Form Submitted
- 2. Initial Contact Phase
- 3. First interview
- 4. Deal call candidate
- 5. Deal Call Pre-selection
- 6. Q&A
- 7. Deal call TOP 10
- 8. Deal call TOP 5
- 9. BAQ
- 16. Closed
- 19. BAQ not returned
- 20. Further to follow
- 21. Discontinued
- 22. Dismissed
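The status codes above can be turned into a lookup vector, and the tag columns can be checked to lie in the range [0, 1] expected of probabilities. The sketch below runs on a three-row toy data frame. The column names `ID`, `A3`, `A9` and `Status`, and all values in the toy rows, are assumptions; the real CSV may use different names and should be inspected first.

```r
# Status codes and labels, as listed above.
status_labels <- c(
  "1" = "Form Submitted",           "2" = "Initial Contact Phase",
  "3" = "First interview",          "4" = "Deal call candidate",
  "5" = "Deal Call Pre-selection",  "6" = "Q&A",
  "7" = "Deal call TOP 10",         "8" = "Deal call TOP 5",
  "9" = "BAQ",                      "16" = "Closed",
  "19" = "BAQ not returned",        "20" = "Further to follow",
  "21" = "Discontinued",            "22" = "Dismissed"
)

# Toy stand-in for the encoded table (hypothetical names and values).
encoded <- data.frame(
  ID = c(101, 102, 103),
  A3 = c(1.000, 0.420, 0.000),
  A9 = c(0.985, 0.000, 0.730),
  Status = c(1, 8, 22)
)

# Check that every tag column present holds values in [0, 1].
tag_cols <- intersect(paste0("A", 1:16), names(encoded))
tags_ok <- vapply(
  encoded[tag_cols],
  function(x) all(!is.na(x) & x >= 0 & x <= 1),
  logical(1)
)

# Translate numeric status codes into labels; codes missing from
# the lookup become NA and are flagged.
encoded$StatusLabel <- unname(status_labels[as.character(encoded$Status)])
unknown_status <- is.na(encoded$StatusLabel)
```

With a local copy of the CSV, the toy data frame could be replaced by the result of `read.csv()`, after adjusting the column names to those actually present in the file.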