Data-driven VC #5: How to create a single source of truth for startups?
Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 1,646, +203 since last week
Welcome to so many new readers this week! To bring everyone up to speed, I’d like to start with a quick run-through of the previous episodes and their major takeaways.
In episode#1, I summarized why VC is broken and found that 2/3 of the VC value is created in the sourcing and screening stages. Said differently, VC is a “finding and picking the winners game” that consists of two different tasks: 1) comprehensive identification coverage at the top of the funnel and 2) sufficient enrichment data to cut through the noise. Building on this, episode#2 explores a variety of novel human-centric sourcing approaches and why comprehensive coverage is only possible with data-driven methods. Episode#3 then compares “make versus buy” with the result that a hybrid setup is the way to go. Moreover, it contains my database benchmarking study from 2020 with the finding that Crunchbase offers the best value for money. Subsequently, episode#4 describes how the foundation of Crunchbase (or any other commercial dataset) can be complemented by scraping and crawling alternative data sources.
Assuming all previous steps have been properly implemented, we end up with comprehensive coverage and multiple datasets that altogether include every startup company in the world. At least once… or twice… or more… Let the clone wars begin!
How to deal with duplicates?
Given that we collect data from multiple data sources, we face two major problems (aka “the clones” or duplicates):
Redundancy/duplicates in companies
Redundancy/duplicates in features of the respective companies
The figure below shows a simple example of the first problem, redundancy/duplicates in companies, across three different data sources: Crunchbase (CB), LinkedIn (LI) and Companies House (CH, the UK public register). There is some overlap between LI and CB, some between CB and CH, some between CH and LI and, in the tiniest part in the middle, some between LI, CB and CH. As a result, many companies are included once (no overlap), some of them twice (overlap between two sources) and a few of them even in all three datasets (overlap of all three sources in the middle).
To properly present or extract value from the data, we first need to remove company redundancy and create a single source of truth. Every company needs to be included once, no more and no less.
The holy grail: Entity matching
Entity matching, entity linking, re-identification, record linkage and many more terms all describe the same task: identify all record pairs across data sources that refer to the same entity. There are basically two buckets of methods: rule-based and deep learning-based ones.
Rule-based fuzzy string matching, also known as approximate string matching, is the process of finding strings that approximately match a pattern. A good example is Fuzzywuzzy, a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It solves problems where an entity such as a person’s or a company’s name can be labeled differently across different sources.
The code snippets below import the library and run a similarity match on the exemplary strings “Visionary Startup#1 Limited” and “Visionary S#1 LTD”; the partial_ratio() call uses a person-name pair to show how substring matching behaves.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("Visionary Startup#1 Limited","Visionary S#1 LTD")
#94

fuzz.partial_ratio("Catherine M. Gitau","Catherine Gitau")
#100
In the following example, I’ve reordered “Visionary Startup#1 Limited” into “Startup#1 Visionary Limited” to showcase the impact of word order.
fuzz.ratio("Startup#1 Visionary Limited","Visionary S#1 LTD")
#52

fuzz.partial_ratio("Startup#1 Visionary Limited","Visionary S#1 LTD")
#58
We see that both methods return low scores. This can be rectified by using the token_sort_ratio() method, which attempts to account for similar strings that are out of order. If we use the strings above again with token_sort_ratio(), we get the following:
fuzz.token_sort_ratio("Startup#1 Visionary Limited","Visionary S#1 LTD")
#96
So far, so clear for the fuzzy string matching. Next, we need to ask the question “which features are likely to be included across sources and are suitable for fuzzy string matching?” The most obvious ones are “company name”, “website URL” and “headquarters/location”. Therefore, we need a logic that runs fuzzy string matching on the company names, website URLs and headquarters/locations of all companies in the dataset, as sketched below.
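To make this more tangible, below is a minimal sketch of such a matching logic for a single pair of records. The field names, example values and the choice of fuzz methods are illustrative assumptions, not a description of a production pipeline.

from fuzzywuzzy import fuzz

# Hypothetical records from two sources (e.g. CB and CH); field names
# and values are assumptions for this sketch.
company_cb = {"name": "Visionary Startup#1 Limited",
              "url": "visionarystartup1.com",
              "location": "London, UK"}
company_ch = {"name": "Visionary S#1 LTD",
              "url": "www.visionarystartup1.com",
              "location": "London, United Kingdom"}

def feature_scores(a, b):
    # One fuzzy score (0-100) per feature that both sources share.
    return {
        "name": fuzz.token_sort_ratio(a["name"], b["name"]),
        "url": fuzz.partial_ratio(a["url"], b["url"]),
        "location": fuzz.token_set_ratio(a["location"], b["location"]),
    }

print(feature_scores(company_cb, company_ch))

token_sort_ratio() and token_set_ratio() are chosen here because names and locations often differ mainly in word order and extra tokens; in production, this comparison would run over candidate pairs across all datasets.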
We can even add more rule-based layers, such as calculating the physical distance between the zip codes of the headquarters or creating a dictionary of keywords that can be identified in the descriptions. The occurrence/frequency of specific keywords across descriptions can then be measured and compared across companies as a similarity signal; a sketch of this keyword layer follows below. Read this paper on matching CB and patent data for further rule-based fuzzy string-matching ideas specifically for startups.
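As a minimal sketch of the keyword layer, assume a hand-curated dictionary; the keywords and descriptions below are made up for illustration.

# Dictionary of keywords to look for in company descriptions
# (hypothetical; a real dictionary would be much larger).
KEYWORDS = {"fintech", "payments", "lending", "saas", "marketplace"}

def keyword_set(description):
    # Keep only the dictionary keywords that occur in a description.
    return set(description.lower().split()) & KEYWORDS

def keyword_similarity(desc_a, desc_b):
    # Jaccard overlap of the matched keywords, between 0 and 1.
    a, b = keyword_set(desc_a), keyword_set(desc_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

print(keyword_similarity(
    "SaaS platform for fintech lending",
    "Fintech company offering SaaS lending software",
))
# 1.0 -- both descriptions match {"saas", "fintech", "lending"}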
Subsequently, we need to aggregate the individual similarity scores across features into one unified similarity score per compared company pair, i.e. it’s important to score high similarity across features, not only in one or a few. The unified similarity score threshold separating “match” from “no match” depends a lot on the number of data sources and included features. Therefore, I suggest playing around a bit and manually finding the optimal threshold for automated matching in your specific setup; a sketch follows below.
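Here is a sketch of that aggregation step, with illustrative weights and an assumed threshold of 85; both are exactly the knobs you would tune as described above.

# Illustrative weights and threshold -- tune these for your own setup.
WEIGHTS = {"name": 0.5, "url": 0.3, "location": 0.2}
MATCH_THRESHOLD = 85

def unified_score(scores):
    # Weighted average of the per-feature similarity scores (0-100).
    return sum(WEIGHTS[f] * s for f, s in scores.items())

def is_match(scores):
    return unified_score(scores) >= MATCH_THRESHOLD

print(is_match({"name": 96, "url": 100, "location": 90}))  # True (96.0)
print(is_match({"name": 100, "url": 20, "location": 30}))  # False (62.0): one strong feature is not enough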
Neural networks and deep learning-based matching approaches are able to learn useful features from relatively unstructured input data. Central to all these methods is how text is transformed into a numerical format suitable for a neural network. This is done through embeddings, which are translations from textual units to a vector space – traditionally available in a lookup table. The textual units will usually be characters or words.
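To make the embedding idea concrete, here is a minimal sketch using the sentence-transformers library with a pre-trained model; this is one possible setup picked for illustration, not the character- or word-level embedding training described above.

# Assumed setup: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Pre-trained encoder (the model choice is an illustrative assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")

names = ["Visionary Startup#1 Limited", "Visionary S#1 LTD"]
embeddings = model.encode(names)  # one vector per string

# Cosine similarity of the two vectors; values close to 1 suggest
# the records refer to the same entity.
print(util.cos_sim(embeddings[0], embeddings[1]))

Such dense vectors can then feed a trained pairwise classifier or, more simply, a cosine-similarity threshold analogous to the rule-based unified score above.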