Data-driven VC #3: Make versus buy + the best database to start with
Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 1,335, +201 since last week
Tl;dr
You need a GP sponsor (!)
Follow a clear framework to answer make versus buy
Go for a hybrid setup as it provides high independence with low maintenance at manageable costs
Do your homework and compare commercially available datasets to identify the best entry point for building your own solution; among startup datasets, Crunchbase delivers the best value for money
Gradually complement the foundation with additional identification and enrichment sources
“Venture capitalists (VCs), who fall into the class of individuals and institutions that manage private capital assets, and private companies share a common denominator: the majority of their data are kept private.” This sentence was not only the introduction to my PhD dissertation on “ML and the value of data in VC” but also the most frequent argument, a few years back, for why the majority of VCs did not believe in data-driven approaches. Combined with the claim that “data is hard to collect, sparse, not reliable and biased”, it felt to me more like protective behavior of the old VC world than a logical argument not to even try.
Thankfully, “the data guys” started to experiment and sentiment has shifted: we hear from our Limited Partners (LPs) that a growing number of VCs nowadays claim to be “some kind of data-driven”. So in less than five years, perspectives flipped from “no way, impossible, doesn’t make any sense” to “yes, of course, obvious”. While long feedback cycles and the lack of immediately tangible results historically prevented most VCs from significant upfront investments into engineering teams, datasets and infrastructure, growing competition among VCs, together with LPs proactively asking about a firm’s data-driven initiatives, has started to spur this movement.
Although IMHO the data-driven efforts of most VC firms amount to window dressing today, i.e. some fancy slides in the fundraising deck and a working student stitching together some datasets, I believe it’s still day one and urge everyone to manage expectations. Only with a healthy balance between promise (fundraising slides) and reality (what’s under the hood) will we fulfill expectations and gain the freedom required to explore and transform the whole VC industry.
Make versus buy?
The question has multiple dimensions and I’d like to structure the four most important ones in a top-down sequence:
Fund size and General Partner (GP) sponsor: The fund size together with the management fee (typically 2-3% of the fund size p.a.) defines the available resources to operate a fund. Deduct costs for office, team, travel, due diligence, equipment, etc., and you’ll end up with the free resources. While for smaller funds there is oftentimes not much left, larger ones might end up with a surplus that is either distributed among the General Partners (GPs) or available for investments into the future of the firm.
From my own experience and that of many befriended data-driven VCs, I can tell that the make-or-break factor for a proper data-driven strategy (as for many other upfront investments into long-term firm initiatives) is at least one sponsor among the GPs. Why? Because short-term oriented GPs will always cash out the remainder of the management fee, whereas only long-term oriented GPs are willing to invest in the future of the firm. Assuming a long-term orientation and a GP sponsor for data-driven approaches, the result of the above calculation defines whether a VC needs to buy external solutions (i.e. if few resources are available, you need to go for a cheaper but less sophisticated and less differentiated solution) or has the resources to build in-house; a back-of-the-envelope sketch follows below.
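To make the budget logic concrete, here is a minimal back-of-the-envelope sketch in Python. All figures (fund size, fee rate, cost items) are purely illustrative assumptions, not benchmarks for any particular fund.

```python
# Back-of-the-envelope: how much of the management fee is left for
# long-term initiatives such as a data-driven stack?
# All numbers below are illustrative assumptions.

fund_size = 150_000_000      # EUR, hypothetical fund size
fee_rate = 0.02              # 2% management fee p.a.

annual_fee = fund_size * fee_rate

annual_costs = {
    "team": 1_600_000,
    "office": 250_000,
    "travel": 150_000,
    "due_diligence": 200_000,
    "equipment_and_tools": 100_000,
}

free_resources = annual_fee - sum(annual_costs.values())

print(f"Management fee p.a.:  {annual_fee:>12,.0f}")
print(f"Operating costs p.a.: {sum(annual_costs.values()):>12,.0f}")
print(f"Free resources p.a.:  {free_resources:>12,.0f}")
```

Whether that bottom line is a few hundred thousand or close to zero is what determines the realistic build-versus-buy corridor for the firm.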
Value creation and dependency: Build core and buy non-core. For the sourcing and screening part, the value creation from an engineering perspective increases from beginning to end. Starting with data collection, we can either buy comprehensive datasets from commercial aggregators like Crunchbase, CBInsights, Dealroom, PitchBook and co. or leverage more focused crawling services like PhantomBuster or Apify for social media data and other more specific sources. All of them bear little downside risk as there is lots of redundancy: if one service shuts down or changes its pricing, you can replace it fairly easily without too much of an impact on your overall processes. Other than from an economic/price standpoint (external crawling services can become very expensive over time/volume), there is little reason to collect broadly available data yourself.
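As an illustration of the “buy” path, here is a minimal sketch of ingesting startup records from a commercial aggregator over a REST API. The base URL, authentication scheme, endpoint and field names are placeholders made up for this example; every vendor’s actual API looks different.

```python
# Minimal sketch: pulling organization records from a commercial data
# aggregator's REST API. The endpoint, parameters and field names are
# placeholders for illustration only, not any specific vendor's API.
import os
import requests

API_BASE = "https://api.example-aggregator.com/v1"   # hypothetical
API_KEY = os.environ["AGGREGATOR_API_KEY"]           # hypothetical env var


def fetch_organizations(founded_after: str, page: int = 1) -> list[dict]:
    """Fetch one page of startup records founded after a given date."""
    response = requests.get(
        f"{API_BASE}/organizations",
        params={"founded_after": founded_after, "page": page},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]


if __name__ == "__main__":
    batch = fetch_organizations(founded_after="2021-01-01")
    print(f"Fetched {len(batch)} organizations")
```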
Only thereafter does the actual value creation begin: entity matching, where we tie all datasets together, remove duplicates and make sure we obtain a single source of truth (a minimal sketch follows below). This is crucial, and I prefer to keep it in-house. Even more important are the subsequent feature engineering and the training of the screening models. These signals steer investment professionals toward the most promising investment opportunities and are the core of a data-driven strategy. They are the secret sauce and, in the case of external solutions, hard to replace as you never know what’s under the hood. Consequently, I prefer to keep these efforts in-house as well.
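Here is a minimal sketch of the deduplication step using only the Python standard library. The record fields and company names are made-up examples; real pipelines typically rely on dedicated record-linkage tooling plus more robust blocking and matching rules.

```python
# Minimal entity-matching sketch: merge startup records from two sources
# into a single source of truth. This only illustrates the idea; production
# pipelines use dedicated record-linkage tooling.
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase and strip common legal suffixes so name comparison is fairer."""
    name = name.lower().strip()
    for suffix in (" inc", " gmbh", " ltd", " llc", " sas", " ab"):
        name = name.removesuffix(suffix)
    return name


def is_same_company(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Match on website domain if available, otherwise on fuzzy name similarity."""
    if a.get("domain") and a.get("domain") == b.get("domain"):
        return True
    similarity = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return similarity >= threshold


def merge_sources(source_a: list[dict], source_b: list[dict]) -> list[dict]:
    """Union of both sources with duplicates collapsed (first source wins)."""
    merged = list(source_a)
    for record in source_b:
        if not any(is_same_company(record, existing) for existing in merged):
            merged.append(record)
    return merged


# Tiny illustrative example with made-up records
aggregator_a = [{"name": "Acme Robotics Inc", "domain": "acmerobotics.com"}]
aggregator_b = [
    {"name": "Acme Robotics", "domain": "acmerobotics.com"},
    {"name": "Beta Analytics GmbH", "domain": "betaanalytics.io"},
]
print(merge_sources(aggregator_a, aggregator_b))
```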
Maintenance: While keeping everything in-house would be the most desirable solution in terms of value creation and dependency, we need to keep in mind that the more we insource (outsource), the higher (lower) the maintenance effort. This question is particularly relevant for data collection, as running your own crawlers not only requires a proper proxy server infrastructure but also continuous adjustments to the crawlers themselves. Whenever a button moves on the target website, we need to adjust the crawlers (or come up with an abstraction logic to scale them, as sketched below).
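One common way to keep that maintenance burden manageable is to separate the site-specific parts (URLs, CSS selectors) from the crawling code, so a layout change only requires a config update rather than a code change. Below is a minimal sketch of that idea using requests and BeautifulSoup; the site name, URL and selectors are made-up placeholders, and a production setup would add proxies, rate limiting and error handling on top.

```python
# Minimal sketch of config-driven crawling: site-specific CSS selectors live
# in a config dict, so a layout change means editing config, not code.
# Site names, URLs and selectors below are made-up placeholders.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

SITE_CONFIGS = {
    "example-startup-directory": {
        "url": "https://example.com/startups",
        "selectors": {
            "card": "div.startup-card",
            "name": "h3.name",
            "website": "a.website",
        },
    },
}


def crawl(site: str) -> list[dict]:
    """Fetch one listing page and extract records according to the site's config."""
    config = SITE_CONFIGS[site]
    html = requests.get(config["url"], timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    records = []
    for card in soup.select(config["selectors"]["card"]):
        name = card.select_one(config["selectors"]["name"])
        website = card.select_one(config["selectors"]["website"])
        records.append({
            "name": name.get_text(strip=True) if name else None,
            "website": website["href"] if website else None,
        })
    return records


if __name__ == "__main__":
    print(crawl("example-startup-directory"))
```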