Only 4 days left until our Virtual DDVC Summit 23-25th March - Learn how Accel, Atomico, Bessemer, BlackRock, NEA, NFX, LocalGlobe, Point Nine, and more use tools like OpenClaw, Claude, n8n or Harmonic to generate alpha


Brought to you by Harmonic - The Startup Discovery Engine

Scout is the AI for investors.

It understands private markets and the people behind them so you can:

  • Map markets

  • Find founders

  • Evaluate companies

  • Draft outreach

  • And so much more…

All in one conversation.


In 2022, knowing how to scrape LinkedIn for founders in stealth mode made you one of the best-sourced investors in the room. In 2026, it makes you average.

That is the compressed version of what happened to alternative data in VC. The longer version is more nuanced, and it kept us busy at Earlybird for a long time.

In today's episode, I share what our external data stack actually looked like, what our 2025 benchmark revealed, why we migrated to Harmonic as our primary data backbone, and where we believe the real edge in data is moving.

Let’s dive in.

What we used to run and why it stopped making sense

We started our digitization journey at Earlybird in 2017. Our stack started simple but grew complex faster than we anticipated.

Why? Because a) the vendor landscape expanded rapidly, growing from 100 solutions in 2017 to 1k+ by 2025, b) we needed to stitch together various solutions to cover key parts of the value chain, and c) a lot was still not covered by anyone, so we needed to build it ourselves.

For data, Crunchbase was our baseline, with dozens of additional sources like PitchBook, CBInsights, or Preqin layered on top, a full web scraping infrastructure above that, and the engineering work to merge, deduplicate, and reconcile all of it into something usable.
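To make the "merge, deduplicate, and reconcile" step concrete, here is a minimal sketch of one common approach. It is not Earlybird's actual pipeline: the field names, the use of a website domain as the join key, and the source-priority rule are all assumptions made for illustration.

```python
def merge_records(records_by_source, priority):
    """Merge company records from several sources into one record per company.

    Assumptions (hypothetical, for illustration only):
    - every record is a dict with a "domain" field used as the join key
    - sources listed earlier in `priority` win when two sources disagree
    - a value of None means "this source has no data for the field"
    """
    merged = {}
    for source in priority:
        for rec in records_by_source.get(source, []):
            key = rec["domain"].strip().lower()  # normalize the join key
            target = merged.setdefault(key, {"domain": key})
            for field, value in rec.items():
                # Keep the first non-empty value seen, i.e. the one from
                # the highest-priority source that has it.
                if value is not None and field not in target:
                    target[field] = value
    return merged


sources = {
    "provider_a": [{"domain": "Acme.io", "name": "Acme", "founded": None}],
    "provider_b": [
        {"domain": "acme.io", "name": "ACME Inc", "founded": 2020},
        {"domain": "beta.com", "name": "Beta", "founded": 2019},
    ],
}
merged = merge_records(sources, priority=["provider_a", "provider_b"])
```

In this toy example the two Acme records collapse into one: the name comes from the higher-priority source and the founding year is filled in from the second. Real entity resolution is much messier (renamed companies, missing domains, stealth founders), which is part of why this layer was so expensive to maintain.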

In 2024, we reached the peak of complexity and decided to start simplifying, gradually cutting the vendor list in half. Historically, each data provider had a distinct angle (stage focus, industry, or geography), which meant you had to stack them to get anywhere near full coverage. That stacking logic gradually broke down as providers converged in both coverage and accuracy, a finding consistent across the several benchmarking studies I have published over the years. As redundancy grew, trimming became easier.

In 2025, we then asked ourselves a more fundamental question: Do we still generate alpha with our large-scale scraping infra and public data layering?

For the first time, we benchmarked data providers not only against original data from investment documents, as in my previous studies, but also against our fully merged database that contains all public and private information available to Earlybird. It is the most comprehensive data benchmark we have conducted to date.

We assembled a gold dataset of 1,000 companies from our internally merged single source of truth, deliberately spanning the full spectrum: non-incorporated stealth founders, pre-seed companies with no public footprint, seed and Series A companies across geographies and sectors, and growth-stage companies with well-documented histories. We then tested coverage and accuracy against both the traditional providers from prior studies and next-gen players.

The results were decisive. Traditional providers landed around 75% coverage against our gold dataset. Harmonic achieved 98%. That was a number we had not seen before. For the first time, a single provider came close to matching the combined output of our entire stacked operation.
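For readers who want to run a similar check on their own data, coverage against a gold dataset can be sketched as below. This is an illustration under stated assumptions, not the methodology of our benchmark: it assumes every company can be identified by a website domain, which does not hold for stealth founders (those would need a different key, such as a person identifier), and the function names are hypothetical.

```python
def normalize_domain(domain: str) -> str:
    """Reduce a URL-like string to a bare lowercase domain."""
    d = domain.strip().lower()
    for prefix in ("https://", "http://"):
        if d.startswith(prefix):
            d = d[len(prefix):]
    if d.startswith("www."):
        d = d[len("www."):]
    return d.split("/")[0]  # drop any path


def coverage(gold, provider) -> float:
    """Share of gold-dataset companies that also appear in the provider's data."""
    gold_keys = {normalize_domain(d) for d in gold}
    provider_keys = {normalize_domain(d) for d in provider}
    if not gold_keys:
        raise ValueError("gold dataset is empty")
    return len(gold_keys & provider_keys) / len(gold_keys)


# Toy data, not benchmark results.
gold = ["https://www.acme.io/", "beta.com", "Gamma.ai", "delta.dev"]
provider = ["acme.io", "gamma.ai/about", "other.com"]
share = coverage(gold, provider)  # 2 of 4 gold companies found
```

Coverage only says whether a company is present at all. Accuracy, meaning whether the fields the provider returns match the gold record, has to be measured separately on the matched subset.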


Once you see that number, the math on maintaining your own scraping infrastructure and juggling multiple subscriptions changes entirely. With the data at hand, we made a deliberate choice: migrate to Harmonic as our primary public data backbone and ramp down the majority of remaining vendors and internal scraping infra.


What made the data worth consolidating around

The gap between Harmonic and traditional providers is most pronounced at the early end of the spectrum. Every serious provider covers Series B companies with announced rounds and press coverage.

Where PitchBook, CBInsights, Crunchbase, and others consistently fell short was at the stealth, pre-seed, and pre-announcement end, exactly where early-stage investors can make a huge difference. Funding data was also stronger than we expected. This is an area where many investors typically cross-reference across providers, and Harmonic's coverage was thorough enough to rely on.

The second thing that stood out was how the data is structured. Traditional providers treat companies as the primary entity. Harmonic is built people-first: founder tracking, talent movement signals, team composition, network mapping.

This is not a cosmetic difference; it reflects how early-stage investors actually think. You back people before you back companies. A database architecture that treats founders as attributes of a company record rather than as the primary unit of analysis was not designed for how we work.

What this freed us to do

The real question behind this migration was never "which database is best." 

It was: where does alpha actually come from now, and are we spending our time there?

As I wrote in last week's episode, generating alpha will be driven by merging public with private data, incorporating the taste of your investment firm, and making insights actionable.

That belief is what shaped our decision and brings me to a principle I apply to our entire tech stack: if something does not generate alpha and is available to buy, buy it. 

Applied to our data infrastructure, that principle forced a clear separation between what is ours and what is not. Earlybird has thirty years of institutional memory:

  • Company-centric: Reporting and board materials of 250+ portfolio companies. 1k+ investment memos. 100k+ CRM entries. Hundreds of thousands of pitch decks and meeting notes. Not to mention the network and social capital data that accumulates when a firm shows up consistently in the same ecosystem for thirty years.

  • Decision-centric: 500+ documented IC outcomes. 100k+ companies in our platform EagleEye that got reviewed by our investment team. That’s 100k+ documented decisions with reasons for excitement or rejection. That data drives the reinforcement learning, codifying the “taste of Earlybird” and prioritizing not only high “probability of success” opportunities, but the subset of those which are most likely to go through our IC.

That is where our edge lives. Not in having access to a startup database, but in the private, irreplaceable company- and decision-centric information that three decades of disciplined investing produce. The deep insights across the portfolio and “investment taste” of the firm, which no one can buy or scrape.

The freed capacity from consolidating our external data stack is now directed at activating that institutional knowledge. We have built significant internal systems using Claude with a library of custom skills assembled over months, connecting our private data mentioned above, so we can synthesize across all of it.

I can generate an investment proposal for a new company that pulls in everything: people, traction and company data from Harmonic, private context from our own records, prior IC decisions on comparable companies, and notes from every interaction our team has had. The depth is something you could not have imagined even two years ago.

What this looks like without 30 years of data

Most VC firms do not have Earlybird's depth of internal data. But every firm has some version of it, and the question is whether you can activate it.

It took us years of engineering effort and significant resources to build the systems that connect our private data to our public data layer and make it all actionable.

Scout, Harmonic's AI agent for investors, gets you surprisingly close to that out of the box.

It can find companies, screen against a thesis, generate diligence reports and market maps, and run deep research in a single conversation. It can also do something as simple as finding the right founders to invite to a dinner while travelling to a new city.

By combining your team's network, custom scoring criteria, and other firm-specific factors with a comprehensive public data source, you can begin to approach your own version of the alpha described above.

None of this was available before. For the vast majority of the market, this is a step change in what is possible.

A general model that actually understands private markets.

Stay driven,
Andre

PS: Reserve your seat for our Virtual DDVC Summit 2026 where 40+ expert speakers will share their workflows, tool stacks, and discuss the latest insights about AI for VC
