Data-driven VC #27: The Power of GPT-4 & LLMs in Venture Capital

👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.


Current subscribers: 6,700+, +140 since last week

Brought to you by Affinity - Find, manage, and close more deals with Affinity

Affinity Campfire brings together industry-leading dealmakers to explore what it means to be a part of the leading relationship intelligence ecosystem. Dive deeper into the importance of data-driven sourcing and what dealmaking will look like in 2023 as we navigate a changing global landscape.


What a crazy week 🤯 I don’t want to be the next smartass analyzing what has happened, but please let’s take this whole SVB thing as a warning shot and be more thoughtful about how and where we deposit our money in the future. It’s our responsibility as VCs to not forget about it and advise founders accordingly.


Learning our lessons and leaving SVB behind us, let’s get back to the fun stuff: Generative AI! Two days ago, OpenAI launched the long-anticipated GPT-4 model. My first impression? Incredible advancements for sure, but not as mind-blowing as I personally expected.

On the pro side, we got an increased word limit, more creativity, better reliability and accuracy so that it can ace standardized tests, but on the con side, the model still hallucinates and can be hacked to bypass its guardrails. And the multi-modality? Well, our portfolio company Aleph Alpha invented and launched this as part of their model MAGMA in April 2022, so it’s really just a functionality catch-up for OpenAI.

The most concerning? I’m disappointed that OpenAI, which started as an open-source non-profit organization, has become closed-source and puts profits over the community. No details about model architecture, size, training infra etc. have been disclosed yet. But that’s a topic in itself..

Independent of these concerns, the impact of large language models (LLMs) will be significant all across and for me, it’s actually less about the pros and cons of a specific model version from a specific company, but more about the gradient of innovation in AI more broadly. Just extrapolate recent developments by another year or two into the future..

..which makes me think again about the impact of LLMs on our industry. Following my “10x your productivity with ChatGPT” and “What ChatGPT means for the future of startup funding” posts earlier this year, Vlasti, one of my readers, kindly reached out and shared some valuable perspectives on how his team incorporates LLMs into startup scouting workflows. Some fruitful conversations later, I’m incredibly excited to have Vlastimil Vodička, CEO and Founder of Leadspicker, share his “Guide to AI-based startup sourcing” with us in the guest post below.

By reading this post, you will learn:

Which data sources we use and how to scrape them
How you can get data from LinkedIn (without getting your account blocked)
How to utilize Machine Learning (ML) to distinguish between a startup and an ordinary company
The role LLMs play in our stack, along with an example of a prompt
What's new with OpenAI GPT-4
How we find up-to-date founders' information and verified contact details

We have tested many tools and LLMs, and in this post I'm going to share with you the secret sauce that powers Startup Scout by Leadspicker, allowing us to discover and categorize over 100,000 new unique startup companies each year, months before they become visible in any database such as Crunchbase.

Since 2016 we have completed 750 scouting projects in 42 countries, helping to identify and approach the most exciting projects out there. Over the years, this process has been refined and optimized to the point where it has replaced 35 analysts (interns/students) who previously helped us manually research and clean data, making the sourcing process very cost-efficient and effective.

Pic. 1: Exploring Our Reach: A Map of Successful Scouting Projects

Challenges of AI-based startup scouting for VCs

Researching new companies to expand your deal flow funnel can be challenging and time-consuming. There are several pain points that you can experience:

Identifying the right data sources for scouting can be quite a challenge with so many options to choose from. While traditional Crunchbase-like platforms such as Dealroom, Pitchbook, and Tracxn can be great starting points (thankfully Andre shared his database benchmarking results here), it's important to remember that most founders, especially those outside the US, don’t create a Crunchbase profile as the first thing when starting their company. Instead, public sources like LinkedIn, Facebook groups, and other online communities can be a much better resource for discovering new, promising firms.
Finding good quality and relevancy is especially important in VC, as the screening capacity is always limited and hard to scale. Also, it's important to keep in mind that what defines an interesting company can vary from fund to fund. For example, an impact or a food-oriented VC fund might have a different scope than a B2B SaaS or FinTech fund. So, it can feel like searching for a needle in a haystack when you're trying to find ideas that fit specific niches or technologies.
Right people with contact details: When researching relevant startup companies, it's essential to also identify the key people in charge, typically the founders, and their contact details. This is especially important for funds with an active deal-generation approach.
Deduplication and data merging: When using multiple sources, it's common to discover the same company on different web pages. To avoid duplicates and ensure that you are not rediscovering companies and people that you have already found, it's critical to keep your data clean and organized through deduplication and data merging. Andre has shared some valuable learnings on this here.
Timing is crucial when searching for new founders: there is a limited window in which a researched company is most relevant, and being the first to identify and approach it can give you a competitive advantage over other funds.
When running outbound outreach, the key is to ensure high personalization, deliverability, open rate, and reply rate by using the right tools and techniques.
Blacklists and Deconfliction Rules: Operating across different funds in multiple industries or countries requires implementing blacklists and deconfliction rules to ensure that you don't reach out to people you shouldn't. This includes not contacting people who have already been contacted by your colleagues.
Ensuring scalability and consistency is critical for staying competitive. The key is to implement the right tools and systems to manage and organize data and automate processes as much as possible.
The rising costs: Improving effectiveness by automating or streamlining processes is cheaper than hiring more people.
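To make the deduplication and data merging point more concrete, here is a minimal Python sketch. It is an illustration only, not the pipeline described in this post: it assumes each scraped record is a plain dictionary with hypothetical `website` and `source` fields, and it treats two records as the same company when their websites share a domain.

```python
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Reduce a URL to a bare lowercase host so variants of one site compare equal."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlparse only fills netloc when a scheme is present
    host = urlparse(url).netloc.split(":")[0]  # drop an optional port
    return host[4:] if host.startswith("www.") else host


def merge_records(records: list[dict]) -> list[dict]:
    """Merge records that share a domain; the first non-empty value per field wins."""
    merged: dict[str, dict] = {}
    for record in records:
        key = normalize_domain(record["website"])
        target = merged.setdefault(key, {"domain": key, "sources": []})
        target["sources"].append(record["source"])
        for field, value in record.items():
            if field in ("website", "source"):
                continue
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


# Hypothetical records, as they might arrive from three different sources.
records = [
    {"website": "https://www.example.com/", "source": "linkedin",
     "name": "Example", "founder": ""},
    {"website": "example.com", "source": "local_directory",
     "name": "Example Ltd", "founder": "Jane Doe"},
    {"website": "https://other.example.org", "source": "tech_media",
     "name": "Other", "founder": ""},
]
companies = merge_records(records)
```

Matching on the domain is a deliberately simple choice. Real data usually needs fuzzier matching as well (company names, LinkedIn URLs), because not every source lists a website.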

Leveraging ML to empower human intelligence

At Leadspicker, we've built a unique data-driven sourcing process that connects the dots between the latest technology and human expertise to deliver efficiency and accuracy that was not achievable a year ago.

We reviewed many novel tools, including the newly released LLMs from OpenAI and others. Our process has been refined over the years to enable us to generate deal flow from all over the world. In the next chapters, I'll uncover our process step-by-step, which is also illustrated in the picture below, and share the technology we use at each stage, along with its benefits.

Pic. 2: Our AI-Driven Scouting, Campaign Management, and Evaluation Process

Each part of the process shown above plays an important role in identifying and connecting with the most promising projects worldwide. In the next paragraphs, we'll guide you through the startup sourcing process step-by-step, explaining where our data comes from and how we find relevant and up-to-date information, including founders' contact details.

Discovering the data sources: Where to find input datasets for next-gen AI models

We gathered over 1,000 data sources (find an overview by Andre here) to find new companies from around the world. I know, that’s a lot of data. But with the latest developments in scraping tools such as Apify or Browse.ai, you can do it too!

One of the most important data sources for us is LinkedIn. We leverage LinkedIn to find anyone who has recently set their job title to “founder” OR “co-founder” OR “CEO” OR “CTO”, or to another relevant position.

The use of LinkedIn data enables us to identify:

All new founders (< 1 month) in a given geography
Operators from your portfolio companies who have left to start something new
Diaspora or alumni from prestigious universities who are running a new company
Ex-high-value company employees who left and became founders
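As an illustration of the title filter and the "new founders (< 1 month)" criterion, here is a minimal Python sketch. It assumes the profile data has already been collected into dictionaries; the `title`, `started`, and `country` fields are hypothetical names, and no LinkedIn API is used or implied.

```python
import re
from datetime import date, timedelta

# Whole-word match, so "Director" does not accidentally match "cto".
FOUNDER_TITLE = re.compile(r"\b(?:co-?founder|founder|ceo|cto)\b", re.IGNORECASE)


def is_new_founder(profile: dict, today: date, country: str,
                   max_age_days: int = 30) -> bool:
    """True if the profile has a founder-like title that started recently in `country`."""
    has_title = FOUNDER_TITLE.search(profile["title"]) is not None
    age = today - profile["started"]
    is_recent = timedelta(0) <= age <= timedelta(days=max_age_days)
    return has_title and is_recent and profile["country"] == country


# Hypothetical profile records.
profiles = [
    {"name": "Alice", "title": "Co-Founder & CTO",
     "started": date(2023, 3, 1), "country": "Austria"},
    {"name": "Bob", "title": "Director of Sales",
     "started": date(2023, 3, 5), "country": "Austria"},
    {"name": "Carol", "title": "Founder",
     "started": date(2022, 6, 1), "country": "Austria"},
]
today = date(2023, 3, 16)
new_founders = [p["name"] for p in profiles if is_new_founder(p, today, "Austria")]
```

The same predicate could be extended with a previous-employer or university field to cover the portfolio-operator and alumni cases from the list above.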

Here is the list of the most relevant data sources

Regularly monitored data sources:

LinkedIn profiles of anyone who has recently set their job title to founder, co-founder, CEO, COO, or CTO, or who is working in stealth mode at a new company
Major global databases: We use Crunchbase-like platforms, as well as directories such as Product Hunt and crowdfunding platforms, to discover new and emerging companies. If you don’t know where to start, these could be a great starting point.
Minor (local) startup directories: We also use smaller, more localized websites that focus on specific regions or industries that might be overlooked.
Tech media: We also monitor media outlets and blogs like TechCrunch and Sifted, as well as local outlets like CzechCrunch, to stay up-to-date on the latest trending companies in the world.
Relevant Facebook groups and other communities: We leverage Facebook groups and other online communities to identify founders and other key personnel. Example: https://www.facebook.com/groups/AustrianStartupPinwall/

Ad-hoc data sources:

Conference networking apps: We use apps that are designed to facilitate networking at tech conferences, which can help us to identify new companies and connect with founders and investors. After the conference, we create a list of attendees (usually consisting of name, company name, and company type) from the app. Then we use our tools to enrich this data with the right URLs, LinkedIn profiles, contact details, and other relevant information.
Portfolio of incubators, competitions, accelerators, and other VCs: We explore the websites of these organizations to identify firms that they are working with or have invested in, which can provide valuable insights into new and promising companies.
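The conference workflow above (export an attendee list, then enrich it) could be prepared along these lines. This is a sketch under assumptions: the CSV column names and the enrichment fields are hypothetical, and the enrichment step itself is left out because it depends on the tools you use.

```python
import csv
import io

# Fields to be filled in later by an enrichment step (not shown here).
ENRICHMENT_FIELDS = ("website", "linkedin_url", "email")


def load_attendees(csv_text: str, company_type: str = "startup") -> list[dict]:
    """Parse an attendee export and keep only rows of the wanted company type."""
    attendees = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["company_type"].strip().lower() != company_type:
            continue
        record = {key: value.strip() for key, value in row.items()}
        # Empty placeholders mark what still needs to be enriched.
        record.update({field: None for field in ENRICHMENT_FIELDS})
        attendees.append(record)
    return attendees


# Hypothetical export from a conference networking app.
export = """name,company_name,company_type
Jane Doe,Example Labs,Startup
John Roe,Big Corp,Corporate
"""
queue = load_attendees(export)
```

Each record in `queue` then carries the original attendee data plus empty slots for the URL, LinkedIn profile, and contact details.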
