Data-driven VC #27: The Power of GPT-4 & LLMs in Venture Capital
A guide to AI-based startup sourcing by Leadspicker
Hi, I'm Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 6,700+, +140 since last week
Brought to you by Affinity - Find, manage, and close more deals with Affinity
Affinity Campfire brings together industry-leading dealmakers to explore what it means to be a part of the leading relationship intelligence ecosystem. Dive deeper into the importance of data-driven sourcing and what dealmaking will look like in 2023 as we navigate a changing global landscape.
What a crazy week 🤯 I don't want to be the next smartass analyzing what has happened, but please let's take this whole SVB thing as a warning shot and be more thoughtful about how and where we deposit our money in the future. It's our responsibility as VCs to not forget about it and advise founders accordingly.
Learning our lessons and leaving SVB behind us, let's get back to the fun stuff: Generative AI! Two days ago, OpenAI launched the long-anticipated GPT-4 model. My first impression? Surely incredible advancements, but not as mind-blowing as I personally expected.
On the pro side, we got an increased word limit, more creativity, and better reliability and accuracy, so that it can ace standardized tests. On the con side, the model still hallucinates and can be hacked to bypass its guardrails. And the multi-modality? Well, our portfolio company Aleph Alpha invented and launched this as part of their model MAGMA in April 2022, so it's really just a functionality catch-up for OpenAI.
The most concerning part? I'm disappointed that OpenAI, which started as an open-source non-profit organization, has become closed-source and puts profits over the community. No details about model architecture, size, training infra etc. have been disclosed yet. But that's a topic in itself...
Independent of these concerns, the impact of large language models (LLMs) will be significant across the board. For me, it's actually less about the pros and cons of a specific model version from a specific company, and more about the gradient of innovation in AI more broadly. Just extrapolate recent developments by another year or two into the future...
...which makes me think again about the impact of LLMs on our industry. Following my "10x your productivity with ChatGPT" and "What ChatGPT means for the future of startup funding" posts earlier this year, Vlasti, one of my readers, kindly reached out and shared some valuable perspectives on how his team incorporates LLMs into startup scouting workflows. Some fruitful conversations later, I'm incredibly excited to have Vlastimil Vodička, CEO and Founder of Leadspicker, share his "Guide to AI-based startup sourcing" with us in the guest post below.
By reading this post, you will learn:
What data sources do we use and how to scrape them
How you can get data from LinkedIn (without getting your account blocked)
How to utilize Machine Learning (ML) to distinguish between a startup and an ordinary company
The role LLMs play in our stack, along with an example of a prompt
What's new with OpenAI GPT-4?
How do we find up-to-date founders' information and verified contact details
We have tested many tools and LLMs, and in this post I'm going to share with you the secret sauce that powers Startup Scout by Leadspicker, allowing us to discover and categorize over 100,000 new unique startup companies each year, months before they become visible in any database such as Crunchbase.
Since 2016 we have completed 750 scouting projects in 42 countries, helping to identify and approach the most exciting projects out there. Over the years, this process has been refined and optimized to the point where it has replaced 35 analysts (interns/students) who previously helped us manually research and clean data, making the sourcing process very cost-efficient and effective.
Challenges of AI-based startup scouting for VCs
Researching new companies to expand your deal flow funnel can be challenging and time-consuming. There are several pain points that you can experience:
Identifying the right data sources for scouting can be quite a challenge with so many options to choose from. While traditional Crunchbase-like platforms such as Dealroom, Pitchbook, and Tracxn can be great starting points (thankfully Andre shared his database benchmarking results here), it's important to remember that most founders, especially those outside the US, don't create a Crunchbase profile as the first thing when starting their company. Instead, public sources like LinkedIn, Facebook groups, and other online communities can be a much better resource for discovering new, promising firms.
Finding good quality and relevancy is especially important in VC, as the screening capacity is always limited and hard to scale. Also, it's important to keep in mind that what defines an interesting company can vary from fund to fund. For example, an impact or a food-oriented VC fund might have a different scope than a B2B SaaS or FinTech fund. So, it can feel like searching for a needle in a haystack when you're trying to find ideas that fit specific niches or technologies.
Right people with contact details: When researching relevant startup companies, it's essential to also identify the key people in charge, typically the founders, along with their contact details. This is especially important for funds with an active deal-generation approach.
Deduplication and data merging: When using multiple sources, it's common to discover the same company from different web pages. To avoid duplicates and ensure that you are not rediscovering companies and people that you have already found, it's critical to keep your data clean and organized through deduplication and data merging. Andre has shared some valuable learnings on this here.
Timing is crucial when searching for new founders since there is a limited window when a researched company is most relevant and being the first to identify and approach them can give you a competitive advantage over other funds.
When running an outbound outreach, the key is to ensure high personalization, deliverability, open rate, and reply rate by using the right tools and techniques.
Blacklists and Deconfliction Rules: Operating across different funds in multiple industries or countries requires implementing blacklists and deconfliction rules to ensure that you don't reach out to people you shouldn't. This includes not contacting people who have already been contacted by your colleagues.
Ensuring scalability and consistency is critical for staying competitive. The key is to implement the right tools and systems to manage and organize data and automate processes as much as possible.
The rising costs: Improving the effectiveness and automating or streamlining processes is cheaper and more effective than hiring more people.
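To make the deduplication and blacklist points above more concrete, here is a minimal sketch in Python. It is not Leadspicker's actual pipeline: the record fields (`name`, `website`, `founder`, `sources`), the choice of the website domain as the matching key, and all function names are assumptions made for illustration.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Reduce a URL to a bare lowercase domain so variants compare equal."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def merge_records(records):
    """Merge records sharing a domain (hypothetical matching key).

    The first record seen for a domain is the base; later records only
    fill in empty fields, and their 'sources' lists are combined.
    """
    merged = {}
    for rec in records:
        key = normalize_domain(rec["website"])
        if key not in merged:
            base = dict(rec)
            base["website"] = key
            base["sources"] = list(rec.get("sources", []))
            merged[key] = base
            continue
        target = merged[key]
        for field, value in rec.items():
            if field == "sources":
                for source in value:
                    if source not in target["sources"]:
                        target["sources"].append(source)
            elif field != "website" and not target.get(field):
                target[field] = value
    return list(merged.values())


def apply_blacklist(records, blocked_domains):
    """Drop companies whose domain is on the do-not-contact list."""
    blocked = {normalize_domain(d) for d in blocked_domains}
    return [r for r in records if normalize_domain(r["website"]) not in blocked]
```

In practice, matching on domain alone misses companies without a website or with several domains, so a real setup would likely combine it with fuzzy name matching and LinkedIn URLs.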
Leveraging ML to empower human intelligence
At Leadspicker, we've built a unique data-driven sourcing process that connects the dots between the latest technology and human expertise to deliver efficiency and accuracy that was not achievable a year ago.
We reviewed many novel tools, including all newly released LLMs from OpenAI and others. Our process has been refined over the years to enable us to generate deal flow from all over the world. In the next chapters, I'll uncover our process step-by-step, which is also illustrated in the picture below, and share the technology we use at each stage, along with its benefits.
Each part of our above-shown process plays an important role in identifying and connecting with the most promising projects worldwide. In the next paragraphs, we'll guide you through the startup sourcing process step-by-step, explaining where our data comes from and how we find relevant and up-to-date information including contacts on founders.
Discovering the data sources: Where to find input datasets for next-gen AI models
We gathered over 1000 data sources (find an overview by Andre here) to find new companies from around the world. I know, that's a lot of data. But with the latest developments in scraping tools such as Apify or Browse.ai, you can do it too!
One of the most important data sources for us is LinkedIn. We use it to find anyone who has recently set their job title to "founder" OR "co-founder" OR "CEO" OR "CTO", or another relevant position.
The use of LinkedIn data enables us to identify:
All new founders (< 1 month) in a given geography
Operators from your portfolio companies who have left to start something new
Diaspora or alumni from prestigious universities who are running a new company
Ex-high-value company employees who left and became founders
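The filters above can be sketched as a simple function over profile records that have already been collected. This is an illustration only, not LinkedIn's API or Leadspicker's implementation: the record fields (`title`, `title_since`, `country`, `past_employers`), the function names, and the example company names are all hypothetical.

```python
import re
from datetime import date, timedelta

# Titles taken from the text above; "cofounder" is an added spelling variant.
FOUNDER_TITLES = {"founder", "co-founder", "cofounder", "ceo", "cto"}


def is_founder_title(title):
    """Match whole words only, so 'Director' does not match 'cto'."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", title.lower())
    return any(token in FOUNDER_TITLES for token in tokens)


def new_founders(profiles, today, max_age_days=30, country=None,
                 previous_employers=None):
    """Return profiles with a recent founder-like title.

    Optional filters: a geography and a set of previous employers
    (for example, your own portfolio companies).
    """
    cutoff = today - timedelta(days=max_age_days)
    wanted = {e.lower() for e in (previous_employers or [])}
    hits = []
    for profile in profiles:
        if not is_founder_title(profile["title"]):
            continue
        if profile["title_since"] < cutoff:
            continue  # title is older than the chosen window
        if country and profile["country"] != country:
            continue
        past = {e.lower() for e in profile.get("past_employers", [])}
        if wanted and not wanted & past:
            continue
        hits.append(profile)
    return hits


# Usage with made-up records
profiles = [
    {"name": "A", "title": "Co-Founder & CEO", "title_since": date(2023, 3, 1),
     "country": "AT", "past_employers": ["ExampleCorp"]},
    {"name": "B", "title": "Sales Director", "title_since": date(2023, 3, 5),
     "country": "AT", "past_employers": []},
]
recent = new_founders(profiles, today=date(2023, 3, 16), country="AT")
```

The same function covers the alumni case if you add a hypothetical `universities` field and filter on it in the same way as `past_employers`.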
Here is the list of the most relevant data sources
Regularly monitored data sources:
LinkedIn profiles of anyone who has recently set up a job title as founder, co-founder, CEO, COO, CTO, or who is working in stealth mode in a fresh company
Major global databases: We use Crunchbase-like platforms, as well as directories such as Product Hunt and crowdfunding platforms, to discover new and emerging companies. If you don't know where to start, this could be a great starting point:
Minor (local) startup directories: We also use smaller, more localized websites that focus on specific regions or industries that might be overlooked.
Tech media: We also monitor media outlets and blogs like TechCrunch and Sifted, as well as local outlets like CzechCrunch, to stay up-to-date on the latest trending companies in the world.
Relevant Facebook groups and other communities: We leverage Facebook groups and other online communities to identify founders and other key personnel. Example: https://www.facebook.com/groups/AustrianStartupPinwall/
Ad-hoc data sources:
Conference networking apps: We use apps that are designed to facilitate networking at tech conferences, which can help us to identify new companies and connect with founders and investors. After the conference, we create a list of attendees (usually consisting of name, company name, and company type) from the app. Then we use our tools to enrich this data with the right URLs, LinkedIn profiles, contact details, and other relevant information.
Portfolio of incubators, competitions, accelerators, and other VCs: We explore the websites of these organizations to identify firms that they are working with or have invested in, which can provide valuable insights into new and promising companies.
Scraping and crawling private and public data
It's not rocket science, but it does require some technical expertise. We are fans of Python packages for web scraping, such as Scrapy or Selenium, whose documentation provides basic code snippets to get started.
Andre wrote an insightful piece on "how to scrape alternative data sources" here and you can find another helpful post on this topic here. Additionally, forums like Stack Overflow offer threads that provide complete code for scraping a specific data source. There are also many out-of-the-box scraping tools and services like Browse.ai, Apify, or Phantombuster that could be good enough for a small-scale use case.
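As a rough idea of what the parsing step of such a scraper looks like, here is a minimal sketch that pulls company names and links out of an accelerator's portfolio page. To keep it self-contained it uses Python's standard-library `html.parser` instead of Scrapy or Selenium, and it assumes the page lists each company as a plain link to an external website. The class name, function name, and example URLs are hypothetical, and fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class CompanyLinkParser(HTMLParser):
    """Collect (link text, absolute URL) pairs for links pointing off-site."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.companies = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            host = urlparse(self._href).netloc
            name = "".join(self._text).strip()
            # Keep only named links that leave the accelerator's own site.
            if host and host != self.base_host and name:
                self.companies.append((name, self._href))
            self._href = None


def extract_portfolio(html, base_url):
    parser = CompanyLinkParser(base_url)
    parser.feed(html)
    return parser.companies


# Usage with a made-up page
sample_html = """
<ul class="portfolio">
  <li><a href="https://acme.example/">Acme Robotics</a></li>
  <li><a href="/about">About us</a></li>
</ul>
"""
companies = extract_portfolio(sample_html, "https://accelerator.example/portfolio")
```

Real pages are rarely this tidy: many render their portfolio with JavaScript, which is where a browser-driving tool like Selenium comes in.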
Our data scraping process allows for a quick turnaround time of adding new data sources within just a few hours of creating a ticket. This is helpful if you need to expand your reach and identify emerging companies, for example in regions previously not well covered, such as Africa or Southeast Asia.
*The split of industries is based on 2021-22 data and is significantly influenced by the COVID-19 pandemic. ** The individual categories were chosen according to the application form at https://apply.techstars.com
What tools for scraping LinkedIn?
We all know that extracting information from LinkedIn by hand is tedious. This is where automated tools come in handy: by scheduling data collection and notifying you of any changes, they save time and effort. If you only need to cover a small scale or don't require frequent updates, tools like DuxSoup and Phantombuster can help you scrape LinkedIn data efficiently.
IMPORTANT: Automating your LinkedIn actions can get your account banned or lost. LinkedIn is not consistent about when it blocks accounts, so it is unpredictable when this will happen.