Data-driven VC #4: How to scrape alternative data sources?
Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 1,443, +108 since last week
Quick recap: In the last episode, I covered make versus buy for data collection and elaborated on why a hybrid setup is the way to go. I also shared my database benchmarking study, which identifies Crunchbase as the best value-for-money provider. With this foundation in place, let’s now explore how to complement it with further identification and enrichment sources.
The universe of alternative data sources is seemingly endless. Thankfully, my friend Francesco has created a list (screenshot above) and classified all providers into traditional data, alternative data, research, dealflow/CRM systems, matchmaking tools, portfolio management tools, news resources, scouting sources, LP tools, liquidity instruments and infrastructure. Certainly not comprehensive, but a great starting point. With respect to the sourcing and screening stages of the VC value chain, the highlighted providers/websites above are the most relevant ones.
How to get access to alternative data sources?
Let’s start by splitting the alternative data sources into three categories, based on their availability (paywall or not) and whether it makes more sense to scrape them or simply buy access:
Paywall and unreasonable to scrape/crawl: Providers in this bucket require a paid subscription, but in many cases the subscription comes with free data dumps or API access. Even if API access comes at a premium, it’s usually cheaper than scraping the respective data. Examples include the benchmarked database providers from my previous episode, but also alternative sources like SimilarWeb, Semrush or SpyFu. It’s easy to buy them and economically unreasonable to scrape them.
Paywall (=private) and reasonable to scrape/crawl: Many providers limit their offering to subscribed users, so little to no data is available to outsiders. Examples include private LinkedIn profiles (with extended experience, skills, recommendations etc.) that are only accessible via subscriptions like LinkedIn Sales Navigator, or news providers like TechCrunch, Sifted or the Financial Times that hide (parts of) their content behind a paywall.
No paywall (=public) and reasonable to scrape/crawl: This bucket includes all sources that can be accessed without a subscription, i.e. publicly available data. An easy way to find out whether a specific source falls into this bucket is to open your browser in incognito mode, visit the website of interest and explore which data you can see without logging in (see the sketch below for an automated version of this check). Examples with high-value public data include GitHub, Product Hunt, public LinkedIn profiles, Twitter, and public registers like the Handelsregister or Companies House.
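To make the incognito test a bit more systematic, here is a minimal Python sketch of my own (not tied to any specific provider) that fetches a page without any login or cookies and checks whether the content you actually care about is served. The URL and keyword are placeholders:

```python
# Quick sketch: check whether a page serves meaningful content without any
# login or cookies -- a rough, automated version of the incognito-mode test.
# The URL and keyword below are placeholders; swap in the source you care about.
import requests

def is_publicly_accessible(url: str, expected_keyword: str) -> bool:
    """Fetch a page with no session/cookies and check for expected content."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; research-bot)"},
        timeout=10,
    )
    # A 200 alone is not enough: paywalled pages often return 200 with a teaser,
    # so also check that the content we actually want is present.
    return response.status_code == 200 and expected_keyword.lower() in response.text.lower()

if __name__ == "__main__":
    print(is_publicly_accessible("https://example.com", "Example Domain"))
```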
Obviously, all data in category 1 should be bought and integrated as a data dump or via an API. The big question mark is rather around categories 2 (private) and 3 (public).
How to scrape/crawl private and public data?
Some terminology upfront: Many people, including myself, use “scraping” and “crawling” interchangeably, but there’s actually a subtle difference: web scraping is the targeted extraction of specific data from a specific website, whereas web crawling is the open-ended collection of whatever data a website exposes, typically by following links. Linking this to the terminology from episode #2, open-ended crawling is theoretically more relevant for identification (you never know where an interesting startup might show up, right?), whereas scraping is more relevant for enrichment (you know exactly what you’re looking for).
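To make the distinction concrete, here is a minimal sketch (assuming requests and BeautifulSoup; the URLs and the CSS selector are placeholders): the crawler simply follows whatever links it finds, while the scraper pulls one specific field from one specific page.

```python
# Minimal sketch of the crawl-vs-scrape distinction using requests + BeautifulSoup.
# URLs and the CSS selector are illustrative placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url: str, max_pages: int = 20) -> set:
    """Crawling: open-endedly follow links and collect whatever pages show up."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        queue.extend(urljoin(url, a["href"]) for a in soup.find_all("a", href=True))
    return seen

def scrape(url: str, css_selector: str) -> list:
    """Scraping: pull one specific piece of data from one specific page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(css_selector)]
```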
But anyway, let’s leave the academic nuances aside and jump right in. There are two components to web scraping at scale: the scraping algorithm itself and its execution.
Scraping algorithm (x-axis in the figure below): No need to reinvent the wheel. Python packages like BeautifulSoup, Selenium and Scrapy (typically combined with XPath or CSS selectors) provide the basic building blocks; see a great post on this here. Stack Overflow and other forums even provide threads that contain the full code for scraping a specific data source. At its core, a scraper emulates human behavior and executes data extraction and button clicks in an automated way. It’s really not rocket science.
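To illustrate the “emulate human behavior” part, here is a hedged Selenium sketch. The URL, the “Load more” button and the CSS classes are hypothetical; any real source needs its own selectors, and static pages are often easier to handle with plain requests + BeautifulSoup.

```python
# Illustrative Selenium sketch: emulate a human who opens a page, clicks a
# (hypothetical) "Load more" button a few times, and reads out the results.
# The URL and selectors are placeholders; real sources need their own.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()  # requires a local Chrome installation
try:
    driver.get("https://example.com/startups")  # hypothetical listing page
    for _ in range(3):  # click "Load more" a few times, like a human would
        try:
            driver.find_element(By.CSS_SELECTOR, "button.load-more").click()
            time.sleep(2)  # give the page time to render the next batch
        except NoSuchElementException:
            break  # button gone -> nothing left to load
    names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "h3.company-name")]
    print(names)
finally:
    driver.quit()
```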