Data-driven VC #6: "Sh*t in, sh*t out" and why feature engineering is the ultimate differentiator for VCs
Where venture capital and data intersect. Every week.
👋 Hi, I’m Andre and welcome to my weekly newsletter, Data-driven VC. Every Thursday I cover hands-on insights into data-driven innovation in venture capital and connect the dots between the latest research, reviews of novel tools and datasets, deep dives into various VC tech stacks, interviews with experts and the implications for all stakeholders. Follow along to understand how data-driven approaches change the game, why it matters, and what it means for you.
Current subscribers: 1,906, +260 since last week
Disclaimer: This and the next episode will be a bit nerdier 🤓
Another week, another episode :) The last five episodes covered why VC is broken and why we should start fixing it in the sourcing and screening stage (episode#1), how we can (or rather need to) complement human-centric approaches with data-driven ones (episode#2), why a hybrid setup is the best answer to “make versus buy” and how we can leverage commercial startup databases, including a benchmark across the most prominent providers (episode#3), how we can complement this foundation with web crawlers/scrapers (episode#4) and, lastly, how we use entity matching to create a single source of truth (episode#5).
Assuming the above has been diligently implemented, we have achieved comprehensive coverage both with respect to the number of startups at the top of the funnel (identification) and with respect to the data on every individual startup (enrichment). Moreover, everything has been merged into a single source of truth, without any duplicates. The problem, however, is that enrichment data is messy. Very messy. So let’s clean up!
Data Cleaning and Feature Engineering
First off, we never change the original feature values but store them as they are in a data lake. Only then do we establish data pipelines that clean, transform and process the features. With these data pipelines, we pursue two major goals:
Prepare data to be consumed by a frontend so that it can be presented to and interacted with by a user (for manual analysis and startup exploration)
Prepare data to train and run algorithms, including NLP, classification and scoring models (to eventually cut through the noise)
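To make the raw-versus-processed separation concrete, here is a minimal sketch in Python. The paths, the file format and the company_name key column are illustrative assumptions, not a description of our actual setup:

```python
from pathlib import Path
import pandas as pd

RAW_ZONE = Path("data_lake/raw")              # original crawler output, never modified
PROCESSED_ZONE = Path("data_lake/processed")  # cleaned output for frontend/models

def run_pipeline(source: str) -> pd.DataFrame:
    """Read immutable raw records, apply cleaning steps, persist the result."""
    raw = pd.read_parquet(RAW_ZONE / f"{source}.parquet")  # raw stays untouched
    cleaned = (
        raw.copy()                            # work on a copy, never in place
           .drop_duplicates()
           .dropna(subset=["company_name"])   # hypothetical key column
    )
    PROCESSED_ZONE.mkdir(parents=True, exist_ok=True)
    cleaned.to_parquet(PROCESSED_ZONE / f"{source}.parquet", index=False)
    return cleaned
```

The point of the separation: if a cleaning step later turns out to be wrong, we can always rebuild the processed zone from the untouched raw zone.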
To achieve both, we need to apply two different techniques: data cleaning and feature engineering. “What’s the difference between data cleaning and feature engineering?” you rightfully ask. Data cleaning refers to the process of dealing with incomplete, irrelevant, corrupt or missing records in our dataset, whereas feature engineering is the process of applying domain knowledge to transform existing features or create new ones for ML model training. It’s a fine line, but at the highest level data cleaning is a process of subtraction whereas feature engineering is a process of addition. Fun fact: data scientists spend about two thirds of their time subtracting and adding, aka data cleaning and feature engineering.
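A toy example of the subtraction-versus-addition distinction (the columns and derived features below are made up for illustration):

```python
import pandas as pd

startups = pd.DataFrame({
    "name": ["Acme", "Acme", None, "Globex"],
    "founded_year": [2019, 2019, 2021, 2015],
    "employees": [12, 12, 5, 230],
})

# Data cleaning = subtraction: remove duplicate and incomplete records
cleaned = startups.drop_duplicates().dropna(subset=["name"]).copy()

# Feature engineering = addition: derive new signals from existing columns
current_year = pd.Timestamp.now().year
cleaned["company_age"] = (current_year - cleaned["founded_year"]).clip(lower=1)
cleaned["employees_per_year"] = cleaned["employees"] / cleaned["company_age"]
```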
Data cleaning and feature engineering operations depend on the feature types, so let’s look into the three major ones in our dataset:
String values are text data. We need to make sure that the text is consistent. For example, capitalization might cause problems during processing because it can change the meaning of a word or sentence, like “Bill” as a name versus “bill” as an invoice. Alongside capitalization, we should also run simple spell checkers like pyspellchecker. Following the “every startup once, no more and no less” (=single source of truth) approach from episode#5, I also prefer a single “language of truth” for all string features. As we crawl/scrape startup data from different sources across different geographies, it makes sense to translate all text data into a uniform language, English in our case.
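As an illustration, a minimal string-cleaning helper could look like the sketch below. It uses the pyspellchecker package mentioned above; the casing heuristic is a simplifying assumption, and translation into English would be a separate upstream step (e.g. via a translation API):

```python
from spellchecker import SpellChecker  # pip install pyspellchecker

spell = SpellChecker()  # defaults to an English dictionary

def clean_text(value: str) -> str:
    """Normalize a free-text feature: trim whitespace, fix obvious misspellings."""
    corrected = []
    for token in value.strip().split():
        # Spell-check the lower-cased token so that "Bill" (name) and
        # "bill" (invoice) keep their original casing in the output.
        suggestion = spell.correction(token.lower())
        if suggestion is None or suggestion == token.lower():
            corrected.append(token)       # known word (or no fix found): keep as-is
        else:
            corrected.append(suggestion)  # replace the misspelled token
    return " ".join(corrected)

clean_text("  Bill recieved the bill  ")  # e.g. -> "Bill received the bill"
```

A real pipeline would also need proper tokenization (punctuation, hyphens) and a whitelist for company and product names, which spell checkers otherwise love to “correct”.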
Other string-related issues include inconsistencies in formatting. For example, if you have a column of US dollar amounts, you might want to convert any other currency into US dollars so as to preserve a consistent standard currency. The same applies to abbreviations like “K” for thousands or “M” for millions, and to any other unit of measurement such as grams, ounces, etc. You get it.

Numerical values are the most common data type that we need to convert when cleaning our data. Numbers are often included as strings; however, in order to be processed, they need to appear as numeric values. If they appear as text, they are classed as strings, and we can neither present them in the frontend as we want (e.g. the user wants to sort numerical features ascending or descending) nor can the algorithms perform mathematical operations on them.
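A hedged sketch of such a normalization step: the regular expression, the hard-coded exchange rates and the suffix table are illustrative placeholders, and a production pipeline would pull live FX rates instead:

```python
import re

FX_TO_USD = {"$": 1.0, "€": 1.05, "£": 1.20}  # placeholder rates, not live data
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_amount(raw: str) -> float | None:
    """Turn strings like '€2.5M' or '$750K' into numeric USD values."""
    match = re.fullmatch(r"\s*([$€£])?\s*([\d.,]+)\s*([KMB])?\s*", raw)
    if not match:
        return None  # corrupt record: hand it back to the data-cleaning step
    currency, number, suffix = match.groups()
    value = float(number.replace(",", ""))       # "1,500" -> 1500.0
    value *= MULTIPLIERS.get(suffix, 1.0)        # "2.5" + "M" -> 2500000.0
    return value * FX_TO_USD.get(currency, 1.0)  # no symbol: assume USD

parse_amount("€2.5M")  # -> 2625000.0 (with the placeholder rate above)
parse_amount("$750K")  # -> 750000.0
```

Once converted, the column can be stored as a proper numeric type, which makes both frontend sorting and mathematical operations straightforward.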