With the continued emergence of big data and data science, researchers are uncovering new insights as never before. However, the five common pitfalls described below can lead to incorrect conclusions.
1. Measurement Error and Behavioral Regimes
Data science and quantitative research need data to test hypotheses empirically, and generally speaking, the more data the better. However, it is important to be aware that the system itself can change over time. Old measurements may be stale and no longer representative of the way the system behaves now or, more importantly, will behave in the future. Recognizing such regime changes requires subject matter expertise.
In a quantamental research piece, "Late to File: The costs of delayed 10-Q and 10-K company filings", the S&P Global Market Intelligence Quantamental Research Team ran their model over the entire dataset (1994-2015), but also performed a sub-period analysis to understand the impact of the Sarbanes-Oxley Act (commonly called SOX) on the economic hypothesis. By also running the analysis without some of the older data, the team provided evidence that they were not capturing a stale anomaly.
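A sub-period check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual method: the function name `subperiod_means`, the break year, and the yearly figures are all hypothetical.

```python
from statistics import mean


def subperiod_means(metric_by_year, break_year):
    """Average a yearly metric over the full sample and on each side of a break year."""
    pre = [v for year, v in metric_by_year.items() if year < break_year]
    post = [v for year, v in metric_by_year.items() if year >= break_year]
    return {
        "full": mean(pre + post),
        "pre": mean(pre),
        "post": mean(post),
    }


# Made-up yearly effect sizes, for illustration only.
toy_metric = {2000: 4.0, 2001: 6.0, 2002: 5.0, 2003: 1.0, 2004: 2.0, 2005: 0.0}

# The break year is a hypothetical choice; in practice it comes from subject matter expertise.
regime_summary = subperiod_means(toy_metric, break_year=2003)
print(regime_summary)
```

In this toy data the full-sample average hides a sharp drop after the break, which is the kind of stale effect a sub-period analysis is meant to expose.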
2. Sampling Error and Extrapolation Error
Sampling a system is typically the only feasible option to explore a hypothesis. Collecting every observation may not be possible, or data quality for some observations may be diminished. When the sampling is random, the expectation is that the sample is a good estimate of the whole population. However, sampling is often biased, and exploring the impact of that bias is a vital step.
In one of our quantamental research pieces, "Bridges for Sale: Finding Value in Sell-Side Estimates", we performed testing within the Russell 1000 (large cap) and the Russell 2000 (small cap) to explore the impact of a size bias. While there were some performance differences, the researcher’s economic hypothesis was consistent across these two universes.
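One way to probe a suspected sampling bias is to measure the same effect separately within each sub-universe and check that it points the same way. The sketch below assumes a simple "signal versus no signal" return spread; the function `effect_by_universe`, the labels, and the returns are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def effect_by_universe(rows):
    """rows: iterable of (universe, has_signal, forward_return).

    Returns the mean return of signal names minus non-signal names, per universe.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for universe, has_signal, forward_return in rows:
        buckets[universe][bool(has_signal)].append(forward_return)
    return {u: mean(b[True]) - mean(b[False]) for u, b in buckets.items()}


# Made-up observations, for illustration only.
toy_rows = [
    ("large", True, 0.03), ("large", True, 0.01),
    ("large", False, 0.00), ("large", False, 0.01),
    ("small", True, 0.06), ("small", True, 0.04),
    ("small", False, 0.01), ("small", False, -0.01),
]

spreads = effect_by_universe(toy_rows)

# The hypothesis is "consistent" here if the spread has the same sign in every universe.
same_sign = all(s > 0 for s in spreads.values()) or all(s < 0 for s in spreads.values())
print(spreads, same_sign)
```

The spreads can differ in size between universes, as they do here; the check only asks whether the direction of the effect survives the split.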
3. Omitted Variable Bias
While it is impossible to ‘know what you don’t know’, a good researcher is always on the hunt for other variables that could explain an outcome. For example, in one of our popular papers, "Natural Language Processing – Part II: Stock Selection", the author addresses the obvious question: could positive executive sentiment just be a proxy for positive earnings surprises? The researcher dedicated an entire section of the paper to controlling for commonly used alpha and risk signals published in the literature, and showed that sentiment- and behavior-based NLP signals add incremental predictive power to the model.
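Controlling for a candidate variable can be illustrated by "partialling out": regress both the outcome and the signal on the control, then regress the residuals on each other. The sketch below is a generic textbook technique, not the paper's procedure, and the series names and values (`sentiment`, `surprise`, `forward_return`) are made up so that the true incremental effect of sentiment is 0.5.

```python
from statistics import mean


def slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def residuals(y, x):
    """Residuals of y after removing its linear fit on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def partial_slope(y, x, control):
    """Slope of y on x after removing the linear effect of `control` from both."""
    return slope(residuals(x, control), residuals(y, control))


# Made-up data: returns depend on the earnings surprise AND on sentiment.
surprise = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sentiment = [2.0, 1.0, 2.0, 5.0, 6.0, 5.0]  # correlated with surprise
forward_return = [2.0 * z + 0.5 * s for z, s in zip(surprise, sentiment)]

naive = slope(sentiment, forward_return)  # ignores the control
controlled = partial_slope(forward_return, sentiment, surprise)
print(round(naive, 3), round(controlled, 3))
```

The naive slope is inflated because sentiment partly proxies for the surprise; after controlling, the estimate falls back to the incremental effect built into the toy data. A controlled slope near zero would instead suggest the signal was only a proxy.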
4. Cyclicity and Detrending
Company revenues and other operating metrics experience regular peaks and valleys. Business cycles play out over many years, while shorter annual cycles also apply to most firms. To address annual cyclicity, one of our Quantamental Researchers compared recent values to measurements made over the same period one year earlier in the research piece "Forging Stronger Links: Using Supply Chain Data in the Investing Process".
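A year-over-year comparison of this sort can be sketched as follows. The function name `year_over_year`, the quarterly frequency, and the revenue figures are assumptions made for illustration.

```python
def year_over_year(values, periods_per_year=4):
    """Fractional change of each value versus the same period one year earlier."""
    return [
        values[i] / values[i - periods_per_year] - 1.0
        for i in range(periods_per_year, len(values))
    ]


def period_over_period(values):
    """Fractional change of each value versus the immediately preceding period."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


# Made-up quarterly revenue with a strong seasonal pattern and steady underlying growth.
quarterly_revenue = [100.0, 120.0, 90.0, 150.0, 110.0, 132.0, 99.0, 165.0]

yoy = year_over_year(quarterly_revenue)
qoq = period_over_period(quarterly_revenue)
print([round(v, 3) for v in yoy])
print([round(v, 3) for v in qoq])
```

In this toy series the quarter-over-quarter changes swing widely with the season, while the year-over-year changes are steady, because each quarter is compared only with its own seasonal counterpart.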
5. Causation vs. Correlation and Availability Bias
How often does a researcher’s prior expectation drive the findings in their research? If they took a fresh perspective, would they have drawn the same conclusions?
The challenge is this: if a significant coincident or lead-lag relationship is found between two variables, how can we know whether that relationship is mere coincidence (spurious) or whether the connection is causal? While there is no way to guarantee that a relationship is not spurious outside of a laboratory setting, two approaches help:
- Make sure that an economic intuition exists for the thesis being tested and establish that intuition / thesis before processing any data.
- Test the correlation on data that were not included in the original analysis (out-of-sample testing). For example, in a recent work, "Value and Momentum: Everywhere But Not All The Time", the research team took a relationship found in U.S. equities and tested a similar approach in six different geographies, in two size sub-universes, and across sub-periods, while also controlling for other factors.
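The out-of-sample idea in the second approach can be sketched as below: measure the relationship on one part of the data, then check whether it keeps its sign and a reasonable strength on data held back from the original analysis. The function names, the split point, the `min_abs_corr` threshold of 0.3, and the data are all hypothetical choices for illustration; passing such a check supports a relationship but does not prove causation.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def out_of_sample_check(x, y, split, min_abs_corr=0.3):
    """Compare the correlation before `split` (in-sample) with the one after it."""
    in_corr = pearson(x[:split], y[:split])
    out_corr = pearson(x[split:], y[split:])
    same_sign = in_corr * out_corr > 0
    # min_abs_corr is an arbitrary illustrative threshold, not a recommended value.
    holds = same_sign and abs(out_corr) >= min_abs_corr
    return in_corr, out_corr, holds


# Made-up factor exposures and subsequent returns, for illustration only.
factor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
returns = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8]

in_corr, out_corr, holds = out_of_sample_check(factor, returns, split=4)
print(round(in_corr, 3), round(out_corr, 3), holds)
```

The same function could be pointed at a different geography or size universe instead of a later time window; the principle is only that the test data played no part in forming the hypothesis.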