Chapter 03 · May 30, 2026 · 12 min read

Backtest pitfalls

Look-ahead bias, survivorship bias, in-sample tuning, transaction-cost models. The boring discipline that separates plausible returns from real ones.

A backtest is a story about money that was never actually made. The story is useful only to the extent that the procedure used to tell it could have been executed in real time, on data that existed at the time, with realistic frictions, on a universe that was actually tradable. Almost every novice backtest fails one of those tests, and the failure modes are not exotic — they are the same six bugs, in different costumes, that have been written about for forty years. This chapter walks each in turn, shows the smallest piece of code that produces the bug, and points to the AlphaHub guard that catches it.

1. Look-ahead bias

Look-ahead bias is the use of information at time t that you could not actually have known until some t + k. It is the most common bug in toy backtests and the most embarrassing in production ones. The textbook example is pairing a signal with the return of the very bar it was computed from, simply because both rows happen to share the same date.

import pandas as pd

# WRONG: signal on date t uses prices through date t,
# but the merged return is *also* from date t. We are
# trading the same bar we observed.
prices = pd.read_parquet("close.parquet")        # DatetimeIndex
signal = (prices > prices.rolling(20).mean()).astype(float)
ret = prices.pct_change()
pnl = (signal * ret).sum(axis=1)                 # peeks into today

The fix is one line — signal.shift(1) so the position decided at the close of day t earns the return from t to t+1 — but the bug recurs in subtler forms: fundamentals stamped with their fiscal period end rather than their announcement date, analyst estimates time-stamped at consensus rather than first publication, index membership lists labelled by the date they were queried rather than the date they were effective. Lopez de Prado (2018) catalogues fifteen variants; the unifying rule is to ask, for every column you join, "what was the earliest moment a human at a Bloomberg terminal could have read this value?"
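As a minimal sketch of the one-line fix, the snippet below runs the same moving-average rule with and without the lag. The prices are synthetic random walks generated for illustration (the tickers, seed, and volatility are assumptions, not data from this chapter), so there is no real edge to find and any profit in the unlagged version comes from the peek.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (illustrative only, no real edge exists here).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=1000)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(1000, 5)), axis=0)),
    index=dates,
    columns=list("ABCDE"),
)

signal = (prices > prices.rolling(20).mean()).astype(float)
ret = prices.pct_change()

# Biased: the position decided at the close of t earns the return *into* t.
pnl_biased = (signal * ret).sum(axis=1)

# Lagged: the position decided at the close of t earns the return from t to t+1.
pnl_lagged = (signal.shift(1) * ret).sum(axis=1)

print(f"biased mean daily PnL: {pnl_biased.mean():.5f}")
print(f"lagged mean daily PnL: {pnl_lagged.mean():.5f}")
```

On a random walk the lagged version should hover around zero, while the biased version shows a steady profit that no one could have traded.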

AlphaHub guard: the /backtest engine runs strictly in chronological order — signals computed on bar t can only act on bar t+1. The validation.json artifact emitted with every backtest run flags any signal column whose timestamps lead the price column it is merged against.
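A generic check in the same spirit, and not a description of how AlphaHub implements its guard, is a truncation test: compute the signal on the full history and again on a history cut off at some row, then compare the overlapping values. A causal signal cannot change when the future is deleted. The function name leaks_future and both example signals are hypothetical.

```python
import numpy as np
import pandas as pd

def leaks_future(make_signal, prices, cut):
    """True if signal values before row `cut` change when later rows are removed."""
    full = make_signal(prices).iloc[:cut].to_numpy()
    truncated = make_signal(prices.iloc[:cut]).to_numpy()
    return not np.allclose(full, truncated, equal_nan=True)

def causal_signal(p):
    # Uses only prices up to and including the current row.
    return (p > p.rolling(20).mean()).astype(float)

def leaky_signal(p):
    # Normalises with full-sample mean and std, which include future rows.
    return (p - p.mean()) / p.std()

rng = np.random.default_rng(1)
prices = pd.Series(
    100 + np.cumsum(rng.normal(0, 1, 500)),
    index=pd.bdate_range("2023-01-02", periods=500),
)

print("causal signal leaks:", leaks_future(causal_signal, prices, cut=300))
print("leaky signal leaks: ", leaks_future(leaky_signal, prices, cut=300))
```

The test only proves the signal function is causal with respect to the rows it is given; it cannot detect a column whose timestamps were wrong before they reached you.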

2. Survivorship bias

Survivorship bias is backtesting on the universe of instruments that exist today rather than the universe that existed at each historical date. The S&P 500 of 2026 does not contain Lehman Brothers, Bear Stearns, Enron, or Washington Mutual. If your "S&P 500 from 2000" backtest uses today's constituents, you have silently dropped the names that went to zero and kept the names that compounded. Brown, Goetzmann and Ross (1995), in Survival, show that conditioning on survival biases measured performance upward; the 0.5 to 1.5 percentage points per year commonly quoted as the survivorship adjustment to mutual-fund alpha is a margin large enough to invert the sign of most claimed edges.

The fix is to use a point-in-time universe: a table that, for each date, lists the instruments tradable on that date, with the correct add/drop dates and the correct delisting returns (often -100% for bankruptcies, not "NaN" and ignored). The fix is harder than it sounds because most free data sources do not record the delistings.
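A minimal sketch of such a table and how it is applied, assuming a hypothetical membership schema (ticker, added, removed, delisting_return) and made-up tickers and returns. Here "removed" is taken to be the date on which the delisting return is realised.

```python
import numpy as np
import pandas as pd

dates = pd.bdate_range("2024-01-01", periods=6)

# Hypothetical point-in-time membership table, one row per instrument.
membership = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "added": pd.to_datetime(["2023-01-02", "2023-01-02", "2024-01-04"]),
    "removed": pd.to_datetime([None, "2024-01-05", None]),  # NaT = still listed
    "delisting_return": [np.nan, -1.0, np.nan],             # -100% bankruptcy
})

def tradable_mask(membership, dates):
    """Boolean frame: True where the ticker was in the universe on that date."""
    m = membership.set_index("ticker")
    d = dates.to_numpy()[:, None]
    added = m["added"].to_numpy()[None, :]
    removed = m["removed"].to_numpy()[None, :]
    mask = (d >= added) & (np.isnat(removed) | (d <= removed))
    return pd.DataFrame(mask, index=dates, columns=m.index)

# Vendor-style returns: the bankrupt name simply stops reporting.
raw_ret = pd.DataFrame(0.01, index=dates, columns=["AAA", "BBB", "CCC"])
raw_ret.loc["2024-01-05":, "BBB"] = np.nan

mask = tradable_mask(membership, dates)
pit_ret = raw_ret.where(mask)  # blank out returns outside membership

# Book the delisting return on the removal date instead of ignoring the NaN.
for row in membership.dropna(subset=["delisting_return"]).itertuples():
    if row.removed in pit_ret.index:
        pit_ret.loc[row.removed, row.ticker] = row.delisting_return

print(pd.DataFrame({
    "naive_equal_weight": raw_ret.mean(axis=1),
    "pit_equal_weight": pit_ret.mean(axis=1),
}))
```

The naive column never sees the bankruptcy because the missing value is skipped; the point-in-time column takes the loss on the day it happened.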

AlphaHub guard: backtests against the curated US-equity universe use a PIT membership table sourced from S&P historical constituents plus CRSP delisting returns. The backtest report shows the universe expansion as "tradable_today / tradable_then"; check it before trusting the result.

3. Overfitting and data dredging

Overfitting is the use of in-sample performance to choose parameters and then reporting the in-sample performance as the expected out-of-sample result. The pathology compounds with the number of variants you try. If you test 100 random parameter combinations on the same data, the best one will look impressive by chance alone — even if no real signal is present.
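The claim is easy to check by simulation. The sketch below generates 100 "strategies" that are pure noise (zero-mean daily returns over an assumed five years at an assumed 1% daily volatility) and reports the best annualised Sharpe among them.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_days = 100, 252 * 5

# Each row is one "strategy" with no edge at all: zero-mean daily returns.
returns = rng.normal(0.0, 0.01, size=(n_trials, n_days))
sharpes = returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)

print(f"median Sharpe across trials: {np.median(sharpes):.2f}")
print(f"best Sharpe across trials:   {sharpes.max():.2f}")
```

The median sits near zero, as it should, while the best of the hundred typically looks like a respectable strategy. Nothing in the winner's equity curve reveals that it was picked from a crowd.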

Harvey, Liu and Zhu (2016), in ...and the Cross-Section of Expected Returns, argue that the standard t > 2.0 publication threshold is wildly insufficient once you account for the unknown number of factors that were tested but never published. They propose adjusted thresholds of t > 3.0 for well-cited tests and t > 3.78 for new tests. Lopez de Prado (2018, ch. 12) operationalises this with the deflated Sharpe ratio, which discounts an observed Sharpe ratio for the number of independent trials that were run to find it.
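A sketch of the deflated Sharpe ratio, following the form given by Bailey and Lopez de Prado: the benchmark is the Sharpe ratio you would expect from the best of N trials with no skill, and the result is the probability that the observed Sharpe beats that benchmark. Sharpe ratios here are per period (not annualised), sharpe_var is the variance of Sharpe estimates across trials, and the example inputs are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_norm = NormalDist()

def expected_max_sharpe(n_trials, sharpe_var):
    """Expected best per-period Sharpe among n_trials (>= 2) skill-free trials."""
    a = _norm.inv_cdf(1.0 - 1.0 / n_trials)
    b = _norm.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_var) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_var, skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe exceeds the best-of-N noise benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_var)
    denom = math.sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2)
    return _norm.cdf((sharpe - sr0) * math.sqrt(n_obs - 1) / denom)

# Same observed result (daily Sharpe 0.10 over 1260 days), different search effort.
n_obs = 1260
for n_trials in (10, 100, 1000):
    dsr = deflated_sharpe(0.10, n_obs, n_trials, sharpe_var=1.0 / n_obs)
    print(f"{n_trials:>5} trials -> deflated Sharpe {dsr:.3f}")
```

The observed number never changes; only the amount of searching behind it does, and the deflated value falls accordingly.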

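# A naive grid search that will overfit.
# run_backtest is a placeholder for your own backtest function; it is
# assumed to return a dict containing a "sharpe" entry.
best = float("-inf")
best_params = None
for lookback in range(5, 250):
    for threshold in [0.0, 0.001, 0.005, 0.01]:
        sharpe = run_backtest(lookback, threshold)["sharpe"]
        if sharpe > best:
            best = sharpe
            best_params = (lookback, threshold)
# 245 lookbacks x 4 thresholds = 980 trials on the same data:
# best_params will look great in-sample and be noise out-of-sample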

Disciplines that help: hold out the last 20% of the sample untouched, write down a small number of ex ante candidate parameters before running anything, prefer round-number choices (lookback = 20, not 247) because they are harder to curve-fit to, and apply the deflated-Sharpe correction.
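A minimal sketch of the first two disciplines together, using a toy moving-average rule on synthetic prices. The helper sma_sharpe, the candidate list, and the data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def sma_sharpe(prices, lookback):
    """Annualised Sharpe of a long-only price-above-SMA rule, lagged one bar."""
    signal = (prices > prices.rolling(lookback).mean()).astype(float).shift(1)
    pnl = (signal * prices.pct_change()).dropna()
    sd = pnl.std()
    return 0.0 if sd == 0 else float(pnl.mean() / sd * np.sqrt(252))

rng = np.random.default_rng(7)
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, 2000))),
    index=pd.bdate_range("2015-01-01", periods=2000),
)

# Chronological split: the last 20% is never looked at during selection.
cut = int(len(prices) * 0.8)
train, holdout = prices.iloc[:cut], prices.iloc[cut:]

candidates = [10, 20, 50, 100, 200]  # written down before running anything
best = max(candidates, key=lambda lb: sma_sharpe(train, lb))

print(f"chosen lookback: {best}")
print(f"in-sample Sharpe: {sma_sharpe(train, best):.2f}")
print(f"holdout Sharpe:   {sma_sharpe(holdout, best):.2f}")  # evaluate once
```

The holdout number is only honest the first time you look at it. Going back to adjust the candidates after seeing it turns the holdout into more in-sample data.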

AlphaHub guard: when you template a strategy with parameter sweeps, the workspace reports both the raw Sharpe and the deflated Sharpe in the metrics card; if the deflated value is below 0.5 the run is flagged as "likely overfit."

4. Transaction-cost misestimation

A strategy with a Sharpe of 2.0 at zero costs can become a Sharpe of -1.0 at realistic costs. The components are commission (small and falling), spread (visible on the order book), market impact (rises with order size and falls with liquidity), and opportunity cost (the price moves while you wait to be filled). Almgren and Chriss (2000) decompose the impact term into a temporary part proportional to participation rate and a permanent part proportional to size; their model is the starting point for every modern execution algorithm.

A reasonable rule of thumb for liquid US large-cap equity at modest size is 3 to 8 basis points per round trip. For small-cap, 20 to 50 bps is plausible. For emerging markets or low-liquidity crypto, 50 to 200 bps is realistic. A backtest that assumes 0 bps is reporting fantasy. The cheap-and-correct upgrade is a constant cost per round trip set conservatively; the expensive-and-better upgrade models cost as a function of order size relative to average daily volume.
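Both upgrades can be sketched in a few lines. The first function charges a constant cost on turnover, assuming weights are already lagged and that one unit of one-way turnover costs half the round-trip figure. The second uses a square-root-of-participation form, a common practitioner heuristic that is not taken from this chapter; its coefficients are placeholders, not calibrated values.

```python
import numpy as np
import pandas as pd

def gross_and_net(weights, asset_returns, round_trip_bps):
    """Portfolio returns before and after a constant cost on turnover."""
    gross = (weights * asset_returns).sum(axis=1)
    turnover = weights.diff().abs().sum(axis=1)     # one-way, per rebalance
    turnover.iloc[0] = weights.iloc[0].abs().sum()  # cost of the initial build
    cost = turnover * (round_trip_bps / 2.0) / 1e4
    return gross, gross - cost

def impact_bps(order_value, adv_value, base_bps=5.0, k_bps=10.0):
    """Illustrative size-dependent cost: base plus k * sqrt(order / ADV)."""
    return base_bps + k_bps * np.sqrt(order_value / adv_value)

# A strategy that flips its whole book between two assets every day.
idx = pd.bdate_range("2024-01-01", periods=4)
weights = pd.DataFrame([[1, 0], [0, 1], [1, 0], [0, 1]],
                       index=idx, columns=["X", "Y"], dtype=float)
asset_returns = pd.DataFrame(0.001, index=idx, columns=["X", "Y"])

gross, net = gross_and_net(weights, asset_returns, round_trip_bps=10)
print(pd.DataFrame({"gross": gross, "net": net}))
print("cost at 1% of ADV:", impact_bps(1e6, 1e8), "bps")
```

In this toy case a 10 bps gross daily return is consumed entirely by a 10 bps round-trip cost once the book is turning over fully, which is the Sharpe-2-to-nothing effect in miniature.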

AlphaHub guard: the backtest engine takes a cost_bps argument and applies it to turnover at every rebalance; the default for the US large-cap template is 5 bps, raised to 25 bps for the Russell 2000 template, and the metrics card always shows gross and net Sharpe side by side so the cost wedge is visible.

5. Regime change

A regime is a period during which the statistical properties of the market are roughly stable. Cross-asset correlations, equity volatility, the level of interest rates, and liquidity all shift, and a strategy trained on one regime may fail in the next. A momentum strategy trained on 2010-2019 (low vol, persistent trends, ZIRP) and traded in 2020 (vol explosion, momentum crash in March, regime shift in April) is a canonical example. The 2020 momentum drawdown was the largest single-month loss for the academic momentum factor since 2009.

Quick diagnostic: refit nothing, but report performance by sub-period. If the strategy earned a Sharpe of 1.8 in 2014-2019 and a Sharpe of 0.1 in 2020-2024, the headline "Sharpe of 1.1 across the full sample" is misleading — the strategy probably stopped working.
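The diagnostic needs nothing more than a slice and a ratio. The sketch below builds a synthetic daily return series whose drift is pinned to be strong before 2020 and nearly flat afterwards (the drifts, volatility, and period labels are assumptions for illustration), then reports Sharpe by sub-period alongside the full-sample figure.

```python
import numpy as np
import pandas as pd

def annualised_sharpe(r, periods_per_year=252):
    return float(r.mean() / r.std() * np.sqrt(periods_per_year))

def sharpe_by_period(returns, periods):
    """periods maps a label to an inclusive (start, end) date pair."""
    return {name: annualised_sharpe(returns.loc[start:end])
            for name, (start, end) in periods.items()}

# Synthetic strategy returns: sample mean pinned within each sub-period.
rng = np.random.default_rng(3)
idx = pd.bdate_range("2014-01-01", "2024-12-31")
noise = pd.Series(rng.normal(0, 0.01, len(idx)), index=idx)
early = idx < pd.Timestamp("2020-01-01")
returns = noise.copy()
returns[early] = noise[early] - noise[early].mean() + 0.0011
returns[~early] = noise[~early] - noise[~early].mean() + 0.00005

report = sharpe_by_period(returns, {
    "2014-2019": ("2014-01-01", "2019-12-31"),
    "2020-2024": ("2020-01-01", "2024-12-31"),
    "full": ("2014-01-01", "2024-12-31"),
})
for name, value in report.items():
    print(f"{name:>10}: Sharpe {value:.2f}")
```

The full-sample figure lands between the two sub-period figures and describes neither of them.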

AlphaHub guard: split the sample manually and compare. AlphaHub does not auto-detect regime change — there is no built-in regime classifier — but the workspace makes it cheap to run the same strategy across two date ranges and inspect the metrics deltas.

6. Hindsight selection (meta-overfitting)

This is the bias that survives all the others. Even if you fix look-ahead, survivorship, parameter overfitting, costs, and regime — you, the researcher, are still selecting among many candidate strategies and pursuing the ones that already happen to have worked. The strategy you eventually ship is, by construction, one whose backtest looked good. This is overfitting at the level of the research process, not the individual backtest.

Disciplines that help: write down the hypothesis before you run the backtest, keep a research log of every strategy you ran (including the failures), and treat your reported numbers as conditional on the implicit search you did to find them. Lopez de Prado (2018, ch. 11) frames this as the probability of backtest overfitting and gives a combinatorial procedure to estimate it.
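A compact sketch of a combinatorial procedure of this kind, assuming the research log has been turned into a matrix with one column of returns per strategy tried. It is a simplified reading of the combinatorially symmetric cross-validation idea, not a reproduction of the book's code: split the rows into blocks, use every half of the blocks as in-sample, pick the in-sample winner, and see where it ranks out-of-sample. The function name and test data are hypothetical.

```python
import itertools
import numpy as np

def sharpe_cols(x):
    """Per-period Sharpe of each column."""
    return x.mean(axis=0) / x.std(axis=0, ddof=1)

def pbo_cscv(returns, n_blocks=8):
    """Estimate the probability of backtest overfitting from a (T, N) matrix."""
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)
    n_strats = returns.shape[1]
    logits = []
    for is_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        is_rows = np.concatenate([blocks[b] for b in is_ids])
        oos_rows = np.concatenate([blocks[b] for b in oos_ids])
        best = np.argmax(sharpe_cols(returns[is_rows]))  # in-sample winner
        oos_sr = sharpe_cols(returns[oos_rows])
        rank = (oos_sr <= oos_sr[best]).sum()            # N means best out-of-sample
        omega = rank / (n_strats + 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    # Share of splits where the winner lands at or below the OOS median.
    return float((logits <= 0).mean()), logits

rng = np.random.default_rng(11)
noise = rng.normal(0, 0.01, size=(1000, 20))  # twenty strategies, no skill
with_edge = noise.copy()
with_edge[:, 0] += 0.005                      # one strategy with a real edge

pbo_noise, logits_noise = pbo_cscv(noise)
pbo_edge, _ = pbo_cscv(with_edge)
print(f"PBO, all noise:     {pbo_noise:.2f}")
print(f"PBO, one real edge: {pbo_edge:.2f}")
```

The estimate is only as honest as the matrix: strategies that were tried and never logged do not appear in it.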

There is no software fix for hindsight selection. The only defence is intellectual honesty about how many things you tried before you found the one you are reporting.

The 5-bps test

Before you trust any backtest, raise the cost assumption by 50% (from 5 bps to 7.5 bps, say) and rerun. If the Sharpe degrades by more than 30%, the strategy is fragile to costs and probably will not survive contact with real markets. If the Sharpe collapses (drops by more than 60%), you are trading noise.
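The rule is mechanical enough to write down. In the sketch below, cost_stress_label and its labels are hypothetical, and sharpe_at_cost stands in for whatever function reruns your backtest at a given cost; here it is a made-up linear relationship so the example is self-contained.

```python
def cost_stress_label(sharpe_base, sharpe_stressed):
    """Classify a strategy by how much Sharpe it loses when costs rise 50%."""
    if sharpe_base <= 0:
        return "no edge at base costs"
    drop = (sharpe_base - sharpe_stressed) / sharpe_base
    if drop > 0.60:
        return "noise"
    if drop > 0.30:
        return "fragile"
    return "robust"

def sharpe_at_cost(cost_bps):
    # Placeholder for a real backtest rerun; an assumed linear cost sensitivity.
    return 1.5 - 0.1 * cost_bps

base_cost = 5.0
base = sharpe_at_cost(base_cost)
stressed = sharpe_at_cost(base_cost * 1.5)
print(f"Sharpe {base:.2f} -> {stressed:.2f}: {cost_stress_label(base, stressed)}")
```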

Try it in AlphaHub

Compare regime stability of a simple strategy.

Run sma_crossover on SPY for 2010-01-01 to 2014-12-31 and 2020-01-01 to 2024-12-31 separately. Report Sharpe, max drawdown, and turnover for each, and tell me whether the strategy's performance was stable across the two regimes.


References

Almgren, R. and Chriss, N. (2000). Optimal execution of portfolio transactions. Journal of Risk, 3(2), 5–39.
Brown, S., Goetzmann, W., and Ross, S. (1995). Survival. Journal of Finance, 50(3), 853–873.
Harvey, C., Liu, Y., and Zhu, H. (2016). ...and the cross-section of expected returns. Review of Financial Studies, 29(1), 5–68.
Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapters 11, 12, and 15.
