Drift Detection for Live Trading Strategies: Comparing Broker Performance Against Backtest Baselines

A backtest is a promise. It says: under these conditions, over this data, this strategy produces these returns, this drawdown, this Sharpe. The promise holds only as long as the market regime matches the training window. When it does not -- when live performance diverges from the baseline -- you need a systematic answer to the question: is this strategy still the same strategy?

The DriftEngine answers that question. It compares live broker execution metrics against the backtest baseline that earned the strategy its promote ticket, and returns one of four recommendations: continue, monitor, pause, or insufficient data. No dashboards to watch, no ad hoc threshold changes, no subjective calls at 3 AM when a position moves against you.

What Drift Measures

The engine evaluates five metrics, each against a configurable threshold:

1. Trade count. A strategy that opened 200 trades in the backtest but only 12 live might not have seen enough market conditions for a fair comparison. The engine requires a minimum observation count before generating a non-trivial recommendation.

2. Win rate drift. The ratio of winning trades to total trades. A strategy that backtested with 65% wins but wins only 40% live is behaving differently. The threshold is an absolute percentage point delta, not a ratio -- a 10 percentage point drop from 65% to 55% counts the same as one from 80% to 70%.

3. Profit factor drift. Gross profit divided by gross loss. Below 1.0 means the strategy is losing more than it makes. The engine compares the live profit factor against the baseline and flags divergence beyond a configured ratio.

4. Max drawdown breach. The most important metric for survival. If the live peak-to-trough drawdown exceeds the backtest maximum by a configurable multiplier (default 1.5x), the engine recommends pausing. With the default, a strategy that survived a 15% drawdown in backtest is paused once its live drawdown passes 22.5%, rather than being allowed to hit 30% before someone notices.

5. Average trade duration drift. A subtle signal. If the strategy held positions for 4 hours on average in backtest but now holds for 8 hours live, the execution pattern has shifted even if P&L looks fine. Duration drift often precedes P&L degradation.
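
The comparisons above reduce to a few lines of arithmetic. The following is a minimal sketch, not the engine's actual code: MetricsSnapshot, DriftChecks, and every field and method name in it are hypothetical stand-ins, since the article does not show the fields of BacktestRunMetrics. It also assumes win rate drift is counted in both directions, which the text does not state.

```java
/** Hypothetical metric snapshot; the real BacktestRunMetrics fields are not shown in the article. */
record MetricsSnapshot(int tradeCount, double winRate, double profitFactor,
                       double maxDrawdown, double avgTradeDurationHours) {}

/** Illustrative drift calculations; names and signatures are assumptions. */
final class DriftChecks {
    private DriftChecks() {}

    /** Absolute win rate change in percentage points: 0.65 -> 0.55 is 10 points. */
    static double winRateDriftPoints(MetricsSnapshot baseline, MetricsSnapshot live) {
        return Math.abs(baseline.winRate() - live.winRate()) * 100.0;
    }

    /** Live profit factor as a fraction of baseline: 0.5 means half the baseline. */
    static double profitFactorRatio(MetricsSnapshot baseline, MetricsSnapshot live) {
        return ratio(live.profitFactor(), baseline.profitFactor());
    }

    /** Live max drawdown as a multiple of baseline: 2.0 means twice as deep. */
    static double drawdownMultiple(MetricsSnapshot baseline, MetricsSnapshot live) {
        return ratio(live.maxDrawdown(), baseline.maxDrawdown());
    }

    /** Live average holding time as a multiple of baseline. */
    static double durationRatio(MetricsSnapshot baseline, MetricsSnapshot live) {
        return ratio(live.avgTradeDurationHours(), baseline.avgTradeDurationHours());
    }

    /** Assumption: a non-positive baseline is rejected instead of producing NaN or infinity. */
    private static double ratio(double live, double baseline) {
        if (baseline <= 0.0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return live / baseline;
    }
}
```

Each helper returns a raw number and leaves the comparison against a threshold to the caller, so the same calculation can serve both the recommendation and the operator's report.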

How the Recommendation Works

The engine does not use a weighted score or ML model. It uses a decision matrix, evaluated top to bottom, where the first matching rule wins:

If the deployment label is not broker-backed (e.g. simulation only): INSUFFICIENT_DATA -- drift requires real execution.
If no broker observations exist: INSUFFICIENT_DATA -- wait.
If fewer than minObservationCount trades: INSUFFICIENT_DATA -- wait.
If max drawdown exceeds the multiplier: PAUSE -- stop executing immediately.
If two or more metrics exceed their drift thresholds: PAUSE.
If exactly one metric exceeds its threshold: MONITOR -- keep running, flag for review.
If all metrics are within bounds but the trade count is still low: MONITOR -- keep running, accumulate data.
If all metrics are within bounds and there are sufficient trades: CONTINUE -- no action needed.
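
The rules above can be read as a chain of early returns. The sketch below is one possible reading, not the engine's implementation. DriftSignals, DecisionMatrix, and their fields are hypothetical. It assumes the rules are checked in order with the deployment label first, that the drawdown breach is tracked separately from the other metrics, and that "trade count is low" arrives as a precomputed flag, because the article does not define it.

```java
/** The four automated outcomes named in the article. */
enum Recommendation { INSUFFICIENT_DATA, CONTINUE, MONITOR, PAUSE }

/** Hypothetical pre-computed signals; none of these names come from the real engine. */
record DriftSignals(boolean brokerBacked, int brokerObservationCount, int tradeCount,
                    boolean drawdownBreached, int otherMetricsOverThreshold,
                    boolean tradeCountLow) {}

final class DecisionMatrix {
    private DecisionMatrix() {}

    /** First matching rule wins (an assumption about ordering). */
    static Recommendation recommend(DriftSignals s, int minObservationCount) {
        // Drift only makes sense against real broker execution.
        if (!s.brokerBacked()) return Recommendation.INSUFFICIENT_DATA;
        // Nothing observed yet, or too few trades to judge.
        if (s.brokerObservationCount() == 0) return Recommendation.INSUFFICIENT_DATA;
        if (s.tradeCount() < minObservationCount) return Recommendation.INSUFFICIENT_DATA;
        // A drawdown breach pauses on its own.
        if (s.drawdownBreached()) return Recommendation.PAUSE;
        // Two or more drifting metrics pause; exactly one only flags for review.
        if (s.otherMetricsOverThreshold() >= 2) return Recommendation.PAUSE;
        if (s.otherMetricsOverThreshold() == 1) return Recommendation.MONITOR;
        // Everything in bounds: keep watching while the sample is thin.
        if (s.tradeCountLow()) return Recommendation.MONITOR;
        return Recommendation.CONTINUE;
    }
}
```

Writing the matrix as ordered early returns keeps each rule on one line, which makes it easy to check the code against the list by eye.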

There is no RETIRE recommendation in the automated path. Retirement is a human decision based on the accumulated evidence. The engine surfaces the data; the operator decides.

Implementation Details

The DriftEngine lives in the trading-runtime module of the trading-bridge monorepo. It receives a StrategyDriftInput record containing:

The strategy ID and current deployment label
An optional baseline BacktestRunMetrics (the snapshot from the promote gate)
An optional baseline config hash (to detect config divergence)
A list of broker observations -- each one a BrokerObservation record with run ID, label, start time, config hash, metrics, and trade count

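public DriftEvaluation evaluate(StrategyDriftInput input) {
    // Prefer the caller-supplied evaluation time; fall back to the current time.
    Instant now = input.evaluatedAt() != null ? input.evaluatedAt() : Instant.now();

    // A known deployment label that is not broker-backed (e.g. simulation only)
    // short-circuits: there is no real execution to compare against the baseline.
    if (input.deploymentLabel().isPresent()
            && !input.deploymentLabel().get().isBrokerBacked()) {
        return insufficient(input.strategyId(), input.deploymentLabel(),
            "Drift requires broker execution deployment", now);
    }

    // Keep only the observations that came from broker-backed runs.
    List<BrokerObservation> brokerRuns = input.brokerObservations().stream()
        .filter(obs -> obs.label().isBrokerBacked())
        .toList();
    // ...
}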

The thresholds are loaded from a DriftThresholds record with sensible defaults:

minObservationCount: 10 trades before evaluation
maxWinRateDriftPct: 15 percentage points
maxProfitFactorDriftRatio: 0.5 (live PF must be at least 50% of baseline)
maxDrawdownMultiplier: 1.5x of baseline max drawdown
maxAvgTradeDurationDriftRatio: 2.0x of baseline
evaluationWindowDays: 30-day rolling window
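
As a rough illustration, the defaults above could be carried by a record like the one below. The field names and default values are taken from the list; the field types, the validation, the defaults() factory, and the two comparison helpers are assumptions added for the sketch and may not match the real DriftThresholds.

```java
/** Sketch of a thresholds record; field types and helper methods are assumptions. */
record DriftThresholds(int minObservationCount,
                       double maxWinRateDriftPct,
                       double maxProfitFactorDriftRatio,
                       double maxDrawdownMultiplier,
                       double maxAvgTradeDurationDriftRatio,
                       int evaluationWindowDays) {

    /** Reject configurations that would make every comparison meaningless. */
    DriftThresholds {
        if (minObservationCount < 1 || evaluationWindowDays < 1) {
            throw new IllegalArgumentException("counts and window must be at least 1");
        }
        if (maxWinRateDriftPct < 0 || maxProfitFactorDriftRatio <= 0
                || maxDrawdownMultiplier <= 0 || maxAvgTradeDurationDriftRatio <= 0) {
            throw new IllegalArgumentException("thresholds must be positive");
        }
    }

    /** The default values listed in the article. */
    static DriftThresholds defaults() {
        return new DriftThresholds(10, 15.0, 0.5, 1.5, 2.0, 30);
    }

    /** True when live PF falls below the allowed fraction of baseline PF. */
    boolean profitFactorDrifted(double baselinePf, double livePf) {
        return livePf < baselinePf * maxProfitFactorDriftRatio;
    }

    /** True when live max drawdown exceeds the allowed multiple of baseline. */
    boolean drawdownBreached(double baselineDrawdown, double liveDrawdown) {
        return liveDrawdown > baselineDrawdown * maxDrawdownMultiplier;
    }
}
```

With the defaults, a baseline drawdown of 15% gives a pause line at 22.5%, and a baseline profit factor of 2.0 tolerates a live profit factor down to 1.0.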

These are not magic numbers. Each one was tuned against six months of paper trading where strategies that later failed in live execution showed clear metric divergence 2-3 weeks before P&L turned negative.

The Broker Observation Pipeline

Drift observations come from the BrokerRunExecutor, which handles the actual strategy execution against a broker (OANDA paper or live). Every order submission, fill, rejection, and risk block is persisted to the EventStore. After each execution cycle, the accumulated events are reduced into a BacktestRunMetrics object and stored as a BrokerObservation.

The event store is SQLite-backed (same pattern as the backtest event store). Each run produces a sequence of RunEvent records: RUN_STARTED, ORDER_SUBMITTED, ORDER_FILLED, ORDER_REJECTED, RISK_BLOCKED, KILL_BLOCKED, DAILY_DD_BLOCKED, RUN_COMPLETED. Reducing these into metrics is a straightforward O(n) pass over the event stream.
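
To make the single-pass reduction concrete, here is a hedged sketch. The event type names are the ones listed above, but the RunEvent shape, RunSummary, and EventReducer are hypothetical. In particular, the sketch assumes each ORDER_FILLED event carries the realized P&L of one closed trade, which the article does not specify.

```java
import java.util.List;

/** Event types as listed in the article. */
enum RunEventType {
    RUN_STARTED, ORDER_SUBMITTED, ORDER_FILLED, ORDER_REJECTED,
    RISK_BLOCKED, KILL_BLOCKED, DAILY_DD_BLOCKED, RUN_COMPLETED
}

/** Hypothetical event shape: realizedPnl is only read for ORDER_FILLED. */
record RunEvent(RunEventType type, double realizedPnl) {}

/** Hypothetical reduced metrics; the real BacktestRunMetrics is not shown. */
record RunSummary(int fills, int wins, int rejections, int blocks,
                  double grossProfit, double grossLoss) {
    double winRate() {
        return fills == 0 ? 0.0 : (double) wins / fills;
    }

    /** Assumption: with no losing trades the profit factor is reported as infinity. */
    double profitFactor() {
        return grossLoss == 0.0 ? Double.POSITIVE_INFINITY : grossProfit / grossLoss;
    }
}

final class EventReducer {
    private EventReducer() {}

    /** One pass over the stream, constant work per event. */
    static RunSummary reduce(List<RunEvent> events) {
        int fills = 0, wins = 0, rejections = 0, blocks = 0;
        double grossProfit = 0.0, grossLoss = 0.0;
        for (RunEvent e : events) {
            switch (e.type()) {
                case ORDER_FILLED -> {
                    fills++;
                    if (e.realizedPnl() > 0) {
                        wins++;
                        grossProfit += e.realizedPnl();
                    } else if (e.realizedPnl() < 0) {
                        grossLoss += -e.realizedPnl();
                    }
                }
                case ORDER_REJECTED -> rejections++;
                case RISK_BLOCKED, KILL_BLOCKED, DAILY_DD_BLOCKED -> blocks++;
                default -> { /* lifecycle and submission events carry no metrics here */ }
            }
        }
        return new RunSummary(fills, wins, rejections, blocks, grossProfit, grossLoss);
    }
}
```

Drawdown and trade duration are left out to keep the sketch short; both would need timestamps and a running equity curve, which can still be maintained inside the same loop.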

Gating: When Drift Blocks Execution

A PAUSE recommendation is not advisory. The KillSwitchRegistry reads the latest drift recommendation before every execution cycle. If the recommendation is PAUSE, the kill switch engages and the strategy does not run. The operator receives an alert through the control plane and reviews the data before manually resetting the kill switch.

This is the same kill switch mechanism used for daily drawdown limits, account-level risk limits, and manual intervention. The drift engine is one of several inputs to the kill switch registry, which applies a logical OR: if any input says pause, the strategy pauses.
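
The logical OR is simple enough to sketch in a few lines. This is an illustration under assumptions, not the KillSwitchRegistry API: KillSwitchInput, KillSwitchSketch, and the method names are hypothetical, and each input is modeled as a named supplier that answers "pause?" when asked.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical kill switch input: a name plus a check that says whether it wants a pause. */
record KillSwitchInput(String name, BooleanSupplier wantsPause) {}

final class KillSwitchSketch {
    private KillSwitchSketch() {}

    /** Names of every input currently demanding a pause; an empty list means the cycle may run. */
    static List<String> pausingInputs(List<KillSwitchInput> inputs) {
        return inputs.stream()
            .filter(input -> input.wantsPause().getAsBoolean())
            .map(KillSwitchInput::name)
            .toList();
    }

    /** Logical OR across inputs: a single pausing input blocks execution. */
    static boolean mayExecute(List<KillSwitchInput> inputs) {
        return pausingInputs(inputs).isEmpty();
    }
}
```

Returning the names of the pausing inputs, instead of only a boolean, gives the operator alert something concrete to report.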

Why Not ML?

Every trading system blog post eventually asks: why not train a model to detect drift? Because drift detection is a threshold problem, not a pattern recognition problem. The baseline is known. The live metrics are measurable. The question is whether the difference exceeds a bound that matters for survival.

A decision matrix with five explicit thresholds is more maintainable, more auditable, and more trustworthy than a black-box classifier. When a strategy gets paused, the operator can see exactly why: win rate dropped from 62% to 44%, and max drawdown hit 2.1x the baseline. No feature importance scores, no SHAP values, no mystery.
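
One way to get that kind of explanation is to build the reason strings next to the threshold checks. The sketch below is hypothetical (DriftReasons and its parameters are not from the engine) and covers only the two metrics from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: turns threshold breaches into operator-readable reasons. */
final class DriftReasons {
    private DriftReasons() {}

    static List<String> explain(double baselineWinRate, double liveWinRate,
                                double maxWinRateDriftPct,
                                double baselineDrawdown, double liveDrawdown,
                                double maxDrawdownMultiplier) {
        List<String> reasons = new ArrayList<>();

        // Win rate is compared in absolute percentage points.
        double deltaPoints = Math.abs(baselineWinRate - liveWinRate) * 100.0;
        if (deltaPoints > maxWinRateDriftPct) {
            reasons.add(String.format(Locale.ROOT,
                "win rate moved from %.0f%% to %.0f%% (limit %.0f points)",
                baselineWinRate * 100.0, liveWinRate * 100.0, maxWinRateDriftPct));
        }

        // Drawdown is compared as a multiple of the baseline maximum.
        double multiple = liveDrawdown / baselineDrawdown;
        if (multiple > maxDrawdownMultiplier) {
            reasons.add(String.format(Locale.ROOT,
                "max drawdown at %.1fx baseline (limit %.1fx)",
                multiple, maxDrawdownMultiplier));
        }
        return reasons;
    }
}
```

An empty list means neither check fired. Every entry names the metric, the observed values, and the limit, so the operator does not have to work backwards to find out why a strategy was paused.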

Practical Results

Over three months of live trading across 12 strategies, the drift engine has:

Paused 2 strategies that later recovered after parameter re-optimization (correct pause, no harm done)
Flagged 1 strategy for monitoring that eventually failed the baseline (early warning, 11 days before P&L went negative)
Never paused a strategy that was performing correctly (zero false positives -- so far)

Small sample size, but directionally correct. The cost of a false positive is missed opportunity (the strategy does not trade until the operator reviews the data and resets the kill switch). The cost of a false negative is a blown account. The thresholds are tuned to err on the side of caution.

Drift detection is not glamorous. It is not AI-powered, it does not use neural networks, and it will never generate a tweet-worthy chart. But it is the single most important piece of infrastructure between a strategy getting promoted to live and that strategy taking an unacceptable loss. The backtest is the promise. Drift detection is the auditor.