The Testing Pyramid That Actually Works for Trading Software

Unit tests catch bugs. Integration tests catch regressions. Golden backtest baselines catch the silent failures that make you lose money at 3 AM.

Martin Fournier · June 13, 2026 · 6 min read

Every software testing guide starts with the same diagram: a pyramid with unit tests at the bottom, integration tests in the middle, and end-to-end tests at the top. Write lots of unit tests, a reasonable number of integration tests, and a few E2E tests.

This advice works great for CRUD apps and web services. It falls apart when you build trading software.

The problem is not that the pyramid is wrong for the code itself. The trading-core module has clean domain models, indicators with deterministic math, and pure functions that map inputs to outputs. Standard testing advice applies there. The problem is that the pyramid does not account for the single most expensive class of bug in a trading system: the silent numerical regression.

A SQL query that references a nonexistent column throws an error. A React component with a bad prop crashes the render. A trading strategy that computes entries 0.3% away from the intended price does not crash. It silently loses money for six months before anyone notices.

Here is what actually works for a trading codebase with 11 Maven modules, 40+ migrated strategies, and real money on the line.

Tier 1: Deterministic unit tests for domain primitives

This is the standard bottom of the pyramid, and it applies normally. Indicators are pure math. The SMA of bars 1 through 14 is always the same number. Write parameterized tests that assert exact values. IndicatorsTest covers SMA, EMA, RSI, ATR, and every derived function with known inputs and exact expected outputs.
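A minimal sketch of what such a table-driven check can look like, written as plain Java so it runs without a test framework. The class name SmaExample and its sma method are illustrative stand-ins, not the actual IndicatorsTest or indicator code. The inputs are chosen so that every expected value is exactly representable as a double, which is what makes an exact comparison safe.

```java
// Hypothetical stand-in for the indicator code; names are illustrative only.
final class SmaExample {

    /** Simple moving average of the last `period` closes. */
    static double sma(double[] closes, int period) {
        if (period <= 0 || period > closes.length) {
            throw new IllegalArgumentException("period must be in 1..closes.length");
        }
        double sum = 0.0;
        for (int i = closes.length - period; i < closes.length; i++) {
            sum += closes[i];
        }
        return sum / period;
    }

    public static void main(String[] args) {
        // Each row: input closes, period, exact expected SMA.
        double[][] inputs = { {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, {2, 4, 6, 8} };
        int[] periods = { 5, 2, 4 };
        double[] expected = { 3.0, 4.5, 5.0 };

        for (int i = 0; i < inputs.length; i++) {
            double actual = sma(inputs[i], periods[i]);
            // Exact comparison: no tolerance is needed for these inputs.
            if (actual != expected[i]) {
                throw new AssertionError("case " + i + ": expected " + expected[i] + ", got " + actual);
            }
        }
        System.out.println("all SMA cases passed");
    }
}
```

In a real suite the three rows would become the arguments of a parameterized test, one row per known input.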

Bar endianness is another domain primitive that deserves its own test. The binary bar format has a fixed byte order. When the DataLoader reads a file, every byte must decode into the correct open, high, low, close, and timestamp. One swapped byte and your backtest thinks you traded at 4 AM instead of 4 PM. Test this with a hand-crafted binary buffer containing known values. Assert every field. Do not trust that endianness is "probably fine" on x86.
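Here is a sketch of such a test. The 40-byte little-endian layout (epoch milliseconds as a long, then open, high, low, close as doubles) is an assumption made for illustration; the article does not specify the real bar format, and BarDecodingExample is a hypothetical name. The bytes are written out by hand rather than produced by an encoder, so a decoder with the wrong byte order cannot pass by round-tripping its own mistake. Records need Java 16 or later.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Assumed layout (not from the article): little-endian long epoch millis,
// followed by four little-endian doubles: open, high, low, close.
final class BarDecodingExample {

    record Bar(Instant timestamp, double open, double high, double low, double close) {}

    static final int BAR_BYTES = Long.BYTES + 4 * Double.BYTES;

    static Bar decode(byte[] raw) {
        if (raw.length != BAR_BYTES) {
            throw new IllegalArgumentException("expected " + BAR_BYTES + " bytes, got " + raw.length);
        }
        // ByteBuffer is big-endian by default on every CPU, so set the order explicitly.
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        return new Bar(Instant.ofEpochMilli(buf.getLong()),
                buf.getDouble(), buf.getDouble(), buf.getDouble(), buf.getDouble());
    }

    /** Lets the test write byte values as plain ints such as 0xF8. */
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = bytes(
                0xE8, 0x03, 0, 0, 0, 0, 0, 0,   // timestamp: 1000 ms after the epoch
                0, 0, 0, 0, 0, 0, 0xF8, 0x3F,   // open  = 1.5
                0, 0, 0, 0, 0, 0, 0x00, 0x40,   // high  = 2.0
                0, 0, 0, 0, 0, 0, 0xF0, 0x3F,   // low   = 1.0
                0, 0, 0, 0, 0, 0, 0xFC, 0x3F);  // close = 1.75
        Bar bar = decode(raw);
        // Assert every field, not just the close.
        if (!bar.timestamp().equals(Instant.ofEpochMilli(1000))) throw new AssertionError("timestamp");
        if (bar.open() != 1.5 || bar.high() != 2.0) throw new AssertionError("open/high");
        if (bar.low() != 1.0 || bar.close() != 1.75) throw new AssertionError("low/close");
        System.out.println("decoded " + bar);
    }
}
```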

Time conventions also belong here. Every timestamp in the system must be a UTC Instant. A strategy that uses LocalDateTime for bar comparison introduces a silent DST bug that only manifests around the clock changes, a few days per year. TimeConventionsTest enforces that no trading logic path accepts zone-aware or local types.
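To make the DST hazard concrete, here is a small sketch using only java.time. It takes two bar opens one hour apart that straddle the 2012 US fall-back transition and shows that they collapse to the same wall-clock value once converted to LocalDateTime. The class and method names are illustrative, and America/New_York is just one example zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class DstPitfallExample {

    /** True when two distinct instants map to the same local wall-clock time in `zone`. */
    static boolean sameWallClock(Instant a, Instant b, ZoneId zone) {
        return LocalDateTime.ofInstant(a, zone).equals(LocalDateTime.ofInstant(b, zone));
    }

    public static void main(String[] args) {
        ZoneId newYork = ZoneId.of("America/New_York");
        Instant first = Instant.parse("2012-11-04T05:30:00Z");   // 01:30 local, daylight time
        Instant second = Instant.parse("2012-11-04T06:30:00Z");  // 01:30 local, standard time

        // As Instants the bars are correctly ordered, one hour apart.
        System.out.println("ordered as Instant: " + first.isBefore(second));
        // As LocalDateTime they are indistinguishable.
        System.out.println("equal as LocalDateTime: " + sameWallClock(first, second, newYork));
    }
}
```

Any bar lookup, deduplication, or ordering keyed on the local value would treat these two bars as one, which is why the convention bans local types from trading logic entirely.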

Tier 2: Contract tests for the backtest engine

The backtest engine is the heart of the system. It takes bars, runs a strategy, and produces trades. Testing it is neither a unit test nor an integration test in the traditional sense. It is a contract test: given this exact set of bars and this exact strategy, the engine must produce exactly these trades in exactly this order with exactly these fill prices.

BacktestEngineContractTest runs the LondonOpenRangeBreakout strategy against the first 744 H1 bars of EUR/USD 2012. It asserts:

  • Total bars processed: 744
  • Total trades: 3
  • Total return: 0.0137585714285704%
  • Total PnL: $13.758571428570399
  • Max drawdown: 0.0266378078515083%

Every value is checked with tight tolerances. A change in the engine that shifts fill logic by one pip, or changes order queue semantics, or modifies how MARKET orders resolve against bar open, will fail this test.
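The comparison itself can be sketched as follows. The BacktestResult record, the violations helper, and the 1e-9 absolute tolerance are all assumptions for illustration; the article shows neither the engine's result type nor the tolerances it uses. The expected numbers are the ones listed above. Integer counts are compared exactly, floating-point metrics against an explicit tolerance.

```java
import java.util.ArrayList;
import java.util.List;

final class ContractCheckExample {

    /** Hypothetical result shape; the real engine's type is not shown in the article. */
    record BacktestResult(int bars, int trades, double returnPct, double pnl, double maxDrawdownPct) {}

    /** One message per violated expectation; an empty list means the contract holds. */
    static List<String> violations(BacktestResult expected, BacktestResult actual, double tolerance) {
        List<String> out = new ArrayList<>();
        // Counts are integers, so they must match exactly.
        if (expected.bars() != actual.bars()) out.add("bars: got " + actual.bars());
        if (expected.trades() != actual.trades()) out.add("trades: got " + actual.trades());
        // Floating-point metrics are compared with an explicit absolute tolerance.
        check(out, "returnPct", expected.returnPct(), actual.returnPct(), tolerance);
        check(out, "pnl", expected.pnl(), actual.pnl(), tolerance);
        check(out, "maxDrawdownPct", expected.maxDrawdownPct(), actual.maxDrawdownPct(), tolerance);
        return out;
    }

    private static void check(List<String> out, String name, double expected, double actual, double tol) {
        // Written as a negated <= so that a NaN metric also counts as a violation.
        if (!(Math.abs(expected - actual) <= tol)) {
            out.add(name + ": expected " + expected + ", got " + actual);
        }
    }

    public static void main(String[] args) {
        BacktestResult expected =
                new BacktestResult(744, 3, 0.0137585714285704, 13.758571428570399, 0.0266378078515083);
        // Simulated regression: PnL drifts slightly while everything else stays put.
        BacktestResult drifted =
                new BacktestResult(744, 3, 0.0137585714285704, 13.76, 0.0266378078515083);
        violations(expected, drifted, 1e-9).forEach(System.out::println);
    }
}
```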

The full-year baseline covers 8760 bars and expects 61 trades and $139.67 PnL. It runs locally but is skipped in CI when the full historical dataset is not present. The CI subset (744 bars) is committed to the repo specifically so CI catches regressions without requiring a 200 MB data download.

Tier 3: Strategy-level golden baselines

Each production strategy gets its own golden baseline. When a strategy graduates from backtest qualification to the promote pipeline, the system captures its exact performance metrics against the canonical dataset. These values live in GoldenBacktestBaseline.java as public constants.

The promote gate reads these constants before promoting a strategy from backtest to paper trading. If the candidate strategy's metrics deviate from the baseline by more than the configured tolerance, the gate blocks the promotion and logs the delta. This catches:

  • Code changes that unintentionally alter strategy behavior
  • Data format migrations that shift bar values
  • Library upgrades that change math precision
  • Anything else that produces a different numerical answer for the same inputs
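A minimal sketch of such a gate, assuming metrics are passed as name-to-value maps and the tolerance is relative. None of this is the actual promote pipeline or GoldenBacktestBaseline.java; PromoteGateExample, the metric names, and the 1e-6 tolerance are hypothetical. The baseline numbers in the usage are the full-year figures quoted earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PromoteGateExample {

    /** Relative deviation of each candidate metric from its baseline value. */
    static Map<String, Double> deltas(Map<String, Double> baseline, Map<String, Double> candidate) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            double expected = e.getValue();
            Double actual = candidate.get(e.getKey());
            // A missing metric counts as an infinite deviation, so the gate always blocks on it.
            double delta = actual == null
                    ? Double.POSITIVE_INFINITY
                    : Math.abs(actual - expected) / Math.max(Math.abs(expected), Double.MIN_NORMAL);
            out.put(e.getKey(), delta);
        }
        return out;
    }

    /** Blocks promotion and logs the delta for every metric outside the tolerance. */
    static boolean allowPromotion(Map<String, Double> baseline, Map<String, Double> candidate, double relTol) {
        boolean allowed = true;
        for (Map.Entry<String, Double> e : deltas(baseline, candidate).entrySet()) {
            // Negated <= so that a NaN delta blocks as well.
            if (!(e.getValue() <= relTol)) {
                System.err.println("BLOCKED " + e.getKey() + ": relative delta " + e.getValue());
                allowed = false;
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = Map.of("trades", 61.0, "pnl", 139.67);
        Map<String, Double> candidate = Map.of("trades", 61.0, "pnl", 139.70);
        System.out.println("promote: " + allowPromotion(baseline, candidate, 1e-6));
    }
}
```

Logging every offending metric, rather than stopping at the first, makes the blocked promotion easier to diagnose.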

This is the most valuable test in the entire system. It has caught regressions that no unit test or integration test could have found. One example: a Jackson upgrade changed how BigDecimal values were deserialized from JSON, shifting slippage calculations by 0.0002% per trade. The golden baseline flagged it. The unit tests all passed because they used hardcoded double literals and never exercised the deserialization path.
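The snippet below is not a reproduction of that Jackson incident. It is a generic illustration of the underlying mechanism: the same decimal text gives different arithmetic depending on whether it ends up as a double or as a BigDecimal. A test that starts from double literals only ever sees one of those two worlds. The class and method names are illustrative.

```java
import java.math.BigDecimal;

final class DecimalParsingExample {

    /** Parses the text as a binary double, then adds it `times` times. */
    static double sumAsDouble(String value, int times) {
        double parsed = Double.parseDouble(value);
        double total = 0.0;
        for (int i = 0; i < times; i++) {
            total += parsed;
        }
        return total;
    }

    /** Parses the text as an exact decimal, then adds it `times` times. */
    static BigDecimal sumAsDecimal(String value, int times) {
        BigDecimal parsed = new BigDecimal(value);
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < times; i++) {
            total = total.add(parsed);
        }
        return total;
    }

    public static void main(String[] args) {
        // Ten additions of 0.1: the double path accumulates rounding error, the decimal path does not.
        System.out.println(sumAsDouble("0.1", 10) == 1.0);                        // false
        System.out.println(sumAsDecimal("0.1", 10).compareTo(BigDecimal.ONE) == 0); // true
    }
}
```

A golden baseline runs the whole path from serialized input to final metric, so a change in which of these two representations a value passes through shows up as a numerical delta.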

Tier 4: Monte Carlo and portfolio correlation tests

These sit above the golden baselines. They do not test correctness of a single run. They test statistical properties across many runs.

MonteCarloSimulationTest runs 10,000 randomized parameter variations and checks that the distribution of returns, drawdowns, and Sharpe ratios falls within expected ranges. It is stochastic, so it uses confidence intervals instead of exact assertions. A strategy that passes its golden baseline but fails Monte Carlo often has a hidden path dependency or overfitting issue.
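A stripped-down sketch of a confidence-interval assertion, under heavy assumptions: the "backtest" here is a toy that draws returns from a normal distribution, and the class, method names, seed, and parameters are all hypothetical. A real MonteCarloSimulationTest would run the strategy for each parameter variation. What the sketch does show is the shape of the check: a fixed seed for reproducibility, and an interval around the sample mean instead of an exact expected value.

```java
import java.util.Random;

final class MonteCarloCheckExample {

    /** Toy stand-in for `runs` randomized backtests; a real run would execute the strategy. */
    static double[] simulateReturns(long seed, int runs, double mean, double stdDev) {
        Random rng = new Random(seed);  // fixed seed keeps the test reproducible
        double[] out = new double[runs];
        for (int i = 0; i < runs; i++) {
            out[i] = mean + stdDev * rng.nextGaussian();
        }
        return out;
    }

    /** True when expectedMean lies within z standard errors of the sample mean. */
    static boolean meanWithinInterval(double[] samples, double expectedMean, double z) {
        int n = samples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        double mean = sum / n;
        double squares = 0.0;
        for (double s : samples) {
            squares += (s - mean) * (s - mean);
        }
        // Standard error of the mean, using the unbiased sample variance.
        double stdErr = Math.sqrt(squares / (n - 1)) / Math.sqrt(n);
        return Math.abs(mean - expectedMean) <= z * stdErr;
    }

    public static void main(String[] args) {
        double[] returns = simulateReturns(42L, 10_000, 0.001, 0.02);
        System.out.println("matches expected mean: " + meanWithinInterval(returns, 0.001, 4.0));
        System.out.println("matches a mean ten times larger: " + meanWithinInterval(returns, 0.01, 4.0));
    }
}
```

The width multiplier z trades false alarms against sensitivity: a wide interval rarely fails by chance but also lets small shifts through. The same pattern extends to drawdowns and Sharpe ratios.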

CorrelationMatrixTest runs every strategy pair against the same data and computes return correlation. Highly correlated pairs get flagged for portfolio review. This is not a pass/fail test. It produces a report that feeds the portfolio construction workflow.
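The pairwise computation can be sketched with a plain Pearson correlation over per-period returns. The class name, the flagPairs helper, and the 0.9 threshold are assumptions for illustration; the article does not say which correlation measure or cutoff the real report uses.

```java
import java.util.ArrayList;
import java.util.List;

final class CorrelationExample {

    /** Pearson correlation of two equally long return series. */
    static double pearson(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("series must have equal length of at least 2");
        }
        int n = a.length;
        double meanA = 0.0, meanB = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        // A constant series yields NaN here, which never exceeds the threshold below.
        return cov / Math.sqrt(varA * varB);
    }

    /** Returns "X/Y" labels for every pair whose absolute correlation reaches the threshold. */
    static List<String> flagPairs(String[] names, double[][] returns, double threshold) {
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            for (int j = i + 1; j < names.length; j++) {
                if (Math.abs(pearson(returns[i], returns[j])) >= threshold) {
                    flagged.add(names[i] + "/" + names[j]);
                }
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String[] names = { "A", "B", "C" };
        double[][] returns = {
            { 0.01, -0.02, 0.03, 0.00 },
            { 0.02, -0.04, 0.06, 0.00 },   // B is A scaled by two
            { 0.01, 0.01, -0.01, -0.01 },
        };
        System.out.println("pairs for review: " + flagPairs(names, returns, 0.9));
    }
}
```

The output is a list for a human to review, in keeping with the report-not-gate role described above.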

Where the standard pyramid survives

The standard pyramid works fine for infrastructure code. Broker connector tests (trading-data, trading-broker) use standard integration test patterns with WireMock stubs for OANDA REST endpoints. The control plane HTTP server uses typical web framework testing. The runtime event store uses in-memory SQLite for test isolation.

These modules follow normal testing conventions because they are normal software. The database connection either works or it doesn't. The HTTP endpoint either returns 200 or it doesn't. Standard testing applies.

The pattern generalizes

Any system where numerical correctness is more important than functional correctness needs this approach. Financial models, scientific computing, simulation engines, game physics, and ML inference pipelines all benefit from golden baselines over traditional integration tests.

The principle is simple: when the cost of a silent wrong answer exceeds the cost of a crash, your tests must pin specific numerical values. Not "approximately correct." Not "within a reasonable range." A specific expected number, with an explicit tolerance that you tighten over time.

Unit tests catch bugs. Integration tests catch regressions. Golden backtest baselines catch the silent failures that make you lose money at 3 AM when the market opens and your strategy computes entries on stale data because a data loader change shifted bar timestamps by one nanosecond.