Operations Manifest for a Trading System: What 900 Lines of Runbook Taught About Running Code That Handles Money

The Trust Gap

After 14 weeks of paper trading with 30+ strategies across 9 forex pairs, a pattern emerged. The software ran. The fills matched. But no one could answer the question: "Is the platform healthy right now?"

The backtest engine produced results. The broker connectors stayed connected. The SQLite event store kept accepting writes. But there was no operational layer -- no definition of what "healthy" means, no structured way to detect drift, no pre-flight checklist before starting a paper run, and no incident response flow when something went wrong at 2 AM.

The gap wasn't in the code. It was in the operations.

The Three-Layer Mental Model

The first decision was structural: how do you organize what the operator needs to know?

The answer was a three-layer pyramid:

Layer 1, Platform Health: Is the software running? Are broker connections live? Is the event store accepting writes?
Layer 2, Strategy Health: Is each strategy behaving as expected? Is paper tracking backtest within tolerance?
Layer 3, Portfolio Health: Am I making or losing money overall?

The rule: always check bottom-up. If Layer 1 is red, Layers 2 and 3 are unreliable. Fix the platform first, then assess strategies, then review the portfolio.

This sounds obvious in retrospect, but without it, operators jumped straight to portfolio P&L and panicked about a drawdown that turned out to be caused by a stale broker connection reporting zero positions.
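
The bottom-up rule can be expressed as a short sketch. Everything here is illustrative rather than the platform's actual code: the layer keys, the Status enum, and the assess function are assumed names, and each layer's raw reading is reduced to a single boolean.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    RED = "red"
    UNRELIABLE = "unreliable"  # a lower layer is red, so this reading cannot be trusted


# Layers from the pyramid, ordered bottom-up (hypothetical key names).
LAYERS = ("platform", "strategy", "portfolio")


def assess(raw: dict[str, bool]) -> dict[str, Status]:
    """Walk the layers bottom-up; everything above the first red layer is UNRELIABLE."""
    result: dict[str, Status] = {}
    blocked = False
    for layer in LAYERS:
        if blocked:
            result[layer] = Status.UNRELIABLE
        elif raw[layer]:
            result[layer] = Status.GREEN
        else:
            result[layer] = Status.RED
            blocked = True
    return result


if __name__ == "__main__":
    # Stale broker connection: the platform is red, so the apparent drawdown is not trusted.
    report = assess({"platform": False, "strategy": True, "portfolio": False})
    for layer, status in report.items():
        print(f"{layer:10s} {status.value}")
```

In this sketch a red portfolio reading is reported as unreliable rather than red whenever the platform layer has failed, which is the behavior that would have prevented the stale-connection panic described above.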

SLIs and SLOs for Trading

The reliability engineering guide defined eight Service Level Indicators for the platform. Five of them are shown here:

SLI	Target	Window
Trade persistence rate	>= 99.9%	7 days
Broker reconnection time (P99)	< 30 seconds	Per-event
Run success rate	>= 99.5%	30 days
Control plane availability	>= 99.9%	30 days
Reconciliation divergence rate	<= 1%	7 days

Each SLO has an error budget. A 99.9% target over roughly 10k events per week allows 10 missed events. When the trade persistence SLO burns through 50% of that weekly budget (5 missed events out of 10k), the event store gets investigated. At 100% (10 missed events), all promotions to LIVE freeze.

The critical insight: error budgets change who owns the risk. Instead of a human deciding "is this stable enough to promote?", the budget makes the call. If the budget is exhausted, no promotions happen until the burning issue is resolved. This removes the emotional pressure to ship during an incident.
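
The budget arithmetic is simple enough to sketch. For a 99.9% target over 10,000 events, the full budget is 10,000 x (1 - 0.999) = 10 missed events. The function names and the action labels below are assumptions made for illustration; the 50% and 100% thresholds come from the text. Fraction is used instead of floats because 1 - 0.999 is not exact in binary floating point, and a threshold check at exactly 50% should not depend on rounding.

```python
from fractions import Fraction


def budget_consumed(total_events: int, failed_events: int, slo_target: str) -> Fraction:
    """Fraction of the error budget used so far; 1 means the budget is exhausted.

    slo_target is a decimal string such as "0.999" so it converts exactly.
    """
    allowed_failures = total_events * (1 - Fraction(slo_target))
    if allowed_failures <= 0:
        raise ValueError("no error budget: empty window or a 100% target")
    return Fraction(failed_events) / allowed_failures


def budget_action(consumed: Fraction) -> str:
    """Map budget consumption to the response described in the text (labels are hypothetical)."""
    if consumed >= 1:
        return "freeze_promotions"
    if consumed >= Fraction(1, 2):
        return "investigate"
    return "ok"


if __name__ == "__main__":
    for failed in (4, 5, 10):
        used = budget_consumed(10_000, failed, "0.999")
        print(failed, float(used), budget_action(used))
```

Because the action is a pure function of the measured budget, the promote decision needs no human judgment call during an incident.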

Incident Severity Matrix

Not all failures are equal. The manifest defined four severity levels:

P0: Capital at risk or platform down (e.g., control plane unresponsive, all brokers disconnected). Respond immediately, wake whoever is on call.
P1: Single strategy compromised or degraded (e.g., one broker unreachable, reconciliation divergence for one run). Respond within the session.
P2: Degraded operations with no immediate capital risk (e.g., stale detection firing late, heartbeat delivery below 99%). Investigate within 24 hours.
P3: Minor friction (e.g., promote gate latency above 15 seconds). Log it, fix in the next sprint.

P0 and P1 get full runbooks. P2 gets a checklist. P3 gets a Jira ticket.
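
One way to keep the matrix from drifting is to encode it as data. The sketch below is an assumption about how that might look, not the manifest's format: Severity, Policy, POLICIES, and classify are hypothetical names, and the four boolean inputs are a deliberate simplification of the examples given for each level.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class Policy:
    respond: str   # how quickly to respond
    artifact: str  # what documentation backs the response


# Response policy per level, taken from the matrix in the text.
POLICIES = {
    Severity.P0: Policy("immediately, wake on-call", "full runbook"),
    Severity.P1: Policy("within the session", "full runbook"),
    Severity.P2: Policy("within 24 hours", "checklist"),
    Severity.P3: Policy("next sprint", "ticket"),
}


def classify(capital_at_risk: bool, platform_down: bool,
             strategy_degraded: bool, operations_degraded: bool) -> Severity:
    """Return the most severe level that applies; checks run from P0 downward."""
    if capital_at_risk or platform_down:
        return Severity.P0
    if strategy_degraded:
        return Severity.P1
    if operations_degraded:
        return Severity.P2
    return Severity.P3


if __name__ == "__main__":
    level = classify(capital_at_risk=False, platform_down=False,
                     strategy_degraded=True, operations_degraded=True)
    print(level.name, POLICIES[level])
```

Checking from P0 downward means an incident that matches several levels is always handled at the most severe one.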

The Pre-Flight Checklist

Before every paper trading session, the operator runs a five-item checklist:

1. Control summary shows green. All runs fresh, no gaps, no stale signals.
2. Broker accounts are reachable. GET /api/health returns 200, OANDA API responds.
3. SQLite WAL mode is active. The journal mode reported by the database matches what the platform expects.
4. No residual runs from the last session. Every previous run reached RUN_ENDED.
5. Strategies scheduled for this session are ready. No pending promote gates, no unresolved reconciliation alerts.

Items 1-3 take 30 seconds. Items 4-5 take another 60 seconds. Total: 90 seconds of deliberate verification before a run that may last 6 hours.
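
The two database checks (WAL mode and residual runs) are easy to automate. The sketch below uses Python's standard sqlite3 module and the real PRAGMA journal_mode statement, but the schema is an assumption: an events table with run_id and event_type columns is invented here for illustration and may not match the platform's event store. The HTTP and broker checks are omitted because they depend on a live service.

```python
import os
import sqlite3
import tempfile


def wal_active(conn: sqlite3.Connection) -> bool:
    """True if the database reports write-ahead logging as its journal mode."""
    (mode,) = conn.execute("PRAGMA journal_mode").fetchone()
    return mode.lower() == "wal"


def residual_runs(conn: sqlite3.Connection) -> list[str]:
    """Run ids that have events but never recorded RUN_ENDED (assumed schema)."""
    rows = conn.execute(
        """
        SELECT DISTINCT run_id FROM events
        WHERE run_id NOT IN (
            SELECT run_id FROM events WHERE event_type = 'RUN_ENDED'
        )
        ORDER BY run_id
        """
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # WAL needs a file-backed database; an in-memory one reports "memory".
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "events.db"))
        conn.execute("PRAGMA journal_mode=WAL").fetchone()
        conn.execute(
            "CREATE TABLE events (run_id TEXT NOT NULL, event_type TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO events VALUES (?, ?)",
            [
                ("run-1", "RUN_STARTED"),
                ("run-1", "RUN_ENDED"),
                ("run-2", "RUN_STARTED"),  # never ended: should be flagged
            ],
        )
        conn.commit()
        print("WAL active:", wal_active(conn))
        print("Residual runs:", residual_runs(conn))
        conn.close()
```

Scripting the checks does not replace the operator's 90 seconds; it makes the answer to each item unambiguous.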

Runbook Structure

Each runbook in the reliability guide follows a consistent pattern:

1. CONFIRM     -- Verify the signal is real (curl the endpoint, check the log)
2. ISOLATE     -- Determine scope (one strategy? all strategies? one broker?)
3. RESOLVE     -- Execute the fix (restart the run, switch broker, roll back)
4. VERIFY      -- Confirm the fix worked and the platform returned to green
5. DOCUMENT    -- Log the incident, update the runbook if the fix was novel

The structure ensures no one skips confirmation and starts fixing a problem that already resolved itself. The "CONFIRM" step alone prevents most false-alarm incidents.
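
The five steps map naturally onto a small runner that refuses to fix anything until the signal is confirmed. This is a minimal sketch under stated assumptions: the Runbook class, its callable fields, and the outcome strings are hypothetical, and real steps would call endpoints and read logs rather than return canned values.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Runbook:
    name: str
    confirm: Callable[[], bool]      # is the signal real?
    isolate: Callable[[], str]       # returns the affected scope
    resolve: Callable[[str], None]   # applies the fix to that scope
    verify: Callable[[], bool]       # is the platform green again?
    log: list[str] = field(default_factory=list)

    def run(self) -> str:
        """Execute the steps in order; stop before any fix if CONFIRM fails."""
        if not self.confirm():
            outcome = "false_alarm"
            self.log.append("CONFIRM: signal not reproduced, no action taken")
        else:
            self.log.append("CONFIRM: signal is real")
            scope = self.isolate()
            self.log.append(f"ISOLATE: scope={scope}")
            self.resolve(scope)
            self.log.append("RESOLVE: fix applied")
            ok = self.verify()
            self.log.append(f"VERIFY: {'green' if ok else 'still failing'}")
            outcome = "resolved" if ok else "escalate"
        # DOCUMENT always runs, including for false alarms.
        self.log.append(f"DOCUMENT: runbook={self.name} outcome={outcome}")
        return outcome


if __name__ == "__main__":
    rb = Runbook(
        name="broker-disconnect",
        confirm=lambda: True,
        isolate=lambda: "one broker",
        resolve=lambda scope: None,
        verify=lambda: True,
    )
    print(rb.run())
    print("\n".join(rb.log))
```

A false alarm still produces a log entry in this sketch, so repeated false alarms from the same signal remain visible and can be fixed at the source.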

What the Post-Mortem Template Looks Like

Every incident gets a post-mortem with five sections:

Summary: One paragraph. What happened, when, what was the blast radius.
Timeline: Minute-by-minute from first signal to all-clear.
Root cause: One sentence. No blame, no excuses.
What worked: Runbook steps that executed correctly, monitoring that fired on time.
Action items: Concrete changes to code, config, or the runbook itself.

The last section is the most important. If a runbook step was unclear or wrong, that is a bug in the runbook, and it gets fixed immediately. The manifest is version-controlled alongside the code for exactly this reason.

Why This Matters

A trading system that works in theory but has no operational layer is not production-ready. The code handles edge cases. The tests pass. But the first time a broker disconnects at 2 AM with no operator present, the difference between chaos and a structured response is the runbook that was written while everything was still green.

The 900-line operations manifest and the 800-line reliability guide were written in two afternoons. Together they amount to a small fraction of the codebase. But they are the most valuable documentation in the repository, because they define what "healthy" means, how to detect it drifting, and what to do when it breaks.

Trust is not built by writing better code. It is built by proving the platform works -- repeatedly, measurably, and with a paper trail.