Replay Operations
Operational runbook for replay execution, projection repair, and integrity troubleshooting.
For replay architecture and invariants, start with Event replay.
When to use this runbook
Use replay operations when:
- projection state is stale or inconsistent with the event journal sequence
- adapter changes require a deterministic read-model rebuild
- integrity verification detects replay-path failures
Replay modes in operations
Full replay
Use for maximum-confidence rebuilds after major projection or adapter changes.
Tradeoff: highest runtime cost.
Snapshot-accelerated replay
Use for routine catch-up and faster recovery when snapshot/checkpoint state is trusted.
Tradeoff: relies on snapshot/checkpoint correctness.
Partial replay
Use when replay scope is known and bounded by sequence.
Tradeoff: requires confidence in start sequence and campaign scope.
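
The tradeoffs above can be read as a small decision rule. The sketch below is one possible policy, not the project's actual selection logic: `ReplayContext`, its fields, and `choose_mode` are hypothetical names introduced for illustration, and the ordering of the checks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReplayMode(Enum):
    FULL = "full"
    SNAPSHOT = "snapshot"
    PARTIAL = "partial"


@dataclass(frozen=True)
class ReplayContext:
    # Hypothetical inputs an operator would gather before choosing a mode.
    major_adapter_change: bool      # projection or adapter logic changed broadly
    snapshot_trusted: bool          # snapshot/checkpoint state has been verified
    known_start_seq: Optional[int]  # first affected sequence, if confidently known


def choose_mode(ctx: ReplayContext) -> ReplayMode:
    # Major changes call for the maximum-confidence rebuild despite its cost.
    if ctx.major_adapter_change:
        return ReplayMode.FULL
    # A confidently known start sequence allows a bounded replay.
    if ctx.known_start_seq is not None:
        return ReplayMode.PARTIAL
    # Snapshot acceleration is only safe when the snapshot is trusted.
    if ctx.snapshot_trusted:
        return ReplayMode.SNAPSHOT
    # With no trusted shortcut available, fall back to a full replay.
    return ReplayMode.FULL
```

When in doubt, the rule degrades toward full replay, which trades runtime for confidence.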
Operator workflow
- Identify affected campaign IDs and failure symptoms.
- Confirm latest checkpoint and expected sequence head.
- Choose replay mode (full/snapshot/partial).
- Execute replay run.
- Validate projection parity and checkpoint progression.
- Re-run affected integration/smoke checks if change was broad.
Detecting projection gaps
Gap indicators:
- non-contiguous projection sequence markers
- checkpoint stagnation despite new events
- entity state that does not match the latest event-derived facts
Primary checks:
- compare campaign event head sequence vs projection/checkpoint sequences
- verify adapter routing coverage for event types in affected interval
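
As an illustration of the sequence comparison, the sketch below finds missing ranges in a list of applied sequence numbers and flags a checkpoint that stayed put while the head advanced. It assumes sequences are integers starting at 1; `find_gaps` and `checkpoint_stalled` are hypothetical helpers, not existing tooling.

```python
from typing import List, Tuple


def find_gaps(applied: List[int], head: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) ranges in 1..head missing from `applied`."""
    gaps = []
    expected = 1
    for seq in sorted(set(applied)):
        if seq > head:
            break
        if seq > expected:
            # Everything between the last applied sequence and this one is missing.
            gaps.append((expected, seq - 1))
        expected = seq + 1
    if expected <= head:
        # The projection stopped short of the journal head.
        gaps.append((expected, head))
    return gaps


def checkpoint_stalled(before: Tuple[int, int], after: Tuple[int, int]) -> bool:
    """Each argument is a (checkpoint, head) pair observed at a different time.

    Reports True when the head advanced but the checkpoint did not move.
    """
    return after[1] > before[1] and after[0] == before[0]
```

For example, `find_gaps([1, 2, 4, 5, 8], head=10)` returns `[(3, 3), (6, 7), (9, 10)]`, which gives the intervals whose event types are worth checking for adapter routing coverage.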
Repairing projection gaps
- stop unsafe writes for affected scope if required
- run replay in chosen mode
- verify contiguous sequence application through target head
- compare critical projection entities against expected event outcomes
- restore normal writes after parity checks pass
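
The ordering of these steps matters most around the write pause: writes resume only once parity holds. A minimal sketch of that ordering follows, assuming the pause, replay, and resume actions are supplied as callables; `repair_scope` and `parity_mismatches` are hypothetical names, and the projection is modeled as a plain dictionary.

```python
from typing import Any, Callable, Dict, List


def parity_mismatches(projected: Dict[str, Any], expected: Dict[str, Any]) -> List[str]:
    """Return the keys of critical entities whose projected value is missing or wrong."""
    return sorted(
        key for key in expected
        if key not in projected or projected[key] != expected[key]
    )


def repair_scope(
    pause_writes: Callable[[], None],
    run_replay: Callable[[], Dict[str, Any]],
    resume_writes: Callable[[], None],
    expected: Dict[str, Any],
) -> List[str]:
    """Pause writes, replay, and resume only if the parity check passes."""
    pause_writes()
    projected = run_replay()
    mismatches = parity_mismatches(projected, expected)
    if not mismatches:
        resume_writes()
    # On mismatch, writes deliberately stay paused for the operator to decide.
    return mismatches
```

An empty return value means parity passed and writes were restored; a non-empty one lists the entities to investigate while the scope remains paused.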
Post-persist fold/apply failures
If event append succeeded but fold/apply failed:
- treat journal event as authoritative
- fix failing adapter/folder path
- rerun replay to reconcile derived state
Do not delete authoritative events to “repair” projections.
Integrity checks and constraints
- sequence continuity is mandatory
- hash/signature verification failures are blocking
- unknown system module/adapter routing is fail-fast
- replay operations must not bypass canonical apply paths
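
A rough sketch of how these constraints might combine in a single apply loop is shown below. The hash scheme (SHA-256 chained over the previous hash, sequence, type, and payload), the event dictionary shape, and the names `chain_hash`, `verify_and_apply`, and `IntegrityError` are all assumptions made for illustration; the real journal format and verification scheme may differ.

```python
import hashlib
from typing import Any, Callable, Dict, List


class IntegrityError(Exception):
    """Raised on any blocking integrity failure during replay."""


def chain_hash(prev_hash: str, seq: int, event_type: str, payload: str) -> str:
    # Assumed scheme: each event's hash covers the previous event's hash.
    data = f"{prev_hash}|{seq}|{event_type}|{payload}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def verify_and_apply(
    events: List[Dict[str, Any]],
    handlers: Dict[str, Callable[[Dict[str, Any], Dict[str, Any]], None]],
    state: Dict[str, Any],
) -> Dict[str, Any]:
    """Apply events in order, stopping at the first integrity failure."""
    prev_hash = ""
    expected_seq = 1
    for ev in events:
        # Sequence continuity is mandatory.
        if ev["seq"] != expected_seq:
            raise IntegrityError(f"sequence gap: expected {expected_seq}, got {ev['seq']}")
        # Hash verification failures are blocking.
        if ev["hash"] != chain_hash(prev_hash, ev["seq"], ev["type"], ev["payload"]):
            raise IntegrityError(f"hash mismatch at sequence {ev['seq']}")
        # Unknown routing fails fast rather than silently skipping the event.
        handler = handlers.get(ev["type"])
        if handler is None:
            raise IntegrityError(f"no adapter routed for event type {ev['type']!r}")
        # Every event goes through the same handler table used for normal applies.
        handler(state, ev)
        prev_hash = ev["hash"]
        expected_seq += 1
    return state
```

Because the loop raises on the first failure, state applied before the failing event is left in place; a caller that needs all-or-nothing behavior would replay into a scratch copy first.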
Practical safeguards
- run in smaller campaign batches when diagnosing failures
- capture before/after checkpoint and head sequences per campaign
- log reason codes and failing event types for repeated failures
- keep a rollback plan for operational windows that include a write pause
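
The batching and before/after capture could be sketched as follows. `SeqSnapshot`, `batches`, and `progress_report` are hypothetical names, and the three status labels are illustrative rather than established reason codes.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class SeqSnapshot:
    checkpoint: int  # last sequence applied to the projection
    head: int        # latest sequence in the campaign's event journal


def batches(campaign_ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield campaign IDs in fixed-size batches to narrow down failures."""
    if size <= 0:
        raise ValueError("size must be positive")
    for i in range(0, len(campaign_ids), size):
        yield campaign_ids[i:i + size]


def progress_report(
    before: Dict[str, SeqSnapshot], after: Dict[str, SeqSnapshot]
) -> Dict[str, str]:
    """Classify each campaign by comparing snapshots taken around a replay run."""
    report = {}
    for campaign_id, b in before.items():
        a = after[campaign_id]
        if a.checkpoint == a.head:
            report[campaign_id] = "caught_up"
        elif a.checkpoint > b.checkpoint:
            report[campaign_id] = "advanced"
        else:
            report[campaign_id] = "stalled"
    return report
```

Recording both snapshots per campaign makes it possible to tell a replay that is merely slow ("advanced") from one that made no progress ("stalled").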