Integration Tests
Overview
Integration tests exercise the full request path through the game server and SQLite storage using real processes and transports. These tests increase trust in end-to-end behavior across the game gRPC service. Contributors should use the public runtime Make targets rather than older integration/scenario-specific aliases.
Goals
- Validate game gRPC traffic through direct client calls.
- Verify server and storage wiring in one run.
- Keep tests deterministic by avoiding or normalizing random output.
- Support local runs and CI execution.
Non-goals
- Performance or load testing.
- Cross-platform process orchestration beyond standard CI runners.
Execution Model
- Start the game and auth servers in-process on ephemeral ports.
- Dial gRPC clients for each service (campaign, participant, character, session, fork, etc.).
- Exercise service operations and assert responses.
The integration harness creates a shared fixture stack and provides per-test suites with gRPC clients and user identity. AI-scoped fixtures exercise the full orchestration path including direct tool dispatch.
Canonical Service Chains
Use integration tests for cross-service runtime chains that need real transport, storage, and startup wiring:
- invite -> worker -> notifications -> userhub: prove invite outbox events become inbox notifications and then appear on the dashboard.
- web -> play -> game: prove an authenticated web launch reaches play and a real interaction mutates game state.
- admin -> game: prove admin pages and HTMX refresh paths stay aligned with live game mutations.
- discovery -> game: prove builtin starter reconciliation creates a real starter campaign and persists the resulting `discovery_source_id`.
- userhub degraded mode: prove one optional downstream can fail without breaking the whole dashboard.
Keep pure game acceptance behavior in scenario scripts. Do not move game-only workflows into integration tests just to increase service count.
Lane Ownership
- `make test`: unit and package-level seams. Keep this fast.
- `make smoke`: one representative test per critical service chain plus the scenario smoke manifest.
- `make check`: full local confidence before PR update.
When adding a new runtime feature, prefer extending one canonical service-chain suite instead of creating another bespoke bootstrap stack.
Determinism and Randomness
- Prefer deterministic endpoints for assertions (example: `duality_outcome`).
- For responses with IDs, timestamps, or seeds, validate structure and reuse values across steps instead of matching exact strings.
- Parse timestamps as RFC3339 and assert non-empty IDs.
Tagging and CI
- Integration tests use the build tag `integration`.
- Local run:
go test -tags=integration ./...
Live AI capture
The GM bootstrap fixture can also be refreshed from a live model run. This is a manual lane, not part of normal CI, and it exists to prove that a real model can use the exposed tools before the resulting exchange is replayed deterministically.
Run the live lane with:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureBootstrap -count=1
For Daggerheart capability/mechanics guidance changes, also run the live character-capability lane so the recording proves the model can inspect a sheet before committing a mechanics-aware beat:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureCapabilityLookup -count=1
For authoritative Daggerheart mechanics-tool changes, also run the live review lane so the recording proves the model can combine sheet lookup, action resolution, and GM review resolution in one turn:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureMechanicsReview -count=1
For Daggerheart combat-procedure changes, also run the live attack-review lane so the recording proves the model can combine sheet lookup, combat-board inspection, and the full attack-flow tool during GM review:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureAttackReview -count=1
For Daggerheart reaction-procedure changes, also run the live reaction-review lane so the recording proves the model can combine sheet lookup and the reaction-flow tool during GM review:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureReactionReview -count=1
For Daggerheart playbook/reference changes, also run the live playbook attack lane so the recording proves the model can discover a repo-owned playbook via system_reference_search/read before resolving combat:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCapturePlaybookAttackReview -count=1
For Daggerheart board-control changes, also run the live spotlight-board review lane so the recording proves the model can discover the spotlight/countdown playbook guidance, mutate adversary and countdown state, and then re-read the board before opening the next beat:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureSpotlightBoardReview -count=1
For Daggerheart countdown-trigger lifecycle changes, also run the live countdown-trigger review lane so the recording proves the model can create a scene countdown, advance it to TRIGGER_PENDING, resolve the trigger, and re-read the board before opening the next beat:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureCountdownTriggerReview -count=1
For Daggerheart GM Fear placement changes, also run the live GM-move placement lane so the recording proves the model can create an adversary, spend Fear through daggerheart_gm_move_apply, and re-read the board before reopening the scene:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureGMMovePlacementReview -count=1
For Daggerheart adversary combat-procedure changes, also run the live adversary-attack review lane so the recording proves the model can inspect the board, resolve an adversary attack, and then reopen play:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureAdversaryAttackReview -count=1
For Daggerheart group-action and tag-team tooling changes, also run the live group-action and tag-team lanes so the recording proves the model can read the relevant character sheets before using the coordinated combat tools:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run 'TestAIGMCampaignContextLiveCapture(GroupActionReview|TagTeamReview)$' -count=1
To run the full Daggerheart live mechanics suite added on this branch in one batch:
INTEGRATION_OPENAI_API_KEY=... \
go test -tags='integration liveai' ./internal/test/integration \
-run 'TestAIGMCampaignContextLiveCapture(CapabilityLookup|MechanicsReview|AttackReview|ReactionReview|PlaybookAttackReview|SpotlightBoardReview|CountdownTriggerReview|GMMovePlacementReview|AdversaryAttackReview|GroupActionReview|TagTeamReview)$' \
-count=1
Optional environment variables:
- `INTEGRATION_AI_MODEL`: model name to use; defaults to `gpt-5-mini`.
- `INTEGRATION_AI_REASONING_EFFORT`: Responses API reasoning effort; when unset, the live lane leaves the provider default in place.
- `INTEGRATION_OPENAI_RESPONSES_URL`: alternate OpenAI-compatible Responses URL.
- `INTEGRATION_AI_WRITE_FIXTURE=1`: allow the test to overwrite the committed replay fixture after a successful live run.
Behavior:
- Raw live provider captures are always written under `.tmp/ai-live-captures/` for local inspection.
- Each live capture, including failed runs that reached the execution path, writes sibling `.summary.json` and `.diagnostics.json` artifacts with the structured failure summary, quality-metric status, tool/reference counts, token usage, and the related raw/markdown artifact names.
- The committed replay fixture is updated only when `INTEGRATION_AI_WRITE_FIXTURE=1` is set.
- Failed live runs do not overwrite the committed fixture.
For the current checked-in Daggerheart mechanics comparison table built from those summaries, see `daggerheart-live-mechanics-matrix.md`.
OpenViking Evaluation Status
The first retrieval-before-prompt evaluation phase is now complete enough to guide the next integration step. These results use the pinned OpenViking v0.2.10 sidecar, docs_aligned_supplement, and the latest validated gpt-5.4-mini runs from March 25, 2026.
| Lane | Baseline input tokens | OpenViking input tokens | Result | Notes |
|---|---|---|---|---|
| Bootstrap | 85,954 | 67,475 | clean_pass -> clean_pass | Valid retrieval and clear prompt-load reduction |
| MechanicsReview | 55,822 | 55,736 | clean_pass -> clean_pass | Effective parity after backing-file story rendering |
| ReactionReview | 69,096 | 56,388 | clean_pass -> clean_pass | Positive after repair; candidate still duplicated one memory update |
| CapabilityLookup | 64,799 | 65,128 | clean_pass -> clean_pass | Clean but not a win; candidate drifted to artifact get/upsert behavior |
Current outcome: Hold / limited-adoption leaning positive.
- `Bootstrap` is a real OpenViking win.
- `MechanicsReview` and `ReactionReview` are acceptable after retrieval-path repair.
- `CapabilityLookup` is still unresolved, so this phase is not a clean `Proceed`.
Do not rerun the broad live matrix by default. The current next steps are:
- Investigate `CapabilityLookup` token drift and artifact behavior before spending on more retrieval-first lane comparisons.
- Continue the separate session-memory runtime track.
- Only after those are resolved, decide whether to expand lanes or default-enable `docs_aligned_supplement`.
When you do need to reproduce a lane, keep the comparison shape identical:
- `INTEGRATION_AI_MODEL`
- `INTEGRATION_AI_REASONING_EFFORT`
- `INTEGRATION_OPENAI_RESPONSES_URL`
- scenario prompt and fixture state
- fixture-write behavior: leave `INTEGRATION_AI_WRITE_FIXTURE` unset
The live lane still defaults to an augmentation-only OpenViking evaluation when the sidecar is enabled:
- `FRACTURING_SPACE_AI_OPENVIKING_SESSION_SYNC_ENABLED` defaults to `false` inside the live capture harness unless explicitly set.
- `FRACTURING_SPACE_AI_OPENVIKING_RESOURCE_SYNC_TIMEOUT` defaults to `20s` inside the live capture harness unless explicitly set.
Set `INTEGRATION_OPENVIKING_REQUIRE_VALID_AUGMENTATION=1` for candidate runs. That makes the test fail fast unless:
- augmentation was attempted
- augmentation did not degrade
- retrieval search actually ran
- at least one OpenViking resource or memory context was retrieved
Use `docs_aligned_supplement` as the only evaluation candidate mode. Keep the `legacy` mode available only for local debugging.
Direct resource smoke:
FRACTURING_SPACE_AI_OPENVIKING_BASE_URL=http://127.0.0.1:1933 \
FRACTURING_SPACE_AI_OPENVIKING_MIRROR_ROOT=$HOME/.openviking/data/fracturing-space \
FRACTURING_SPACE_AI_OPENVIKING_VISIBLE_MIRROR_ROOT=/app/data/fracturing-space \
go test -tags='integration liveai' ./internal/test/integration \
-run TestOpenVikingResourceSearchLive -count=1
Bootstrap baseline:
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureBootstrap \
-count=1
Bootstrap candidate:
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
INTEGRATION_OPENVIKING_REQUIRE_VALID_AUGMENTATION=1 \
FRACTURING_SPACE_AI_OPENVIKING_BASE_URL=http://127.0.0.1:1933 \
FRACTURING_SPACE_AI_OPENVIKING_MODE=docs_aligned_supplement \
FRACTURING_SPACE_AI_OPENVIKING_MIRROR_ROOT=$HOME/.openviking/data/fracturing-space \
FRACTURING_SPACE_AI_OPENVIKING_VISIBLE_MIRROR_ROOT=/app/data/fracturing-space \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureBootstrap \
-count=1
To reproduce the current retrieval-first evidence, use the same baseline then candidate pattern for Bootstrap, MechanicsReview, ReactionReview, and CapabilityLookup:
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureMechanicsReview \
-count=1
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureReactionReview \
-count=1
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureCapabilityLookup \
-count=1
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
INTEGRATION_OPENVIKING_REQUIRE_VALID_AUGMENTATION=1 \
FRACTURING_SPACE_AI_OPENVIKING_BASE_URL=http://127.0.0.1:1933 \
FRACTURING_SPACE_AI_OPENVIKING_MODE=docs_aligned_supplement \
FRACTURING_SPACE_AI_OPENVIKING_MIRROR_ROOT=$HOME/.openviking/data/fracturing-space \
FRACTURING_SPACE_AI_OPENVIKING_VISIBLE_MIRROR_ROOT=/app/data/fracturing-space \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureMechanicsReview \
-count=1
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
INTEGRATION_OPENVIKING_REQUIRE_VALID_AUGMENTATION=1 \
FRACTURING_SPACE_AI_OPENVIKING_BASE_URL=http://127.0.0.1:1933 \
FRACTURING_SPACE_AI_OPENVIKING_MODE=docs_aligned_supplement \
FRACTURING_SPACE_AI_OPENVIKING_MIRROR_ROOT=$HOME/.openviking/data/fracturing-space \
FRACTURING_SPACE_AI_OPENVIKING_VISIBLE_MIRROR_ROOT=/app/data/fracturing-space \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureReactionReview \
-count=1
INTEGRATION_OPENAI_API_KEY=... \
INTEGRATION_AI_MODEL=gpt-5-mini \
INTEGRATION_OPENVIKING_REQUIRE_VALID_AUGMENTATION=1 \
FRACTURING_SPACE_AI_OPENVIKING_BASE_URL=http://127.0.0.1:1933 \
FRACTURING_SPACE_AI_OPENVIKING_MODE=docs_aligned_supplement \
FRACTURING_SPACE_AI_OPENVIKING_MIRROR_ROOT=$HOME/.openviking/data/fracturing-space \
FRACTURING_SPACE_AI_OPENVIKING_VISIBLE_MIRROR_ROOT=/app/data/fracturing-space \
go test -tags='integration liveai' ./internal/test/integration \
-run TestAIGMCampaignContextLiveCaptureCapabilityLookup \
-count=1
Use these summary fields for lane comparison:
- `result_class`
- `tool_error_count`
- `openviking_enabled`
- `openviking_mode`
- `initial_prompt_has_story_md`
- `initial_prompt_has_memory_md`
- `retrieved_resource_count`
- `retrieved_memory_count`
- `retrieved_rendered_uris`
- `retrieved_content_sources`
- `input_tokens`
- `output_tokens`
- `reasoning_tokens`
- `total_tokens`
If a new candidate run degrades before retrieval or returns zero retrieved contexts, stop there and fix the OpenViking path before spending on more scenarios.
Direct OpenViking session-memory check
The live GM lane is intentionally running augmentation-first right now, so session memory remains a separate seam check rather than part of the first paid adoption gate.
FRACTURING_SPACE_AI_OPENVIKING_BASE_URL=http://127.0.0.1:1933 \
go test -tags='integration liveai' ./internal/test/integration \
-run TestOpenVikingSessionMemoryLive -count=1
Treat that check as sidecar-seam evidence only:
- it proves session create/message append/commit/search against OpenViking
- it does not prove the AI-service runtime session-sync path is ready for default use
- runtime session sync and session-memory retrieval inside campaign turns are the next integration track after the retrieval-before-prompt work
- do not use a passing seam check as evidence that OpenViking memory should already replace curated recap or prompt-time memory artifacts
Promptfoo evaluation lane
Promptfoo now has a non-gating phase-2 evaluation lane for comparing live AI GM behavior across models and instruction profiles without replacing the repo-owned Go orchestration harness.
Run the fast core comparison with:
INTEGRATION_OPENAI_API_KEY=... make ai-eval-promptfoo-core
Run the deeper decision matrix with:
INTEGRATION_OPENAI_API_KEY=... make ai-eval-promptfoo-decision
To inspect recent Promptfoo runs in the local web UI:
make ai-eval-promptfoo-view
If the default Promptfoo viewer port is already occupied, choose another one:
make ai-eval-promptfoo-view PROMPTFOO_VIEW_PORT=15501 PROMPTFOO_VIEW_ARGS="--no"
Notes:
- This evaluation lane is not part of `make check`.
- It uses the live AI capture lane through `cmd/aieval`, then emits Promptfoo-friendly JSON for matrix comparison and report generation.
- `make ai-eval-promptfoo-core` runs the default `core` scenario set once per case for quick engineering iteration.
- `make ai-eval-promptfoo-decision` runs the same `core` set with three repeats per case for model or prompt-profile comparison.
- The `core` set focuses on mechanics-fidelity scenarios such as Hope spend + experience use, stance capability checks, narrator authority, and subdue intent. The `extended` set covers playbook/reference and spotlight-board lanes.
- Use `PROMPTFOO_ARGS='...'` to pass filters or output options through to the underlying `promptfoo eval`.
- The wrapper uses `promptfoo@latest` by default. Set `PROMPTFOO_NPX_SPEC` if you need to force a specific Promptfoo package version for one run.
- Promptfoo persistence is routed to `.tmp/promptfoo-home/` by default so the local database, logs, and `view` state stay in a writable repo-local path. Override `PROMPTFOO_CONFIG_DIR` only when you intentionally want a different Promptfoo home.
- `make ai-eval-promptfoo-view` runs `npx promptfoo@latest view` so you can inspect recent eval results, failed assertions, and per-case output details in the Promptfoo UI. When Promptfoo does not persist a fresh headless eval on its own, the repo wrapper synthesizes `results.json` from captured provider case outputs and imports that eval into Promptfoo so the viewer still has a fresh local record to open.
- Set `PROMPTFOO_VIEW_PORT` when `15500` is already in use. Use `PROMPTFOO_VIEW_ARGS="--no"` when you want the server to start without attempting to open a browser.
- Each run writes a stable artifact bundle under `.tmp/promptfoo/<run-id>/` with `results.json`, `scorecard.md`, per-case provider captures under `cases/`, and any captured harness logs.
- Each Promptfoo case is isolated with a stable case id so concurrent model/prompt/repeat runs do not overwrite one another's eval JSON or live capture artifacts.
- Promptfoo failures are intentionally compact in the report. Raw `go test` stderr/stdout is preserved in artifact logs instead of being embedded inline in the Promptfoo error field, while structured live `.diagnostics.json` artifacts carry the useful failure description.
- Promptfoo scorecards separate quality failures from invalid runtime runs. Invalid runs stay visible in the report, but they do not count against the model-quality pass rate.
- Promptfoo is the comparison/reporting layer only. The live Go harness remains the authoritative execution path, and replay fixtures remain the deterministic regression surface.
Phase 2 status
Phase 2 is complete for local operator use:
- `make ai-eval-promptfoo-core`, `make ai-eval-promptfoo-decision`, and `make ai-eval-promptfoo-view` are the supported command surface.
- Compact failure summaries, per-case diagnostics, and stable artifact bundles under `.tmp/promptfoo/<run-id>/` are expected outputs, not optional extras.
- Promptfoo remains non-gating and does not replace replay or live integration tests.
What to do now
Use the existing phase-2 surface for comparison and diagnosis instead of adding more Promptfoo plumbing for now:
- Run `make ai-eval-promptfoo-core` when a model, prompt-profile, or GM-control change needs a fast comparison against the canonical scenario set.
- Run `make ai-eval-promptfoo-decision` before changing the preferred GM model or default instruction profile.
- Inspect `.tmp/promptfoo/<run-id>/scorecard.md` first, then follow artifact links into `.summary.json`, `.diagnostics.json`, raw captures, and harness logs when a row needs deeper debugging.
- Treat `metric_status=invalid` rows as runtime diagnostics to fix or rerun, not as model-quality evidence.
Defer new eval ladders, critique mode, and broader vendor expansion to a later phase-3 effort.
Supported verification commands
For the supported contributor workflow, use the canonical Verification commands surface. Raw `go test -tags=integration ./...` remains useful for low-level debugging, but it is not the default contributor path.
Scenario Sharding
Scenario tests support deterministic sharding for CI fanout. Treat shard entry points as internal CI plumbing; contributors should use `make test`, `make smoke`, and `make check`.
Integration Sharding
Integration tests support deterministic top-level test sharding for CI fanout and CI may invoke shard-specific targets internally.
Top-level `Test...` names are assigned to shards by a stable hash modulo the shard total.
CI target guidance:
- Pull requests should use the public
make checksurface locally. - Main/nightly workflows may shard full runtime lanes internally for fanout.
- Nightly soak runs may enable shared-fixture variants as internal workflow detail.
Runtime Reporting
Runtime reports are generated from `go test -json` output by CI/internal automation, and the public local verification commands now also emit live status artifacts under `.tmp/test-status/`. Treat the shard scripts and report-generation helpers as internal plumbing; the supported public surface remains `make test`, `make smoke`, and `make check`.
Checklist
- If event definitions changed, run `go run ./internal/tools/eventdocgen` and confirm the event catalog is updated in the diff.
- Use the public Make verification surface above; avoid depending on retired shard/plumbing targets in contributor-facing docs.