Campaign AI Mechanics Quality
Durable follow-on design guidance for mechanics-heavy AI GM turns. This note is grounded in the March 22, 2026 Daggerheart live-mechanics acceptance run, but the recommendations are framed as AI-service behavior patterns rather than system-specific one-offs. Use the live evidence in Daggerheart Live Mechanics Matrix as the factual baseline. Use this note for the next design moves.
What The Live Run Proved
- The current mechanics tool surface is broad enough for real AI GM play. The agent completed accepted live runs for sheet reads, board reads, action resolution, combat flows, Fear moves, adversary placement, and scene countdown lifecycle.
- Player-facing mechanical outcomes are usually legible in direct resolution lanes. The model can state roll outcome, damage, HP, Armor, Hope, and next action options in a way players can follow.
- Bounded reference usage is healthier than eager lookup. Board-control lanes were clean without consulting the reference corpus, while the explicit playbook lane could still succeed with one intentional search/read pair.
- The remaining instability is primarily runtime and recovery behavior. Session gates, precondition mismatches, and extra recovery calls now cause more live variance than missing mechanics tools.
Observed Gaps
- Board-control turns still leak raw engine vocabulary too easily. Countdown IDs, adversary IDs, and internal state labels are useful for QA and memory, but they should not bleed into player-facing beats.
- The current live suite proves tool execution better than intent interpretation. Many lanes still name the mechanic family up front instead of forcing the model to infer it from the player’s phrasing.
- The AI service lacks a dedicated critique surface. We can see when a run over-researched, over-recovered, or lacked context, but the production GM lane is not the place to ask the model to explain that back to us.
- Reference usage still needs stage awareness. The model should not research Fear, spotlight, or countdown procedures before those mechanics are actually relevant on the current turn.
Recommended Next Workstreams
1. Mechanics Communication Contract
For mechanics-heavy turns, committed GM interactions should follow one stable shape:
- fiction: what just happened in-world
- consequence: the authoritative mechanical result
- guidance: what changed in the decision space
- prompt: what the player does next
Player-facing consequence beats should explicitly name any resource or board delta that matters to play now:
- HP
- Armor
- Hope
- Stress
- Fear
- spotlight owner
- visible countdown pressure
Keep internal IDs, enum names, and engine state labels in memory or OOC notes only. The player should hear the game state change, not storage vocabulary.
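As a rough illustration of the contract above, the four-part shape and the "no storage vocabulary" rule could be checked with something like the following Go sketch. The `Beat` type, the `LeakedTerms` helper, and the ID and enum patterns are all assumptions made for this example; the engine's real identifier formats are not specified in this note, so the pattern would need to be adapted.

```go
package main

import (
	"fmt"
	"regexp"
)

// Beat is a hypothetical container for one committed GM interaction,
// mirroring the fiction / consequence / guidance / prompt shape.
type Beat struct {
	Fiction     string // what just happened in-world
	Consequence string // the authoritative mechanical result
	Guidance    string // what changed in the decision space
	Prompt      string // what the player does next
}

// internalVocab matches ASSUMED storage-style tokens: underscore-prefixed IDs
// such as "countdown_7f3a" and SCREAMING_SNAKE enum names. Plain resource
// names like "HP" or "Hope" do not match.
var internalVocab = regexp.MustCompile(`\b(?:countdown|adversary)_[0-9a-z]+\b|\b[A-Z]+(?:_[A-Z]+)+\b`)

// LeakedTerms returns every internal-looking token found in the
// player-facing parts of a beat, in reading order.
func LeakedTerms(b Beat) []string {
	var leaks []string
	for _, part := range []string{b.Fiction, b.Consequence, b.Guidance, b.Prompt} {
		leaks = append(leaks, internalVocab.FindAllString(part, -1)...)
	}
	return leaks
}

func main() {
	clean := Beat{
		Fiction:     "Your blade bites deep and the raider staggers.",
		Consequence: "The raider marks 2 HP. You spend 1 Hope and have 3 left.",
		Guidance:    "The ritual is one step closer to completing.",
		Prompt:      "What do you do?",
	}
	leaky := clean
	leaky.Guidance = "countdown_7f3a advanced; status is TRIGGER_PENDING."

	fmt.Println("clean leaks:", LeakedTerms(clean))
	fmt.Println("leaky leaks:", LeakedTerms(leaky))
}
```

A check like this is only a heuristic lint for eval assertions. It cannot judge whether the consequence beat names every delta that matters to play.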
2. Intent-To-Mechanics Eval Ladder
Add a dedicated evaluation track where the player’s natural-language intent is the main mechanic cue. These lanes should not name the required tool family in the player prompt.
The first ladder should cover:
- explicit Hope spend for a named feature
- named feature use that may or may not be legal from the current sheet
- domain-card use
- equipment-driven action
- impossible declared capability that should trigger clarification instead of permissive narration
- ambiguous intent that should trigger clarification instead of a bad tool call
- multi-actor intent that should become group action or tag-team resolution
Add a parallel narrator-authority check to the same ladder:
- prompt beats should ask what the player character does next
- prompt beats should not ask the player to author NPC dialogue or story-world answers
Use deterministic integration coverage first, then live-agent lanes for the same cases.
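One way to keep the ladder honest is a table of cases plus a lint that rejects any player prompt naming a mechanic family. The Go sketch below is illustrative only: `IntentCase`, `Expectation`, the cue list, and the sample prompts are assumptions, not the existing harness types or scenarios.

```go
package main

import (
	"fmt"
	"strings"
)

// Expectation is the outcome a lane asserts. Values are illustrative.
type Expectation int

const (
	ExpectToolCall Expectation = iota
	ExpectClarification
)

// IntentCase is one hypothetical rung on the ladder.
type IntentCase struct {
	Name         string
	PlayerPrompt string
	Expect       Expectation
}

// familyCues are ASSUMED phrases that would hand the model the mechanic
// family instead of making it infer the family from intent.
var familyCues = []string{"action roll", "group action", "tag team", "tag-team", "countdown"}

// NamesMechanicFamily reports whether a prompt contains any family cue.
func NamesMechanicFamily(prompt string) bool {
	p := strings.ToLower(prompt)
	for _, cue := range familyCues {
		if strings.Contains(p, cue) {
			return true
		}
	}
	return false
}

func main() {
	cases := []IntentCase{
		{"hope-spend", "I spend a Hope on my signature feature to vault the gap.", ExpectToolCall},
		{"impossible-capability", "I sprout wings and fly over the wall.", ExpectClarification},
		{"ambiguous-intent", "I deal with the guard.", ExpectClarification},
		{"multi-actor", "Kira and I rush the ogre together.", ExpectToolCall},
		{"bad-lane", "We make a group action roll against the ogre.", ExpectToolCall},
	}
	for _, c := range cases {
		if NamesMechanicFamily(c.PlayerPrompt) {
			fmt.Printf("%s: prompt names a mechanic family, rewrite it\n", c.Name)
		}
	}
}
```

The same table can drive both the deterministic integration pass and the live-agent pass, so the two stay aligned case by case.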
3. Diagnostic / Coach Mode
Add a separate, non-authoritative critique path that runs after a live or replay capture. It should inspect the transcript, tool trace, and summary artifacts and return:
- the chosen tool path and why it likely happened
- unnecessary reference lookups
- missing context that would have prevented a lookup or failed call
- better tool shapes, prompt policy, or always-on guidance
- player-facing clarity issues in the committed beats
This critique mode should not share the authoritative GM channel. It is a product and tooling analysis surface, not part of the live scene turn.
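The report described above could be carried in a plain data structure that is produced offline and never written to the GM channel. The Go sketch below is an assumption about shape only: `CritiqueReport`, `ToolCall`, `Skeleton`, and every field and tool name are hypothetical. Only the tool path is filled mechanically; the judgment fields are left for the critique pass.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolCall is a hypothetical tool-trace entry taken from a capture.
type ToolCall struct {
	Name   string
	Failed bool
}

// CritiqueReport mirrors the five outputs listed above. It is advisory
// and non-authoritative; nothing here feeds back into the live scene turn.
type CritiqueReport struct {
	ToolPath           []string `json:"tool_path"`
	ToolPathRationale  string   `json:"tool_path_rationale"`
	UnnecessaryLookups []string `json:"unnecessary_lookups"`
	MissingContext     []string `json:"missing_context"`
	SuggestedChanges   []string `json:"suggested_changes"`
	ClarityIssues      []string `json:"clarity_issues"`
}

// Skeleton fills the one field that can be derived without judgment:
// the ordered tool path, with failed calls marked.
func Skeleton(trace []ToolCall) CritiqueReport {
	var r CritiqueReport
	for _, c := range trace {
		name := c.Name
		if c.Failed {
			name += " (failed)"
		}
		r.ToolPath = append(r.ToolPath, name)
	}
	return r
}

func main() {
	// Tool names below are placeholders, not the real tool surface.
	trace := []ToolCall{
		{Name: "sheet_read"},
		{Name: "reference_search"},
		{Name: "action_resolve", Failed: true},
	}
	out, err := json.MarshalIndent(Skeleton(trace), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```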
4. Bounded Reference Strategy
Preserve the current two-layer model and make it explicit:
- short always-on operational primer for common mechanics turns
- on-demand playbooks and reference reads only when exact procedure is unclear
Reference budgets should become part of evaluation policy:
- no reference lookup in board-control lanes that already have clear state and an obvious authoritative tool path
- exactly one intentional search/read pair in explicit playbook lanes
- no pre-emptive Fear, spotlight, or countdown research before those mechanics are active in the current turn
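The first two budgets above are mechanical enough to assert directly against a tool trace. The following Go sketch assumes hypothetical tool names (`reference_search`, `reference_read`) and lane kinds. The third budget, on pre-emptive research, is left out because it needs to know which mechanics are active on the current turn.

```go
package main

import "fmt"

// LaneKind distinguishes the two lane types that carry a reference budget.
type LaneKind int

const (
	BoardControl LaneKind = iota
	ExplicitPlaybook
)

// Call is a hypothetical tool-trace entry; tool names are assumptions.
type Call struct {
	Tool string
}

// CheckReferenceBudget returns an error when a trace exceeds (or, for
// playbook lanes, does not exactly meet) its reference budget.
func CheckReferenceBudget(kind LaneKind, trace []Call) error {
	searches, reads := 0, 0
	for _, c := range trace {
		switch c.Tool {
		case "reference_search":
			searches++
		case "reference_read":
			reads++
		}
	}
	switch kind {
	case BoardControl:
		if searches+reads > 0 {
			return fmt.Errorf("board-control lane made %d reference calls, budget is 0", searches+reads)
		}
	case ExplicitPlaybook:
		if searches != 1 || reads != 1 {
			return fmt.Errorf("playbook lane made %d searches and %d reads, budget is exactly one pair", searches, reads)
		}
	}
	return nil
}

func main() {
	trace := []Call{{"board_read"}, {"reference_search"}, {"countdown_advance"}}
	fmt.Println(CheckReferenceBudget(BoardControl, trace))
	fmt.Println(CheckReferenceBudget(ExplicitPlaybook, []Call{{"reference_search"}, {"reference_read"}}))
}
```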
5. Runtime Robustness
Treat session-gate and precondition failures as the primary live reliability problem now that the mechanics surface is broader. Recommended direction:
- improve precondition diagnosis before board-sensitive mechanics calls
- prefer stopping cleanly after a failed authoritative mechanic over trying an adjacent mechanic family
- keep recovery guidance corrective and narrow when a retry is actually valid
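The stop-cleanly rule above can be expressed as a small recovery policy: retry the same call only for failures that are actually retryable, and otherwise stop without reaching for a neighboring mechanic. The Go sketch below is a minimal illustration; the error values, `Outcome`, and `Resolve` are assumptions and do not describe the current runtime.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical failure classes for an authoritative mechanics call.
var (
	ErrPrecondition = errors.New("precondition not met")
	ErrSessionGate  = errors.New("session gate closed")
	ErrTransient    = errors.New("transient failure")
)

// Outcome records what the policy did, for later summaries.
type Outcome struct {
	Committed  bool
	Attempts   int
	StopReason string
}

// Resolve runs one authoritative call. Only transient failures are retried,
// and only with the same call. Precondition and session-gate failures stop
// immediately; the policy never falls through to another mechanic family.
func Resolve(call func() error, maxRetries int) Outcome {
	var out Outcome
	for {
		out.Attempts++
		err := call()
		if err == nil {
			out.Committed = true
			return out
		}
		if errors.Is(err, ErrTransient) && out.Attempts <= maxRetries {
			continue // narrow, corrective retry of the same call
		}
		out.StopReason = err.Error()
		return out
	}
}

func main() {
	blocked := Resolve(func() error {
		return fmt.Errorf("countdown advance: %w", ErrPrecondition)
	}, 2)
	fmt.Printf("%+v\n", blocked)
}
```

Recording `Attempts` and `StopReason` per turn also gives live summaries a direct measure of post-error thrash.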
Acceptance Markers
Treat this work as materially improved when:
- mechanics-heavy beats state resource changes without leaking engine terms
- at least one natural-language intent lane exists for each major mechanic family
- critique reports consistently identify unnecessary lookups and missing context
- live summaries show fewer unnecessary reference lookups
- post-error tool thrash causes fewer failures across repeated live runs
Relationship To Existing Docs
- Campaign AI Orchestration defines the runtime turn boundary and tool policy.
- Campaign AI Agent System defines instruction composition, channel discipline, and beat-oriented authoring.
- Campaign AI GM Guardrails codifies the enforceable guardrail contract for each recommendation in this doc.
- Campaign AI Evaluation Strategy defines how guardrails are tested through promptfoo scenarios and assertions.
- Promptfoo should remain an evaluation/reporting layer over the live Go harness, not a replacement for that execution path.
- Runtime-invalid Promptfoo rows should remain visible, but they should be tracked separately from model-quality failures so pass-rate decisions are not polluted by harness or provider instability.
- Daggerheart Live Mechanics Matrix is the evidence table, not the roadmap.
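To make the separation of runtime-invalid rows from model-quality failures concrete, a pass-rate calculation might exclude runtime-invalid rows from the denominator while still counting them. The Go sketch below uses hypothetical `RowStatus` and `Summary` types; it is not Promptfoo's output schema, and how rows get classified is assumed to happen upstream.

```go
package main

import "fmt"

// RowStatus is a hypothetical three-way classification of an eval row.
type RowStatus int

const (
	Pass RowStatus = iota
	ModelFail      // the model produced a wrong or low-quality turn
	RuntimeInvalid // harness or provider instability; not a model verdict
)

// Summary keeps runtime-invalid rows visible but separate.
type Summary struct {
	Pass, ModelFail, RuntimeInvalid int
}

// Summarize tallies rows by status.
func Summarize(rows []RowStatus) Summary {
	var s Summary
	for _, r := range rows {
		switch r {
		case Pass:
			s.Pass++
		case ModelFail:
			s.ModelFail++
		case RuntimeInvalid:
			s.RuntimeInvalid++
		}
	}
	return s
}

// ModelPassRate divides passes by rows that produced a model verdict.
// The bool is false when no row produced a verdict.
func (s Summary) ModelPassRate() (float64, bool) {
	valid := s.Pass + s.ModelFail
	if valid == 0 {
		return 0, false
	}
	return float64(s.Pass) / float64(valid), true
}

func main() {
	s := Summarize([]RowStatus{Pass, Pass, ModelFail, RuntimeInvalid})
	rate, ok := s.ModelPassRate()
	fmt.Printf("model pass rate: %.2f (valid=%v), runtime-invalid rows: %d\n", rate, ok, s.RuntimeInvalid)
}
```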