Witness
Every failure pinpoints the exact trace event where behavior diverged. No log hunting — go directly to the step that broke.
witness=6 → event 6 is where route_for_approval was expected but missing
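One way to picture a witness index, as a minimal sketch: compare a baseline trace with the current run and report the first position where they disagree. The `first_divergence` helper and the list-of-tool-names trace shape are assumptions for illustration, not Trajectly's internals, and the index it returns is unrelated to the witness=6 shown above.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[str], current: Sequence[str]) -> Optional[int]:
    """Return the 0-based index of the first differing event, or None if the traces match."""
    # zip_longest pads the shorter trace with None, so a missing tail also counts as a divergence.
    for index, (expected, actual) in enumerate(zip_longest(baseline, current)):
        if expected != actual:
            return index
    return None


# Hypothetical traces: the current run skips the approval step.
baseline = ["fetch_requisition", "fetch_vendor_quotes", "route_for_approval", "create_purchase_order"]
current = ["fetch_requisition", "fetch_vendor_quotes", "create_purchase_order"]
print(first_divergence(baseline, current))  # prints 2
```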
These are regressions where the final answer looks correct but the behavior is broken. Trajectly catches each one and tells you exactly where it broke.
An agent skips a required step but the final answer reads fine. A procurement agent that skips approval and goes straight to purchase order creation still outputs “Purchase order created.” Nothing in the text reveals the missing step.
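The missing-step case can be sketched in a few lines of Python, assuming a trace is simply a list of tool names in call order (a simplification; Trajectly's real trace format is not shown here). `missing_calls` is a hypothetical helper, and the tool names are taken from the contract that follows.

```python
from typing import List, Sequence

# Required tools, copied from the sequence.require list in the contract.
REQUIRED = [
    "fetch_requisition",
    "fetch_vendor_quotes",
    "route_for_approval",
    "create_purchase_order",
]


def missing_calls(trace: Sequence[str], required: Sequence[str] = REQUIRED) -> List[str]:
    """Return the required tools that never appear in the trace, in contract order."""
    called = set(trace)
    return [tool for tool in required if tool not in called]


# The final text would still read "Purchase order created." for this run.
trace = ["fetch_requisition", "fetch_vendor_quotes", "create_purchase_order"]
print(missing_calls(trace))  # prints ['route_for_approval']
```

The point of the sketch is that the check looks at the calls that were made, never at the final message.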
tools:
  allow:
    - fetch_requisition
    - fetch_vendor_quotes
    - route_for_approval
    - create_purchase_order
  deny:
    - unsafe_direct_award
sequence:
  require:
    - tool:fetch_requisition
    - tool:fetch_vendor_quotes
    - tool:route_for_approval
    - tool:create_purchase_order

REFINEMENT_BASELINE_CALL_MISSING — missing_call=route_for_approval — witness=6

The right tools called in the wrong sequence. A calendar agent sends an invite before reserving the room. “Meeting arranged” sounds correct either way.
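The ordering problem just described can be sketched as a small check. This assumes a trace is a list of tool names and that a "require before" rule means the second tool must not be called until the first has been called at least once; `order_violation` is a hypothetical helper, not Trajectly's implementation.

```python
from typing import Optional, Sequence


def order_violation(trace: Sequence[str], before: str, after: str) -> Optional[int]:
    """Return the index of the first `after` call made with no earlier `before` call, else None."""
    seen_before = False
    for index, tool in enumerate(trace):
        if tool == before:
            seen_before = True
        elif tool == after and not seen_before:
            return index
    return None


# The invite goes out before the room is reserved.
trace = ["lookup_oncall", "send_invite", "reserve_room"]
print(order_violation(trace, before="reserve_room", after="send_invite"))  # prints 1
```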
sequence:
  require:
    - tool:lookup_oncall
    - tool:reserve_room
    - tool:send_invite
  require_before:
    - before: tool:reserve_room
      after: tool:send_invite
  at_most_once:
    - tool:send_invite

CONTRACT_SEQUENCE_REQUIRE_BEFORE_VIOLATED — expected=reserve_room before send_invite — witness=4

The summary looks clean but the outbound tool-call payload contains a secret pattern. A log summarizer can produce a perfectly readable summary while the post_summary call body leaks an API key.
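A leak of this kind is invisible in the summary text but easy to see in the payload. The sketch below scans a serialized outbound payload with the secret pattern from the contract that follows; the `find_secret` helper, the payload shape, and the obviously fake key are assumptions made for illustration.

```python
import json
import re
from typing import Any, Dict, Optional

# Same pattern as the secret_patterns entry in the contract.
SECRET_PATTERN = re.compile(r"sk_live_[A-Za-z0-9_]+")


def find_secret(payload: Dict[str, Any]) -> Optional[str]:
    """Serialize an outbound payload and return the first secret-looking match, if any."""
    match = SECRET_PATTERN.search(json.dumps(payload))
    return match.group(0) if match else None


# Hypothetical post_summary body; the key is a made-up placeholder.
payload = {"channel": "#ops", "body": "Deploy ok, key=sk_live_EXAMPLE_ONLY rotated"}
print(find_secret(payload))  # prints sk_live_EXAMPLE_ONLY
```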
data_leak:
  outbound_kinds:
    - TOOL_CALL
  secret_patterns:
    - "sk_live_[A-Za-z0-9_]+"

DATA_LEAK_SECRET_PATTERN — pattern=sk_live_[A-Za-z0-9_]+ — witness=4

The agent reports success but quietly contacted a domain outside the allowlist. It fetched from an untrusted source and nobody noticed until production.
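A default-deny allowlist can be sketched with the standard library alone. This version assumes exact hostname matching, which may differ from how Trajectly treats subdomains; `is_allowed` and the second URL are hypothetical.

```python
from urllib.parse import urlparse

# Hostname taken from the allow_domains list in the contract.
ALLOW_DOMAINS = {"status.internal.example"}


def is_allowed(url: str, allow_domains=ALLOW_DOMAINS) -> bool:
    """Default deny: permit a URL only if its hostname is exactly on the allowlist."""
    host = urlparse(url).hostname  # None when the string has no network location
    return host is not None and host in allow_domains


print(is_allowed("https://status.internal.example/health"))
print(is_allowed("https://mirror.untrusted.example/status.json"))
```

Anything that fails to parse into a hostname is rejected as well, which keeps the default on the deny side.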
network:
  default: deny
  allow_domains:
    - status.internal.example

NETWORK_DOMAIN_DENIED — witness=2

A tool call completes but an argument silently violates its format contract. A dispatch token that should match a specific pattern instead contains a malformed value.
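An argument contract of this kind comes down to three checks: the key is present, the value has the right type, and the value matches the pattern. The sketch below hard-codes the rules from the contract that follows; `check_dispatch_args` and its error strings are hypothetical.

```python
import re
from typing import Any, Dict, List

# Same regex as the dispatch_token field in the contract.
TOKEN_RE = re.compile(r"^WR-[0-9]{5}$")


def check_dispatch_args(args: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations for a dispatch_war_room call (empty if valid)."""
    errors: List[str] = []
    if "dispatch_token" not in args:
        errors.append("missing required key: dispatch_token")
        return errors
    token = args["dispatch_token"]
    if not isinstance(token, str):
        errors.append("dispatch_token must be a string")
    elif not TOKEN_RE.fullmatch(token):
        # fullmatch also rejects a trailing newline, which `$` alone would tolerate.
        errors.append("dispatch_token does not match ^WR-[0-9]{5}$")
    return errors


print(check_dispatch_args({"dispatch_token": "WR-12345"}))
print(check_dispatch_args({"dispatch_token": "WR-12"}))
```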
args:
  dispatch_war_room:
    required_keys:
      - dispatch_token
    fields:
      dispatch_token:
        type: string
        regex: "^WR-[0-9]{5}$"

CONTRACT_ARGS_REGEX_VIOLATION — witness=6

Identical output, but execution cost quietly doubled. Twice the tool calls, twice the tokens. The final text gives no hint that cost regressed.
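A cost regression only shows up if usage is counted and compared against a limit. The sketch below uses the two thresholds from the spec that follows; the `Usage` record, the `budget_breaches` helper, and the sample numbers are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Usage:
    """Hypothetical per-run usage counters."""
    tool_calls: int
    tokens: int


def budget_breaches(usage: Usage, max_tool_calls: int = 3, max_tokens: int = 500) -> List[str]:
    """Return the names of every threshold the run exceeded (empty if within budget)."""
    breaches: List[str] = []
    if usage.tool_calls > max_tool_calls:
        breaches.append("max_tool_calls exceeded")
    if usage.tokens > max_tokens:
        breaches.append("max_tokens exceeded")
    return breaches


# Same final text as before, but twice the tool calls.
print(budget_breaches(Usage(tool_calls=6, tokens=480)))  # prints ['max_tool_calls exceeded']
```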
# In the .agent.yaml spec
budget_thresholds:
  max_tool_calls: 3
  max_tokens: 500

budget_breach — max_tool_calls exceeded

You don’t search through logs or guess what changed. Three tools form a complete debug loop.
Every failure pinpoints the exact trace event where behavior diverged. No log hunting — go directly to the step that broke.
witness=6 → event 6 is where route_for_approval was expected but missing
One command replays the exact failure. Deterministic: same witness, same violation, every time.
python -m trajectly repro procurement-chaos
Reduces the failing trace to the shortest proof. Instead of reading 14 events, you read 3.
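Shrinking a failing trace can be pictured as repeatedly deleting events and keeping each deletion that still reproduces the failure. The greedy one-pass version below is a generic sketch of that idea, not Trajectly's algorithm; `shrink`, the `still_fails` predicate, and the `check_calendar` and `notify_team` tool names are hypothetical.

```python
from typing import Callable, List, Sequence


def shrink(trace: Sequence[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop events one at a time, keeping each removal that preserves the failure."""
    current = list(trace)
    index = 0
    while index < len(current):
        candidate = current[:index] + current[index + 1:]
        if still_fails(candidate):
            current = candidate  # event was irrelevant to the failure; leave it out
        else:
            index += 1           # event is needed to reproduce; keep it and move on
    return current


def invite_before_room(trace: List[str]) -> bool:
    """Example failure: both calls are present and send_invite comes first."""
    if "send_invite" not in trace or "reserve_room" not in trace:
        return False
    return trace.index("send_invite") < trace.index("reserve_room")


trace = ["lookup_oncall", "check_calendar", "send_invite", "reserve_room", "notify_team"]
print(shrink(trace, invite_before_room))  # prints ['send_invite', 'reserve_room']
```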
14 events → 3 events
All six failure categories are covered by the Merge or Die arena. Run the scenarios, break them, debug them.