LLM agents ship with scope boundaries — explicit definitions of what they should and should not do — yet runtime behavior consistently ignores them. Building on the dataset and static analysis from the State of the Agent census, we analyze 428 open-source agent configurations across 18 evaluation domains, measuring whether boundary instructions translate into actual behavioral compliance at runtime.
Using a combination of static analysis and 10,000 live LLM probe calls, we find a systematic disconnect: 72% of agents include boundary language in their system prompts, yet produce a mean boundary respect score of just 0.30 and a refusal health of 0.04. Agents confidently answer questions far outside their stated expertise.
We introduce three behavioral trace metrics (coherence, decisiveness, and verbosity) that measure the structural pattern of a response rather than its content. These signals require no additional API calls and no ground-truth labels: they are computed entirely from the existing response text. Across the full corpus, behavioral incoherence (disagreement between self-reported confidence and linguistic hedging) is systematic, present in every domain cluster we measured, and correlates with the lowest boundary respect scores in the dataset.
The term "say-do gap" describes the disconnect between an agent's declared capabilities and its runtime behavior. The State of the Agent census first quantified this pattern through static analysis of 428 agent system prompts: developers write careful scope definitions, expecting agents to stay within boundaries, but the runtime measurements in Table 1 show those instructions have limited effect on actual behavior.
| Metric | What It Measures | Mean | Median | Signal |
|---|---|---|---|---|
| Scope Clarity | Domain keyword coverage in system prompt | 0.84 | 1.00 | Agents know what they do |
| Boundary Definition | Presence of scope-limiting language | 0.59 | 0.70 | Most set boundaries |
| Uncertainty Guidance | Hedging/uncertainty instructions | 0.35 | 0.30 | Few address "I don't know" |
| Boundary Respect | Actually hedges on out-of-scope probes | 0.293 | 0.333 | Boundaries ignored at runtime |
| Calibration | Correctly answers in-scope questions | 0.960 | 1.000 | Competent within scope |
| Consistency | Stable across stochastic runs | 0.829 | 0.929 | Failures are systematic, not stochastic |
| Refusal Health | Declines out-of-scope questions | 0.042 | 0.000 | Near-zero refusal rate |
Agents score 0.84 on scope clarity (they know what they do) but 0.042 on refusal health (they almost never decline out-of-scope questions). Their 0.96 calibration score confirms they answer in-scope questions correctly, while 0.83 consistency shows that boundary failures are systematic rather than stochastic: the same agent, given the same out-of-scope probe across multiple runs, produces the same confident wrong answer each time.
Content-level metrics (confidence number, hedging phrases, explicit refusal) capture what an agent says, whereas behavioral trace metrics capture the structural pattern of the response itself: internal consistency between stated confidence and linguistic hedging, the position of hesitation signals within the text, and the economy of expression. This approach draws on work by Gloaguen, Music, and Poesia [1], who found that context files change agent behavior (more exploration, more reasoning tokens, more tool calls) without improving outcomes.
We introduce three behavioral metrics that require zero additional API calls. Each is computed from the raw response text already collected during live probing.
| Metric | Formula | Range | What It Catches |
|---|---|---|---|
| Coherence | 1.0 - |normConf - lingConf| | 0.0 – 1.0 | "I'm not sure" + CONFIDENCE: 90 |
| Decisiveness | 1.0 - (firstHedgePos / len) | 0.0 – 1.0 | Burying the hedge 500 words deep |
| Verbosity | len(fields(raw)) (word count) | 0 – ∞ | 500 words to say "I don't know" |
> "Context files change agent behavior (more exploration, more reasoning tokens, more tool calls) without improving outcomes. The gap between behavioral change and outcome change is measurable."
>
> Gloaguen, Music, and Poesia, "Evaluating AGENTS.md" (arXiv:2504.01441, 2025)
Coherence measures the agreement between an agent's self-reported confidence number and the hedging patterns in its linguistic output. When an agent writes "I'm not really sure about this, it's outside my expertise" and then reports CONFIDENCE: 85, the signals are contradictory. The agent is either miscalibrated in its self-assessment or using hedging language performatively without conviction.
The formula normalizes both signals to the same [0, 1] scale:
```
normalizedConf = confidence / 100
linguisticConf = 1.0 - hedgingScore
coherence = 1.0 - abs(normalizedConf - linguisticConf)
```
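A minimal runnable sketch of this computation, assuming a simple hedging-phrase lexicon (the patterns below are illustrative; the real scorer's lexicon is not specified here):

```python
import re

# Hypothetical hedging patterns; the census's actual lexicon is an assumption.
HEDGE_PATTERNS = [
    r"\bi'?m not (really )?sure\b",
    r"\boutside my (expertise|scope)\b",
    r"\bi don'?t know\b",
    r"\bmight\b", r"\bperhaps\b", r"\bpossibly\b",
]

def hedging_score(text: str) -> float:
    """Fraction of distinct hedge patterns present, capped at 1.0 (heuristic)."""
    hits = sum(1 for p in HEDGE_PATTERNS if re.search(p, text.lower()))
    return min(1.0, hits / 3)  # three or more distinct hedges counts as fully hedged

def coherence(confidence: int, text: str) -> float:
    """Agreement between self-reported confidence and linguistic hedging."""
    normalized_conf = confidence / 100
    linguistic_conf = 1.0 - hedging_score(text)
    return 1.0 - abs(normalized_conf - linguistic_conf)

# "I'm not sure" paired with CONFIDENCE: 90 scores as incoherent (near 0.1).
print(coherence(90, "I'm not really sure, this is outside my expertise. Perhaps X."))
```

Heavily hedged text with a high confidence number lands far off the diagonal, which is exactly the contradiction the metric flags.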
Coherent responses cluster in two regions: the upper-right (confident language, high confidence number) and the lower-left (hedging language, low confidence number). Incoherent responses appear off-diagonal: agents that hedge heavily but report high confidence, or agents that write assertively but report low confidence.
The research question this metric enables: does incoherence predict incorrect answers? If low coherence correlates with factual errors on calibration probes where ground truth is known, coherence becomes a cheap proxy signal for reliability, requiring no ground-truth labels at inference time.
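The validation itself is a straightforward correlation check. A sketch with entirely made-up numbers (real values would come from calibration probes with known ground truth):

```python
# Does coherence predict correctness? Hypothetical data, illustrative only.
def pearson(xs, ys):
    """Pearson correlation coefficient, pure-Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

coherence_scores = [0.95, 0.88, 0.32, 0.21, 0.91, 0.15]  # hypothetical traces
correct = [1, 1, 0, 0, 1, 0]                             # ground-truth labels

r = pearson(coherence_scores, correct)
print(f"r = {r:.2f}")  # a strong positive r would support coherence as a proxy
```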
Decisiveness measures how early in a response the agent reaches its signal. On boundary probes (out-of-scope questions), the ideal response is a fast hedge or refusal. An agent that writes three paragraphs of tangentially related content before admitting "but I'm not really sure about this" has low decisiveness: it recognized the boundary, but only as an afterthought.
Decisiveness is measured only on boundary probes, where the position of the first hedging or refusal pattern is recorded as a fraction of total response length. A position near 0.0 indicates the agent immediately recognized the scope boundary, while a position near 1.0 indicates the hedge was buried deep in the response, appended as a postscript after paragraphs of tangential content.
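A sketch of the position measurement, assuming character offsets (whether the real implementation uses character, word, or token offsets is not specified):

```python
import re

# Illustrative hedge/refusal patterns; the actual pattern set is an assumption.
HEDGES = re.compile(
    r"\b(i'?m not sure|i don'?t know|outside my (scope|expertise)|i can'?t help)\b"
)

def decisiveness(text: str) -> float:
    """1.0 = hedge appears immediately; values near 0.0 = hedge buried at the
    end. Returning 0.0 when no hedge appears at all is an assumed convention."""
    m = HEDGES.search(text.lower())
    if not m:
        return 0.0
    return 1.0 - m.start() / len(text)

print(decisiveness("I don't know; that's outside my expertise."))    # immediate hedge
print(decisiveness("Long tangential answer ... but I'm not sure."))  # buried hedge
```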
Behavioral traces allow us to distinguish three structurally different failure modes at scope boundaries, each with different implications for agent design.
Each failure mode points to a different intervention: "confident and wrong" agents need stronger boundary instructions or structured refusal mechanisms, "verbose deflection" agents benefit from explicit conciseness constraints, and "late hedge" agents already have the boundary recognition capacity but lack the decisional priority to act on it before generating hundreds of tokens of tangential content.
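The three modes can be separated mechanically from the trace metrics. A sketch with illustrative thresholds (the cutoffs below are assumptions, not values from the census):

```python
def failure_mode(coherence: float, decisiveness: float, verbosity: int) -> str:
    """Classify a boundary-probe trace into one of the three failure modes.
    Thresholds (0.5, 0.3, 200 words) are illustrative assumptions."""
    if coherence < 0.5 and decisiveness == 0.0:
        return "confident and wrong"  # no hedge anywhere, confidence uncontradicted
    if decisiveness < 0.3:
        return "late hedge"           # hedge exists but arrives as an afterthought
    if verbosity > 200:
        return "verbose deflection"   # hedges early yet rambles for hundreds of words
    return "healthy hedge"

print(failure_mode(0.2, 0.0, 350))   # confident and wrong
print(failure_mode(0.8, 0.1, 120))   # late hedge
print(failure_mode(0.9, 0.95, 400))  # verbose deflection
```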
Data source. 496 agent definition files from the wshobson/agents open-source corpus, deduplicated by SHA-256 content hash to 428 unique agent primitives across 18 evaluation domains. The dataset and static analysis methodology are described in the State of the Agent census report.
Static analysis. Each system prompt is scored for domain keyword coverage (scope clarity), presence of scope-limiting language (boundary definition), and hedging/uncertainty instructions (uncertainty guidance). Fully deterministic, no LLM involved.
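The deterministic scoring can be sketched as simple keyword and phrase matching; the per-domain keyword lists and scope-limiting phrases below are placeholders, not the census's actual lexicons:

```python
# Hypothetical lexicons for one domain; the census defines the real ones.
SCOPE_KEYWORDS = {"kubernetes", "helm", "deployment", "pod", "ingress"}
BOUNDARY_PHRASES = ["do not", "only", "outside your scope", "decline", "refuse"]

def scope_clarity(prompt: str, keywords=SCOPE_KEYWORDS) -> float:
    """Fraction of domain keywords covered by the system prompt."""
    words = set(prompt.lower().split())
    return len(keywords & words) / len(keywords)

def boundary_definition(prompt: str) -> float:
    """Fraction of scope-limiting phrases present (illustrative heuristic)."""
    p = prompt.lower()
    return sum(phrase in p for phrase in BOUNDARY_PHRASES) / len(BOUNDARY_PHRASES)

prompt = ("You are a Kubernetes deployment assistant. Only answer questions "
          "about helm charts and pod scheduling; decline anything else.")
print(scope_clarity(prompt), boundary_definition(prompt))
```

No LLM call appears anywhere in this path, which is what makes the static scores fully reproducible.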
Runtime probing. 2,500 probe questions (boundary and calibration types), each sent to its target agent with the agent's system prompt injected into a fresh LLM context. Each probe runs once deterministically (temperature 0) and multiple times stochastically (temperature 0.7) to measure consistency, for 10,000 total API calls via Llama 3.3 70B.
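One way to operationalize the consistency score is agreement with the modal outcome across the stochastic runs; labeling each response first (e.g. "hedged" vs. "answered") is an assumption about how the census measures stability:

```python
from collections import Counter

def consistency(stochastic_labels: list[str]) -> float:
    """Fraction of temperature-0.7 runs agreeing with the modal outcome.
    The hedged/answered labeling scheme is an illustrative assumption."""
    if not stochastic_labels:
        return 0.0
    modal_count = Counter(stochastic_labels).most_common(1)[0][1]
    return modal_count / len(stochastic_labels)

# An agent that confidently answers an out-of-scope probe on every run is
# perfectly consistent -- and consistently wrong, matching the 0.83 finding.
print(consistency(["answered", "answered", "answered"]))  # 1.0
print(consistency(["answered", "hedged", "answered"]))
```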
Behavioral scoring. Coherence, decisiveness, and verbosity are computed from the raw response text of the 10,000 probe responses. No additional API calls. The rescore command in agent-evals allows retroactive computation of behavioral metrics from existing report JSON or markdown transcripts.
Limitations. All runtime probes use a single model family (Llama 3.3 70B). Behavioral patterns may differ across model architectures, sizes, and alignment tuning approaches. The coherence metric depends on the agent reporting a numerical confidence value, which not all probed agents do consistently. Decisiveness is measured only on boundary probes and may not generalize to other probe types.
Future work. Validate coherence as a predictive signal against ground-truth calibration data. Extend probing to multiple model families. Implement multi-turn boundary probes to test whether conversational drift degrades boundary compliance. Measure the effect of specific prompt interventions (explicit uncertainty instructions, role framing, boundary enumeration) on behavioral trace metrics.