
Confidence Without Calibration

How LLM agents fail at scope boundaries, and what behavioral traces reveal about the gap between instruction and execution
Brandon Huey · April 2026
428 agents · 10,000 LLM calls · 72% claim boundaries · 0.04 mean refusal rate

Abstract

LLM agents ship with scope boundaries — explicit definitions of what they should and should not do — yet runtime behavior consistently ignores them. Building on the dataset and static analysis from the State of the Agent census, we analyze 428 open-source agent configurations across 18 evaluation domains, measuring whether boundary instructions translate into actual behavioral compliance at runtime.

Using a combination of static analysis and 10,000 live LLM probe calls, we find a systematic disconnect: 72% of agents include boundary language in their system prompts, yet produce a mean boundary respect score of just 0.30 and a refusal health of 0.04. Agents confidently answer questions far outside their stated expertise.

We introduce three behavioral trace metrics (coherence, decisiveness, and verbosity) that measure the structural pattern of a response rather than its content. These signals require no additional API calls and no ground-truth labels; they are computed entirely from response text already collected. Across the full corpus, behavioral incoherence (disagreement between self-reported confidence and linguistic hedging) is systematic: it appears in every domain cluster we measured and correlates with the lowest boundary respect scores in the dataset.

Core finding: The gap between what agents are told to do and what they actually do is measurable, consistent, and visible in behavioral traces that are cheap to compute. Rather than refusing out-of-scope questions, agents answer them with high stated confidence and no linguistic hesitation.

The Say-Do Gap

The term "say-do gap" describes the disconnect between an agent's declared capabilities and its runtime behavior. The State of the Agent census first quantified this pattern through static analysis of 428 agent system prompts: developers write careful scope definitions, expecting agents to stay within boundaries, but the runtime measurements in Table 1 show those instructions have limited effect on actual behavior.

Table 1 — Static Analysis vs. Runtime Behavior (n=428)

| Metric | What It Measures | Mean | Median | Signal |
|---|---|---|---|---|
| Scope Clarity | Domain keyword coverage in system prompt | 0.84 | 1.00 | Agents know what they do |
| Boundary Definition | Presence of scope-limiting language | 0.59 | 0.70 | Most set boundaries |
| Uncertainty Guidance | Hedging/uncertainty instructions | 0.35 | 0.30 | Few address "I don't know" |
| Boundary Respect | Actually hedges on out-of-scope probes | 0.293 | 0.333 | Boundaries ignored at runtime |
| Calibration | Correctly answers in-scope questions | 0.960 | 1.000 | Competent within scope |
| Consistency | Stable across stochastic runs | 0.829 | 0.929 | Failures are systematic, not stochastic |
| Refusal Health | Declines out-of-scope questions | 0.042 | 0.000 | Near-zero refusal rate |

Agents score 0.84 on scope clarity (they know what they do) but 0.042 on refusal health (they almost never decline out-of-scope questions). Their 0.96 calibration score confirms they answer in-scope questions correctly, while 0.83 consistency shows that boundary failures are systematic rather than stochastic: the same agent, given the same out-of-scope probe across multiple runs, produces the same confident wrong answer each time.

Figure 1: The Say-Do Gap
- Boundary Claim Rate (Static): 72% of agents include scope-limiting language
- Boundary Respect (Runtime): 30% mean score on out-of-scope probes
- Refusal Rate: 4% of out-of-scope probes refused
- Consistency: 83% (failures are systematic, not stochastic)

Behavioral Trace Metrics

Content-level metrics (confidence number, hedging phrases, explicit refusal) capture what an agent says, whereas behavioral trace metrics capture the structural pattern of the response itself: internal consistency between stated confidence and linguistic hedging, the position of hesitation signals within the text, and the economy of expression. This approach draws on work by Gloaguen, Music, and Poesia [1], who found that context files change agent behavior (more exploration, more reasoning tokens, more tool calls) without improving outcomes.

We introduce three behavioral metrics that require zero additional API calls. Each is computed from the raw response text already collected during live probing.

Table 2 — Behavioral Trace Metrics

| Metric | Formula | Range | What It Catches |
|---|---|---|---|
| Coherence | 1.0 - abs(normConf - lingConf) | 0.0 – 1.0 | "I'm not sure" + CONFIDENCE: 90 |
| Decisiveness | 1.0 - (firstHedgePos / len) | 0.0 – 1.0 | Burying the hedge 500 words deep |
| Verbosity | len(fields(raw)), i.e. word count | 0 – ∞ | 500 words to say "I don't know" |
"Context files change agent behavior (more exploration, more reasoning tokens, more tool calls) without improving outcomes. The gap between behavioral change and outcome change is measurable."
Gloaguen, Music, and Poesia, "Evaluating AGENTS.md" (arXiv:2504.01441, 2025)

Confidence-Hedging Coherence

Coherence measures the agreement between an agent's self-reported confidence number and the hedging patterns in its linguistic output. When an agent writes "I'm not really sure about this, it's outside my expertise" and then reports CONFIDENCE: 85, the signals are contradictory. The agent is either miscalibrated in its self-assessment or using hedging language performatively without conviction.

The formula normalizes both signals to the same [0, 1] scale:

Definition
normalizedConf = confidence / 100
linguisticConf = 1.0 - hedgingScore
coherence = 1.0 - abs(normalizedConf - linguisticConf)
Figure 2: Coherence Score Distribution

Coherent responses cluster in two regions: the upper-right (confident language, high confidence number) and the lower-left (hedging language, low confidence number). Incoherent responses appear off-diagonal: agents that hedge heavily but report high confidence, or agents that write assertively but report low confidence.

The research question this metric enables: does incoherence predict incorrect answers? If low coherence correlates with factual errors on calibration probes where ground truth is known, coherence becomes a cheap proxy signal for reliability, requiring no ground-truth labels at inference time.

Decisiveness

Decisiveness measures how early in a response the agent reaches its signal. On boundary probes (out-of-scope questions), the ideal response is a fast hedge or refusal. An agent that writes three paragraphs of tangentially related content before admitting "but I'm not really sure about this" has low decisiveness: it recognized the boundary, but only as an afterthought.

Figure 3: Response Structure, Where Agents Place Their Signal

Decisiveness is measured only on boundary probes, where the position of the first hedging or refusal pattern is recorded as a fraction of total response length. A position near 0.0 indicates the agent immediately recognized the scope boundary, while a position near 1.0 indicates the hedge was buried deep in the response, appended as a postscript after paragraphs of tangential content.
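A minimal sketch of that measurement, assuming the first hedge is located with a regex over the raw text (the marker list is a stand-in for the framework's actual patterns):

```python
import re

HEDGE_RE = re.compile(
    r"i'?m not (really )?sure|i don'?t know|outside my (expertise|scope)"
    r"|i can'?t help",
    re.IGNORECASE,
)

def decisiveness(raw_response: str) -> float:
    """1 - (offset of first hedge / response length), in [0, 1].
    Near 1.0: the hedge leads the response. Near 0.0: the hedge is
    buried at the end. A response with no hedge at all scores 0.0 here,
    since on a boundary probe the agent never signaled the boundary
    (an assumed convention, not necessarily the framework's)."""
    match = HEDGE_RE.search(raw_response)
    if match is None or not raw_response:
        return 0.0
    return 1.0 - (match.start() / len(raw_response))
```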

Interpretation: On boundary probes, high decisiveness indicates the agent recognized the scope boundary immediately rather than discovering it mid-response. Low decisiveness combined with high verbosity indicates the agent is generating content on a topic it should be declining, consuming tokens and user attention before arriving at the hedge it could have led with.

Failure Mode Taxonomy

Behavioral traces allow us to distinguish three structurally different failure modes at scope boundaries, each with different implications for agent design.

Confident & Wrong
The agent answers immediately and assertively outside its scope, with high confidence, low hedging, and low verbosity. Nothing in the response signals that the question falls outside the agent's domain.
coherence: high, decisiveness: n/a, verbosity: low
Verbose Deflection
The agent writes hundreds of words with moderate hedging spread throughout and moderate confidence, consuming tokens without ever committing to an answer or a refusal.
coherence: low, decisiveness: low, verbosity: high
Late Hedge
The agent provides a full answer at high verbosity, then appends a hedging disclaimer in the final sentences. The boundary recognition exists but arrives as a structural afterthought.
coherence: moderate, decisiveness: low, verbosity: high
Figure 4: Failure Mode Landscape (Coherence vs. Decisiveness, sized by word count)

Each failure mode points to a different intervention: "confident and wrong" agents need stronger boundary instructions or structured refusal mechanisms, "verbose deflection" agents benefit from explicit conciseness constraints, and "late hedge" agents already have the boundary recognition capacity but lack the decisional priority to act on it before generating hundreds of tokens of tangential content.
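The taxonomy can be operationalized as simple thresholds over the three traces. The cut-offs below are illustrative assumptions, not the paper's calibrated values:

```python
def classify_failure(coherence: float, decisiveness: float,
                     verbosity: int, long_words: int = 200) -> str:
    """Map behavioral traces on an unrefused out-of-scope probe to one
    of the three failure modes. All thresholds are illustrative."""
    if verbosity >= long_words and decisiveness < 0.3:
        # Hedge buried late vs. spread thinly with contradictory signals
        return "late_hedge" if coherence >= 0.5 else "verbose_deflection"
    if coherence >= 0.7 and verbosity < long_words:
        return "confident_and_wrong"
    return "unclassified"
```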

Methodology

Data source. 496 agent definition files from the wshobson/agents open-source corpus, deduplicated by SHA-256 content hash to 428 unique agent primitives across 18 evaluation domains. The dataset and static analysis methodology are described in the State of the Agent census report.

Static analysis. Each system prompt is scored for domain keyword coverage (scope clarity), presence of scope-limiting language (boundary definition), and hedging/uncertainty instructions (uncertainty guidance). Fully deterministic, no LLM involved.

Runtime probing. 2,500 probe questions (boundary and calibration types), each delivered with the target agent's system prompt injected into a fresh LLM context. Each probe runs once deterministically (temperature 0) and three times stochastically (temperature 0.7) to measure consistency, for 10,000 total API calls via Llama 3.3 70B.
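The exact consistency formula is defined in the census methodology; one plausible sketch, assuming runs are compared pairwise, scores each probe by mean pairwise agreement of its runs' hedge/refuse outcomes:

```python
from itertools import combinations

def consistency(run_outcomes: list[bool]) -> float:
    """Mean pairwise agreement across the runs of one probe, where each
    outcome records whether that run hedged or refused. 1.0 means the
    agent behaved identically every run. A sketch, not the census's
    exact formula."""
    pairs = list(combinations(run_outcomes, 2))
    if not pairs:
        return 1.0  # a single run is trivially self-consistent
    return sum(a == b for a, b in pairs) / len(pairs)
```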

Behavioral scoring. Coherence, decisiveness, and verbosity are computed from the raw response text of the 10,000 probe responses. No additional API calls. The rescore command in agent-evals allows retroactive computation of behavioral metrics from existing report JSON or markdown transcripts.

Pipeline
Source files scanned: 496
Unique agents (post-dedup): 428
Evaluation domains: 18
Probes generated: 2,500
Total LLM calls: 10,000
Model: Llama 3.3 70B
Stochastic runs per probe: 3
Pairwise overlaps computed: 74,714
Behavioral metrics: 3 (zero API cost)

Limitations. All runtime probes use a single model family (Llama 3.3 70B). Behavioral patterns may differ across model architectures, sizes, and alignment tuning approaches. The coherence metric depends on the agent reporting a numerical confidence value, which not all probed agents do consistently. Decisiveness is measured only on boundary probes and may not generalize to other probe types.

Conclusions

Findings
1
The say-do gap is systematic. 72% of agents claim boundaries in their system prompts, but the mean boundary respect score at runtime is 0.30 and refusal health is 0.04, indicating that boundary instructions have minimal effect on actual runtime behavior.
2
Behavioral signals distinguish failure modes that content-level evaluation misses. Hedging patterns, confidence positioning, and response length separate structurally different failures that pass/fail scoring conflates: an agent that confidently executes out of scope is a different problem from one that hedges its way through a valid request, yet both may receive identical scores.
3
Failure modes are distinguishable. Agents fail at boundaries in at least three structurally different ways, each suggesting different interventions. A uniform response such as "add more boundary language" cannot address all three patterns, because the underlying behavioral mechanisms differ.
4
Coherence as a proxy for reliability warrants further validation. If confidence-hedging incoherence correlates with factual errors on calibration probes, it becomes a deployable signal: flag responses where the agent's words and numbers disagree, without requiring reference answers at inference time.

Future work. Validate coherence as a predictive signal against ground-truth calibration data. Extend probing to multiple model families. Implement multi-turn boundary probes to test whether conversational drift degrades boundary compliance. Measure the effect of specific prompt interventions (explicit uncertainty instructions, role framing, boundary enumeration) on behavioral trace metrics.

References

  1. Gloaguen, E., Music, L., and Poesia, R. "Evaluating AGENTS.md." arXiv:2504.01441, 2025. arxiv.org/abs/2504.01441
  2. Hobson, W. S. "agents: Open-source collection of LLM agent system prompts." GitHub, 2024. github.com/wshobson/agents
  3. Huey, B. "State of the Agent: A Census of 428 LLM Agent Configurations." Thinkwright, 2026. thinkwright.ai/agent-census
  4. Huey, B. "agent-evals: Pluggable evaluation framework for LLM agent boundary compliance." GitHub, 2025. thinkwright.ai/agent-evals