
Confidence Without Calibration

How LLM agents fail at scope boundaries, and what behavioral traces reveal about the gap between instruction and execution
Brandon Huey · April 2026
428 agents · 10,000 LLM calls · 72% claim boundaries · 0.04 mean refusal rate

Abstract

LLM agents ship with scope boundaries — explicit definitions of what they should and should not do — yet runtime behavior consistently ignores them. Building on the dataset and static analysis from the State of the Agent census, we analyze 428 open-source agent configurations across 18 evaluation domains, measuring whether boundary instructions translate into actual behavioral compliance at runtime.

Using a combination of static analysis and 10,000 live LLM probe calls, we find a systematic disconnect: 72% of agents include boundary language in their system prompts, yet produce a mean boundary respect score of just 0.30 and a refusal health of 0.04. Agents confidently answer questions far outside their stated expertise.

We introduce three behavioral trace metrics (coherence, decisiveness, and verbosity) that measure the structural pattern of a response rather than its content. These signals require no additional API calls and no ground-truth labels; they are computed entirely from response text already collected. Across the full corpus, behavioral incoherence (disagreement between self-reported confidence and linguistic hedging) is systematic: it appears in every domain cluster we measured and correlates with the lowest boundary respect scores in the dataset.

Core finding: The gap between what agents are told to do and what they actually do is measurable, consistent, and visible in behavioral traces that are cheap to compute. Rather than refusing out-of-scope questions, agents answer them with high stated confidence and no linguistic hesitation.

The Say-Do Gap

The term "say-do gap" describes the disconnect between an agent's declared capabilities and its runtime behavior. The State of the Agent census first quantified this pattern through static analysis of 428 agent system prompts: developers write careful scope definitions, expecting agents to stay within boundaries, but the runtime measurements in Table 1 show those instructions have limited effect on actual behavior.

Table 1 — Static Analysis vs. Runtime Behavior (n=428)

| Metric | What It Measures | Mean | Median | Signal |
|---|---|---|---|---|
| Scope Clarity | Domain keyword coverage in system prompt | 0.84 | 1.00 | Agents know what they do |
| Boundary Definition | Presence of scope-limiting language | 0.59 | 0.70 | Most set boundaries |
| Uncertainty Guidance | Hedging/uncertainty instructions | 0.35 | 0.30 | Few address "I don't know" |
| Boundary Respect | Actually hedges on out-of-scope probes | 0.293 | 0.333 | Boundaries ignored at runtime |
| Calibration | Correctly answers in-scope questions | 0.960 | 1.000 | Competent within scope |
| Consistency | Stable across stochastic runs | 0.829 | 0.929 | Failures are systematic, not stochastic |
| Refusal Health | Declines out-of-scope questions | 0.042 | 0.000 | Near-zero refusal rate |

Agents score 0.84 on scope clarity (they know what they do) but 0.042 on refusal health (they almost never decline out-of-scope questions). Their 0.96 calibration score confirms they answer in-scope questions correctly, while 0.83 consistency shows that boundary failures are systematic rather than stochastic: the same agent, given the same out-of-scope probe across multiple runs, produces the same confident wrong answer each time.

Figure 1: The Say-Do Gap
- Boundary Claim Rate (Static): 72% of agents include scope-limiting language
- Boundary Respect (Runtime): 30% mean score on out-of-scope probes
- Refusal Rate: 4% of out-of-scope probes refused
- Consistency: 83% (failures are systematic, not stochastic)

Behavioral Trace Metrics

Content-level metrics (confidence number, hedging phrases, explicit refusal) capture what an agent says, whereas behavioral trace metrics capture the structural pattern of the response itself: internal consistency between stated confidence and linguistic hedging, the position of hesitation signals within the text, and the economy of expression. This approach draws on work by Gloaguen, Music, and Poesia [1], who found that context files change agent behavior (more exploration, more reasoning tokens, more tool calls) without improving outcomes.

We introduce three behavioral metrics that require zero additional API calls. Each is computed from the raw response text already collected during live probing.

Table 2 — Behavioral Trace Metrics

| Metric | Formula | Range | What It Catches |
|---|---|---|---|
| Coherence | 1.0 - abs(normConf - lingConf) | 0.0 – 1.0 | "I'm not sure" + CONFIDENCE: 90 |
| Decisiveness | 1.0 - (firstHedgePos / len) | 0.0 – 1.0 | Burying the hedge 500 words deep |
| Verbosity | len(fields(raw)), i.e. word count | 0 – ∞ | 500 words to say "I don't know" |
"Context files change agent behavior (more exploration, more reasoning tokens, more tool calls) without improving outcomes. The gap between behavioral change and outcome change is measurable."
Gloaguen, Music, and Poesia, "Evaluating AGENTS.md" (arXiv:2504.01441, 2025)

Confidence-Hedging Coherence

Coherence measures the agreement between an agent's self-reported confidence number and the hedging patterns in its linguistic output. When an agent writes "I'm not really sure about this, it's outside my expertise" and then reports CONFIDENCE: 85, the signals are contradictory. The agent is either miscalibrated in its self-assessment or using hedging language performatively without conviction.

The formula normalizes both signals to the same [0, 1] scale:

Definition
normalizedConf = confidence / 100
linguisticConf = 1.0 - hedgingScore
coherence = 1.0 - abs(normalizedConf - linguisticConf)
Figure 2: Coherence Score Distribution

Coherent responses cluster in two regions: the upper-right (confident language, high confidence number) and the lower-left (hedging language, low confidence number). Incoherent responses appear off-diagonal: agents that hedge heavily but report high confidence, or agents that write assertively but report low confidence.

The research question this metric enables: does incoherence predict incorrect answers? If low coherence correlates with factual errors on calibration probes where ground truth is known, coherence becomes a cheap proxy signal for reliability, requiring no ground-truth labels at inference time.

Decisiveness

Decisiveness measures how early in a response the agent reaches its signal. On boundary probes (out-of-scope questions), the ideal response is a fast hedge or refusal. An agent that writes three paragraphs of tangentially related content before admitting "but I'm not really sure about this" has low decisiveness: it recognized the boundary, but only as an afterthought.

Figure 3: Response Structure, Where Agents Place Their Signal

Decisiveness is measured only on boundary probes, where the position of the first hedging or refusal pattern is recorded as a fraction of total response length. A position near 0.0 indicates the agent immediately recognized the scope boundary, while a position near 1.0 indicates the hedge was buried deep in the response, appended as a postscript after paragraphs of tangential content.
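A minimal sketch of that measurement, assuming the first hedge is located with a regex over the raw text (the marker list is a stand-in for the framework's actual patterns):

```python
import re

HEDGE_RE = re.compile(
    r"i'?m not (really )?sure|i don'?t know|outside my (expertise|scope)"
    r"|i can'?t help",
    re.IGNORECASE,
)

def decisiveness(raw_response: str) -> float:
    """1 - (offset of first hedge / response length), in [0, 1].
    Near 1.0: the hedge leads the response. Near 0.0: the hedge is
    buried at the end. A response with no hedge at all scores 0.0 here,
    since on a boundary probe the agent never signaled the boundary
    (an assumed convention, not necessarily the framework's)."""
    match = HEDGE_RE.search(raw_response)
    if match is None or not raw_response:
        return 0.0
    return 1.0 - (match.start() / len(raw_response))
```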

Interpretation: On boundary probes, high decisiveness indicates the agent recognized the scope boundary immediately rather than discovering it mid-response. Low decisiveness combined with high verbosity indicates the agent is generating content on a topic it should be declining, consuming tokens and user attention before arriving at the hedge it could have led with.

Failure Mode Taxonomy

Behavioral traces allow us to distinguish three structurally different failure modes at scope boundaries, each with different implications for agent design.

Confident & Wrong
The agent answers immediately and assertively outside its scope, with high confidence, low hedging, and low verbosity. Nothing in the response signals that the question falls outside the agent's domain.
coherence: high, decisiveness: n/a, verbosity: low
Verbose Deflection
The agent writes hundreds of words with moderate hedging spread throughout and moderate confidence, consuming tokens without ever committing to an answer or a refusal.
coherence: low, decisiveness: low, verbosity: high
Late Hedge
The agent provides a full answer at high verbosity, then appends a hedging disclaimer in the final sentences. The boundary recognition exists but arrives as a structural afterthought.
coherence: moderate, decisiveness: low, verbosity: high
Figure 4: Failure Mode Landscape (Coherence vs. Decisiveness, sized by word count)

Each failure mode points to a different intervention: "confident and wrong" agents need stronger boundary instructions or structured refusal mechanisms, "verbose deflection" agents benefit from explicit conciseness constraints, and "late hedge" agents already have the boundary recognition capacity but lack the decisional priority to act on it before generating hundreds of tokens of tangential content.
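The taxonomy can be operationalized as simple thresholds over the three traces. The cut-offs below are illustrative assumptions, not the paper's calibrated values:

```python
def classify_failure(coherence: float, decisiveness: float,
                     verbosity: int, long_words: int = 200) -> str:
    """Map behavioral traces on an unrefused out-of-scope probe to one
    of the three failure modes. All thresholds are illustrative."""
    if verbosity >= long_words and decisiveness < 0.3:
        # Hedge buried late vs. spread thinly with contradictory signals
        return "late_hedge" if coherence >= 0.5 else "verbose_deflection"
    if coherence >= 0.7 and verbosity < long_words:
        return "confident_and_wrong"
    return "unclassified"
```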

Methodology

Data source. 496 agent definition files from the wshobson/agents open-source corpus, deduplicated by SHA-256 content hash to 428 unique agent primitives across 18 evaluation domains. The dataset and static analysis methodology are described in the State of the Agent census report.

Static analysis. Each system prompt is scored for domain keyword coverage (scope clarity), presence of scope-limiting language (boundary definition), and hedging/uncertainty instructions (uncertainty guidance). Fully deterministic, no LLM involved.

Runtime probing. 2,500 probe questions (boundary and calibration types), each delivered with the target agent's system prompt injected into a fresh LLM context. Each probe runs once deterministically (temperature 0) and three times stochastically (temperature 0.7) to measure consistency, for 10,000 total API calls via Llama 3.3 70B.
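The exact consistency formula is defined in the census methodology; one plausible sketch, assuming runs are compared pairwise, scores each probe by mean pairwise agreement of its runs' hedge/refuse outcomes:

```python
from itertools import combinations

def consistency(run_outcomes: list[bool]) -> float:
    """Mean pairwise agreement across the runs of one probe, where each
    outcome records whether that run hedged or refused. 1.0 means the
    agent behaved identically every run. A sketch, not the census's
    exact formula."""
    pairs = list(combinations(run_outcomes, 2))
    if not pairs:
        return 1.0  # a single run is trivially self-consistent
    return sum(a == b for a, b in pairs) / len(pairs)
```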

Behavioral scoring. Coherence, decisiveness, and verbosity are computed from the raw response text of the 10,000 probe responses. No additional API calls. The rescore command in agent-evals allows retroactive computation of behavioral metrics from existing report JSON or markdown transcripts.

Pipeline
Source files scanned: 496
Unique agents (post-dedup): 428
Evaluation domains: 18
Probes generated: 2,500
Total LLM calls: 10,000
Model: Llama 3.3 70B
Stochastic runs per probe: 3
Pairwise overlaps computed: 74,714
Behavioral metrics: 3 (zero API cost)

Limitations. All runtime probes use a single model family (Llama 3.3 70B). Behavioral patterns may differ across model architectures, sizes, and alignment tuning approaches. The coherence metric depends on the agent reporting a numerical confidence value, which not all probed agents do consistently. Decisiveness is measured only on boundary probes and may not generalize to other probe types.

Conclusions

Findings
1
The say-do gap is systematic. 72% of agents claim boundaries in their system prompts, but the mean boundary respect score at runtime is 0.30 and refusal health is 0.04, indicating that boundary instructions have minimal effect on actual runtime behavior.
2
Behavioral signals distinguish failure modes that content-level evaluation misses. Hedging patterns, confidence positioning, and response length separate structurally different failures that pass/fail scoring conflates: an agent that confidently executes out of scope is a different problem from one that hedges its way through a valid request, yet both may receive identical scores.
3
Failure modes are distinguishable. Agents fail at boundaries in at least three structurally different ways, each suggesting different interventions. A uniform response such as "add more boundary language" cannot address all three patterns, because the underlying behavioral mechanisms differ.
4
Coherence as a proxy for reliability warrants further validation. If confidence-hedging incoherence correlates with factual errors on calibration probes, it becomes a deployable signal: flag responses where the agent's words and numbers disagree, without requiring reference answers at inference time.

Future work. Validate coherence as a predictive signal against ground-truth calibration data. Extend probing to multiple model families. Implement multi-turn boundary probes to test whether conversational drift degrades boundary compliance. Measure the effect of specific prompt interventions (explicit uncertainty instructions, role framing, boundary enumeration) on behavioral trace metrics.

References

  1. Gloaguen, E., Music, L., and Poesia, R. "Evaluating AGENTS.md." arXiv:2504.01441, 2025. arxiv.org/abs/2504.01441
  2. Hobson, W. S. "agents: Open-source collection of LLM agent system prompts." GitHub, 2024. github.com/wshobson/agents
  3. Huey, B. "State of the Agent: A Census of 428 LLM Agent Configurations." Thinkwright, 2026. thinkwright.ai/agent-census
  4. Huey, B. "agent-evals: Pluggable evaluation framework for LLM agent boundary compliance." GitHub, 2025. thinkwright.ai/agent-evals