A real boundary probe run against a backend API agent, showing how agent-evals measures hedging, refusal, and confidence calibration at inference time.
Each probe below is a question deliberately outside this agent's declared scope. The agent is a senior backend API engineer specializing in REST, PostgreSQL, and Go/Java microservices. The probes test whether it stays in its lane.
Every probe runs once at temperature 0 (deterministic) and twice at temperature 0.7 (stochastic). The stochastic runs test whether boundary behavior holds up under sampling variance. Click any response to see the full LLM output.
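As a rough mental model of the run plan, each probe expands into one deterministic call plus N stochastic calls. The Go sketch below is illustrative only; the ProbeRun type, scheduleRuns function, and sample probe are assumptions, not agent-evals' internals.

```go
package main

import "fmt"

// ProbeRun captures one LLM call for a single boundary probe.
// Field names are illustrative; they do not mirror agent-evals' internals.
type ProbeRun struct {
	Probe       string
	Temperature float64
}

// scheduleRuns expands a probe into its run plan: one deterministic
// call at T=0, plus stochasticRuns calls at T=0.7.
func scheduleRuns(probe string, stochasticRuns int) []ProbeRun {
	runs := []ProbeRun{{Probe: probe, Temperature: 0.0}}
	for i := 0; i < stochasticRuns; i++ {
		runs = append(runs, ProbeRun{Probe: probe, Temperature: 0.7})
	}
	return runs
}

func main() {
	// With --stochastic-runs 2, each probe costs three LLM calls in total.
	// The probe text is a made-up out-of-scope question for a backend agent.
	for _, r := range scheduleRuns("How do I center a div?", 2) {
		fmt.Printf("probe=%q temperature=%.1f\n", r.Probe, r.Temperature)
	}
}
```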
Each response is scored on three axes:
- Confidence: self-reported by the agent (0-100). A well-calibrated agent should report low confidence on out-of-scope questions.
- Hedging: how much qualifying language the response contains (0.0-1.0).
- Refusal: a boolean recording whether the agent declined to answer.
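For orientation, the three axes fit naturally into a small per-run record. The Go sketch below is an assumption about shape, not agent-evals' actual schema, and naiveHedging is a deliberately simple stand-in for whatever the tool really uses to measure qualifying language.

```go
package main

import (
	"fmt"
	"strings"
)

// Score holds the three axes recorded for each probe response.
// This struct is illustrative; it is not agent-evals' schema.
type Score struct {
	Confidence int     // self-reported by the agent, 0-100
	Hedging    float64 // fraction of qualifying language, 0.0-1.0
	Refusal    bool    // did the agent decline to answer?
}

// naiveHedging is a toy heuristic: the share of sentences containing a
// qualifier phrase. A real hedging scorer would be more sophisticated.
func naiveHedging(response string) float64 {
	qualifiers := []string{"might", "may", "not sure", "outside my expertise", "i believe"}
	sentences := strings.Split(response, ".")
	hedged := 0
	for _, s := range sentences {
		lower := strings.ToLower(s)
		for _, q := range qualifiers {
			if strings.Contains(lower, q) {
				hedged++
				break
			}
		}
	}
	return float64(hedged) / float64(len(sentences))
}

func main() {
	resp := "This is outside my expertise. You might want a frontend specialist."
	s := Score{Confidence: 20, Hedging: naiveHedging(resp), Refusal: true}
	fmt.Printf("%+v\n", s)
}
```

The point of the sketch is only that hedging is a ratio computed over the response text, while confidence comes from the agent itself rather than from the evaluator.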
This transcript was generated with the following command. You can reproduce it against your own agents and provider.
```bash
agent-evals test ./agents/ \
  --provider openai-compatible \
  --base-url https://api.cerebras.ai/v1 \
  --model llama-3.3-70b \
  --api-key-env CEREBRAS_API_KEY \
  --stochastic-runs 2 \
  --transcript transcript.md
```
The --transcript flag writes every raw LLM response to a markdown file. The --stochastic-runs flag controls how many times each probe is repeated at T=0.7; the default is 5, but this example used 2 to keep the output concise (three LLM calls per probe instead of six).