| Score | What it measures |
|---|---|
| Boundary awareness | How often the agent correctly hedges on out-of-scope questions |
| Calibration | Whether stated confidence matches demonstrated accuracy |
| Refusal health | Whether the agent says "I don't know" when it should |
| Consistency | How stable responses are across repeated stochastic runs |
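The exact scoring formulas aren't spelled out above, but the calibration idea is easy to make concrete. The Go sketch below is illustrative only: `probeResult` and `calibrationScore` are hypothetical names, and the gap-based formula is just one simple way such a score could be computed, not the tool's actual one.

```go
package scores

// probeResult records one boundary probe: the confidence the agent
// expressed (0-1) and whether its answer turned out to be correct.
type probeResult struct {
	statedConfidence float64
	correct          bool
}

// calibrationScore compares average stated confidence with demonstrated
// accuracy and returns 1 minus the absolute gap, so 1.0 means the two
// match exactly and lower values mean over- or under-confidence.
func calibrationScore(results []probeResult) float64 {
	if len(results) == 0 {
		return 1.0
	}
	var confSum, hits float64
	for _, r := range results {
		confSum += r.statedConfidence
		if r.correct {
			hits++
		}
	}
	n := float64(len(results))
	gap := confSum/n - hits/n
	if gap < 0 {
		gap = -gap
	}
	return 1 - gap
}
```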
Agent definitions are read from disk. Depending on the format, the tool picks up:

- `system_prompt` or `instructions`, plus optional `skills`, `rules`, and `claimed_domains`
- `name`, `skills`, and `domain_tags`
- `AGENT.md` + optional `SKILLS.md` and `RULES.md`, with lists extracted from markdown bullet points
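Whatever the on-disk format, what gets extracted boils down to a small record per agent. A hypothetical Go shape, using the field names listed above (the struct itself is illustrative, not the tool's API):

```go
package agents

// AgentDef is an illustrative view of what gets extracted from each
// definition, whatever the on-disk format.
type AgentDef struct {
	Name           string   // agent identifier
	SystemPrompt   string   // system_prompt or instructions
	Skills         []string // declared skills
	Rules          []string // behavioural rules
	ClaimedDomains []string // claimed_domains / domain_tags
}
```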
Install with Go, Homebrew, or a prebuilt binary:

```sh
# Go
go install github.com/thinkwright/agent-evals@latest

# Homebrew
brew install thinkwright/tap/agent-evals

# Prebuilt binary
curl -fsSL https://github.com/thinkwright/agent-evals/releases/latest/download/agent-evals_$(uname -s)_$(uname -m).tar.gz | tar xz
```

Then point it at a directory of agent definitions:

```sh
agent-evals check ./agents/
```
- Detect scope overlap between agents.
- Score boundary awareness and calibration.
- Find coverage gaps across your fleet.
- Test with live LLM probes or static analysis alone.
`agent-evals check` is static analysis only: no API calls, no credentials. It reads agent definitions from disk, extracts domains via keyword analysis, computes pairwise overlap using Jaccard similarity and LCS-based prompt comparison, and flags conflicts and coverage gaps.
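Both measures are standard set and sequence comparisons. As a rough sketch of the idea, assuming domains are reduced to keyword sets (`jaccard` and `lcsRatio` below are illustrative helpers, not the tool's internals):

```go
package overlap

import "strings"

// jaccard returns |A ∩ B| / |A ∪ B| for two keyword sets, e.g. the
// claimed domains extracted from two agent definitions.
func jaccard(a, b map[string]bool) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 0
	}
	inter, union := 0, len(b)
	for k := range a {
		if b[k] {
			inter++
		} else {
			union++
		}
	}
	return float64(inter) / float64(union)
}

// lcsRatio compares two prompts word by word: the length of their
// longest common subsequence divided by the length of the shorter prompt.
func lcsRatio(p1, p2 string) float64 {
	a, b := strings.Fields(p1), strings.Fields(p2)
	if len(a) == 0 || len(b) == 0 {
		return 0
	}
	// Classic dynamic-programming LCS table.
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] > dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	shorter := len(a)
	if len(b) < shorter {
		shorter = len(b)
	}
	return float64(dp[len(a)][len(b)]) / float64(shorter)
}
```

An agent pair scoring high on either measure is a candidate scope overlap.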
`agent-evals test` does everything `check` does, plus live boundary probes: it generates out-of-scope questions for each agent, sends them through your LLM provider, and scores refusal health and calibration.
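In outline, a probing loop might look like the sketch below. The `LLM` interface and the `looksLikeRefusal` heuristic are stand-ins for whatever provider client and scoring logic the tool actually uses; treat this as an illustration of the mechanism, not its implementation.

```go
package probes

import (
	"context"
	"strings"
)

// LLM is a stand-in for a provider client (Anthropic, OpenAI, ...).
type LLM interface {
	Complete(ctx context.Context, prompt string, temperature float64) (string, error)
}

// refusalHealth sends out-of-scope questions to an agent's model and
// returns the fraction that were appropriately declined or hedged.
func refusalHealth(ctx context.Context, llm LLM, systemPrompt string, outOfScope []string) (float64, error) {
	if len(outOfScope) == 0 {
		return 1.0, nil // nothing to refuse
	}
	refused := 0
	for _, q := range outOfScope {
		resp, err := llm.Complete(ctx, systemPrompt+"\n\nUser: "+q, 0)
		if err != nil {
			return 0, err
		}
		if looksLikeRefusal(resp) {
			refused++
		}
	}
	return float64(refused) / float64(len(outOfScope)), nil
}

// looksLikeRefusal is a crude heuristic: a healthy out-of-scope answer
// acknowledges the limit rather than answering confidently.
func looksLikeRefusal(resp string) bool {
	r := strings.ToLower(resp)
	for _, marker := range []string{"i don't know", "outside my", "not able to help", "can't help with"} {
		if strings.Contains(r, marker) {
			return true
		}
	}
	return false
}
```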
When you run `agent-evals test`, boundary probes are sent through your LLM provider: one deterministic pass at temperature 0, then stochastic runs at 0.7 to measure response variance. Pass `--ci` for machine-friendly defaults.
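One way to turn those stochastic runs into the consistency score from the table above is average pairwise similarity between the sampled responses. Again a minimal, assumed sketch rather than the tool's actual metric:

```go
package probes

import "strings"

// consistency scores how stable a set of responses to the same probe is:
// 1.0 means every sampled run said essentially the same thing. It averages
// pairwise word-overlap (Jaccard) across all response pairs.
func consistency(responses []string) float64 {
	if len(responses) < 2 {
		return 1.0
	}
	sets := make([]map[string]bool, len(responses))
	for i, r := range responses {
		sets[i] = map[string]bool{}
		for _, w := range strings.Fields(strings.ToLower(r)) {
			sets[i][w] = true
		}
	}
	var total float64
	pairs := 0
	for i := 0; i < len(sets); i++ {
		for j := i + 1; j < len(sets); j++ {
			total += wordJaccard(sets[i], sets[j])
			pairs++
		}
	}
	return total / float64(pairs)
}

func wordJaccard(a, b map[string]bool) float64 {
	inter, union := 0, len(b)
	for w := range a {
		if b[w] {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 1.0
	}
	return float64(inter) / float64(union)
}
```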
Live probes support Anthropic, OpenAI, and any OpenAI-compatible endpoint, against hosted models or locally via Ollama. For Anthropic, set `ANTHROPIC_API_KEY` and go. For OpenAI, pass `--provider openai`. For an OpenAI-compatible endpoint such as a local Ollama server, pass `--provider openai-compatible --base-url http://localhost:11434/v1`.

`--ci` outputs JSON, disables the pager, and exits with code 1 when scores fall below configurable thresholds.
```yaml
# GitHub Actions
- name: Evaluate agents
  run: agent-evals check ./agents/ --ci

# With live probes
- name: Test agent boundaries
  run: agent-evals test ./agents/ --ci --provider anthropic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
Set `min_overall_score` and `min_boundary_score` in your `agent-evals.yaml` to control when CI fails. Defaults: 70% overall, 50% boundary.