go install github.com/thinkwright/agent-evals@latest
agent-evals check ./agents/
brew install thinkwright/tap/agent-evals
curl -fsSL https://github.com/thinkwright/agent-evals/releases/latest/download/agent-evals_$(uname -s)_$(uname -m).tar.gz | tar xz
Agents Overlap
Overlap analysis, boundary testing, and metacognitive scoring for LLM coding agents.
Managing multiple coding agents means managing the gaps between them: two agents silently claiming the same domain, another confidently answering outside its scope, categories of questions with no owner. agent-evals catches these problems before your team does. Static analysis scans your system prompts for overlap, conflicts, and coverage gaps, while live probes verify that boundary definitions hold up at inference time.
Two Modes
agent-evals check runs static analysis only: no API calls, no credentials. It reads agent definitions from disk, extracts domains via keyword analysis, computes pairwise overlap using Jaccard similarity and LCS-based prompt comparison, and flags conflicts and coverage gaps.
- Domain extraction from system prompts across 18 built-in domains, extensible via config
- Pairwise overlap scoring using Jaccard similarity on domain sets and LCS-based prompt similarity
- Conflict detection via regex opposition pair matching
- Coverage gap identification for uncovered and weakly-covered domains
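The overlap scoring described above can be sketched in a few lines. This is a minimal illustration, not agent-evals' actual implementation: the agent names, domain sets, and prompts are hypothetical, and the word-level tokenization and normalization by the longer prompt are assumptions.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lcs_length(x: list, y: list) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(p: str, q: str) -> float:
    """Word-level LCS length divided by the longer prompt's word count (assumed normalization)."""
    x, y = p.lower().split(), q.lower().split()
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 0.0


# Hypothetical agents: name -> (domain set, system prompt).
agents = {
    "backend": ({"api", "database", "security"}, "You are a backend engineer who designs APIs"),
    "dba": ({"database", "performance"}, "You are a database engineer who tunes queries"),
}

# Score every unordered pair of agents on both measures.
for (name_a, (dom_a, prompt_a)), (name_b, (dom_b, prompt_b)) in combinations(agents.items(), 2):
    print(name_a, name_b, jaccard(dom_a, dom_b), lcs_similarity(prompt_a, prompt_b))
```

Using two measures is a natural fit for the problem: domain-set overlap catches agents that claim the same territory in different words, while prompt similarity catches near-duplicate prompts regardless of which domains were extracted.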
agent-evals test does everything in check, plus live boundary probes. It generates out-of-scope questions for each agent, sends them through your LLM provider, and scores refusal health and calibration.
- Boundary probes: out-of-scope questions that well-configured agents should hedge or refuse
- Calibration scoring: measures whether confidence levels match actual capability
- Refusal health: tracks appropriate hedging on questions agents shouldn't answer
- Consistency: runs each probe multiple times at temperature 0.7 to measure response variance
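As a rough sketch of how repeated probe runs might be aggregated: the temperatures (0 and 0.7) come from this page, but the `ProbeResult` type, the way hedging is detected, and both formulas are assumptions made for illustration, not the tool's documented scoring.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProbeResult:
    temperature: float
    hedged: bool  # True if the agent hedged or refused; detection itself is out of scope here


def run_plan(stochastic_runs: int) -> list:
    """One deterministic pass at temperature 0, then repeated runs at 0.7."""
    return [0.0] + [0.7] * stochastic_runs


def refusal_rate(results: list) -> float:
    """Fraction of out-of-scope probe runs in which the agent hedged or refused."""
    return sum(r.hedged for r in results) / len(results) if results else 0.0


def consistency(results: list) -> float:
    """Share of stochastic runs that agree with the most common outcome (assumed metric)."""
    stochastic = [r.hedged for r in results if r.temperature > 0]
    if not stochastic:
        return 1.0
    return Counter(stochastic).most_common(1)[0][1] / len(stochastic)


# Hypothetical outcomes for one probe: the agent hedges in all but one stochastic run.
outcomes = [True, True, True, False, True]
results = [ProbeResult(t, h) for t, h in zip(run_plan(4), outcomes)]
print(refusal_rate(results), consistency(results))
```

Calibration is left out of this sketch because it needs a ground-truth notion of accuracy that the page does not specify.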
How It Works
With agent-evals test, boundary probes are sent through your LLM provider: one deterministic pass at temperature 0, then stochastic runs at 0.7 to measure response variance. Pass --ci for machine-friendly defaults.
Any Provider
Live probes support Anthropic, OpenAI, and any OpenAI-compatible endpoint. Run against hosted models or locally via Ollama.
- Anthropic: set ANTHROPIC_API_KEY and go.
- OpenAI: pass --provider openai.
- OpenAI-compatible endpoints such as Ollama: pass --provider openai-compatible --base-url http://localhost:11434/v1.

CI Integration
--ci outputs JSON, disables the pager, and exits with code 1 when scores fall below configurable thresholds.
    # GitHub Actions
    - name: Evaluate agents
      run: agent-evals check ./agents/ --ci

    # With live probes
    - name: Test agent boundaries
      run: agent-evals test ./agents/ --ci --provider anthropic
      env:
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Set min_overall_score and min_boundary_score in your agent-evals.yaml to control when CI fails. Defaults: 70% overall, 50% boundary.
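The threshold behaviour can be pictured as a small gate over the JSON report. The sketch below is an assumption-heavy illustration: the report field names `overall_score` and `boundary_score` and the 0 to 100 percentage scale are hypothetical, and only the threshold names and defaults come from this page.

```python
import json
from typing import Optional

# Defaults stated on this page: 70% overall, 50% boundary.
DEFAULT_THRESHOLDS = {"min_overall_score": 70.0, "min_boundary_score": 50.0}


def ci_exit_code(report_json: str, thresholds: Optional[dict] = None) -> int:
    """Return 1 if either score falls below its threshold, otherwise 0."""
    limits = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    report = json.loads(report_json)  # hypothetical report shape, see lead-in
    failed = (
        report["overall_score"] < limits["min_overall_score"]
        or report["boundary_score"] < limits["min_boundary_score"]
    )
    return 1 if failed else 0


# A report that passes the defaults but fails a stricter boundary threshold.
report = json.dumps({"overall_score": 82.0, "boundary_score": 55.0})
print(ci_exit_code(report))
print(ci_exit_code(report, {"min_boundary_score": 60.0}))
```

Raising a threshold in this way tightens the gate without touching the pipeline definition itself.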
Reference
| Score | What it measures |
|---|---|
| Boundary awareness | How often the agent correctly hedges on out-of-scope questions |
| Calibration | Whether stated confidence matches demonstrated accuracy |
| Refusal health | Whether the agent says "I don't know" when it should |
| Consistency | How stable responses are across repeated stochastic runs |
- system_prompt or instructions, plus optional skills, rules, claimed_domains
- name, skills, domain_tags
- AGENT.md + optional SKILLS.md and RULES.md. Lists extracted from markdown bullet points
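For the markdown-based layout, extracting lists from bullet points could look roughly like the sketch below. The regular expression and the `bullet_items` helper are hypothetical, the sample text is invented, and the tool's real parser may treat nested or numbered lists differently.

```python
import re

# A bullet is a line starting with -, * or + followed by whitespace and some text.
BULLET = re.compile(r"^\s*[-*+]\s+(.*\S)\s*$")


def bullet_items(markdown: str) -> list:
    """Return the text of every bullet line, in document order."""
    items = []
    for line in markdown.splitlines():
        match = BULLET.match(line)
        if match:
            items.append(match.group(1))
    return items


# Hypothetical SKILLS.md content.
skills_md = """# Skills

- Go services
- PostgreSQL tuning

Some prose that is not a list item.
* gRPC
---
"""

print(bullet_items(skills_md))
```

Requiring whitespace after the marker keeps horizontal rules such as `---` and emphasis such as `**bold**` from being mistaken for list items.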