What is agent-evals?
Working with a single coding agent means managing one system prompt. Working with five or fifteen means managing the gaps between them, and those gaps develop problems that individual agents don't have. Two agents might silently claim the same domain and give contradictory advice. An agent might confidently answer questions outside its scope because its prompt says "you are an expert" without saying where that expertise ends. An entire category of questions might fall between agents with no clear owner.
These problems tend to be invisible until a user runs into them. There's no built-in linter for system prompts and no test suite for whether an agent knows its own boundaries. agent-evals fills that gap. It reads your agent definitions in YAML, JSON, Markdown, or other formats and runs two kinds of analysis:
Static analysis catches structural issues with no API calls needed. It checks for scope overlap between agent pairs, coverage gaps where no agent owns a domain, missing boundary language, and contradictory instructions. It works like a linter for your agent definitions.
Live probes take things a step further by generating boundary questions tailored to each agent, focusing on topics the agent should hedge on or refuse, and sending them through your LLM provider. The tool then measures whether the agent actually hedges, how confident it claims to be, and whether its responses stay consistent across repeated runs. This is how you verify that your boundary definitions actually hold up at inference time.
The output is a report you can read in your terminal, pipe into CI as JSON, or paste into a PR as Markdown. Static checks are free and fast enough to run on every commit. Live probes are better suited to PRs or nightly runs, and budgets are configurable so you stay in control of cost.
Quick Start
    # Go install
    go install github.com/thinkwright/agent-evals@latest

    # Or Homebrew
    brew install thinkwright/tap/agent-evals

    # Static analysis (no API key needed)
    agent-evals check ./agents/

    # Recursive scan of nested directories
    agent-evals check -r ./plugins/

    # Live probes through your LLM provider
    export ANTHROPIC_API_KEY=sk-ant-...
    agent-evals test ./agents/ --provider anthropic
Commands
The check command reads agent definitions from disk, extracts domains, computes overlap, and flags conflicts and gaps. It makes no API calls and needs no credentials.
- Domain extraction across 18 built-in categories
- Pairwise overlap via Jaccard + LCS
- Conflict detection for contradictory instructions
- Coverage gap identification
- Boundary and uncertainty language scoring
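The pairwise overlap bullet names two standard similarity measures, Jaccard and longest common subsequence (LCS). The sketch below shows what each one measures on its own. It assumes Jaccard is applied to sets of domain names and LCS to short skill strings; how agent-evals actually tokenizes its inputs and combines the two measures is not described here, so the function names and the normalization in `lcs_ratio` are illustrative only.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(x: str, y: str) -> float:
    """LCS length divided by the longer string's length (assumed normalization)."""
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 0.0


backend = {"backend", "databases", "api_design"}
data = {"databases", "analytics"}
print(jaccard(backend, data))  # 1 shared domain out of 4 distinct -> 0.25
print(lcs_ratio("postgresql tuning", "postgresql optimization"))
```

Jaccard rewards exact shared members, while LCS picks up near-duplicate phrasing that set membership would miss, which is presumably why the two are paired.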
The test command runs everything in check, plus live boundary probes sent through your LLM provider. It measures real agent behavior.
- Boundary probes: out-of-scope questions agents should hedge on
- Calibration scoring: confidence compared to actual capability
- Refusal health: appropriate hedging on unknown topics
- Consistency: response variance across stochastic runs
Reference
Everything above is enough to get started. The sections below cover the full flag list, every supported agent format, configuration options, scoring formulas, and provider details. Skim what you need and skip the rest.
CLI Flags
Shared flags (check & test)
| Flag | Default | Description |
|---|---|---|
| --ci | false | CI mode: forces JSON output, disables pager, exits 1 on threshold failure |
| --format | terminal | Output format: terminal, json, markdown |
| --config | auto | Path to agent-evals.yaml. Auto-discovered in the agent directory if not specified |
| -o, --output | stdout | Write report to a file instead of stdout |
| --no-pager | false | Disable automatic paging for terminal output |
| -r, --recursive | false | Recursively scan nested directories for agent definitions. Deduplicates identical agents by content hash |
| --no-dedup | false | Disable content-hash deduplication (only with --recursive) |
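To make the content-hash deduplication behind --recursive concrete, here is a minimal sketch. It assumes SHA-256 over the raw file bytes and keeps the first path in sorted order; the hash function and tie-breaking that agent-evals really uses are not stated, and `dedup_by_content` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def dedup_by_content(paths: list[Path]) -> list[Path]:
    """Keep one file per distinct content hash (assumed: SHA-256 of raw bytes)."""
    seen: set[str] = set()
    unique: list[Path] = []
    for path in sorted(paths):  # sorted so the surviving copy is predictable
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```

Two plugin directories that vendor the same agent file would collapse to a single entry under this scheme, which is the behavior --no-dedup switches off.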
Test-only flags
| Flag | Default | Description |
|---|---|---|
| --provider | anthropic | LLM provider: anthropic, openai, openai-compatible |
| --model | - | Model name for probes. Defaults to claude-sonnet-4-5-20250514 (Anthropic) or gpt-4o (OpenAI) |
| --base-url | - | Base URL for openai-compatible providers (e.g. http://localhost:11434/v1) |
| --api-key-env | auto | Environment variable name for API key. Defaults to ANTHROPIC_API_KEY or OPENAI_API_KEY |
| --probe-budget | 500 | Maximum probes to generate. Actual API calls = budget × (1 + stochastic_runs) |
| --stochastic-runs | 5 | Number of stochastic repetitions per probe at temperature 0.7 |
| --concurrency | 3 | Maximum concurrent API calls |
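The --probe-budget row gives the call-count formula, budget × (1 + stochastic_runs). A small helper makes the arithmetic explicit and inverts it for cases where you start from a call ceiling. The function names are hypothetical and not part of the CLI.

```python
def max_api_calls(probe_budget: int, stochastic_runs: int) -> int:
    """One deterministic call per probe plus one call per stochastic repetition."""
    return probe_budget * (1 + stochastic_runs)


def budget_for_call_limit(call_limit: int, stochastic_runs: int) -> int:
    """Largest probe budget whose total calls stay within call_limit."""
    return call_limit // (1 + stochastic_runs)


print(max_api_calls(500, 5))           # defaults: 500 * 6 = 3000
print(budget_for_call_limit(1000, 5))  # 1000 // 6 = 166
```

With the default flags, a full run can therefore issue up to 3000 API calls, which is worth knowing before wiring the test command into a nightly job.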
Agent Definition Formats
Point agent-evals at a directory of agent definitions. It auto-detects the format of each file.
- YAML or JSON file: system_prompt or instructions, optional skills, rules, domains. ID inferred from filename if not set.
- Markdown file: name, skills, domain_tags as metadata. The Markdown body becomes the system prompt.
- Agent directory: AGENT.md as system prompt, optional SKILLS.md and RULES.md. Bullet-point lists are automatically extracted.

A YAML definition looks like this:

    id: backend_api
    name: Backend API Engineer
    system_prompt: |
      You are a senior backend engineer specializing in
      Go microservices, PostgreSQL, and API design...
    skills:
      - PostgreSQL optimization
      - Go microservices
      - gRPC and REST API design
    rules:
      - Always use connection pooling
      - Prefer gRPC for internal services
    domains:
      - backend
      - databases
      - api_design
Configuration
Create an agent-evals.yaml in your agents directory or pass it explicitly with --config. All fields are optional.
    # Domains to analyze (omit to use all 18 built-ins)
    domains:
      - backend            # built-in reference
      - frontend
      - databases
      - security

      # Extend a built-in domain with extra keywords
      - name: backend
        extends: builtin
        keywords: [axum, actix-web, tokio]

      # Add a fully custom domain
      - name: payments
        keywords: [payment gateway, stripe, plaid, ach transfer]

    # CI failure thresholds
    thresholds:
      max_overlap_score: 0.3        # Max pairwise overlap (30%)
      min_overall_score: 0.7        # Min overall pass score (70%)
      min_boundary_score: 0.5       # Min boundary awareness (50%)
      min_calibration_score: 0.6    # Min calibration (60%)
      max_refusal_suppression: 0.2  # Max inappropriate refusals

    # Live probe configuration
    probes:
      budget: 200
      model: claude-sonnet-4-5-20250514
      provider: anthropic
      api_key_env: ANTHROPIC_API_KEY
      stochastic_runs: 5
      base_url: ""                  # For openai-compatible only
If no --config flag is passed, agent-evals looks for agent-evals.yaml in the target directory. If not found, built-in defaults are used.
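The lookup order for the configuration file can be sketched as follows. This is an illustration of the documented precedence (explicit flag, then the target directory, then defaults), not the tool's source; `resolve_config` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "agent-evals.yaml"


def resolve_config(agent_dir: str, config_flag: Optional[str] = None) -> Optional[Path]:
    """Return the config path to load, or None to fall back to built-in defaults."""
    if config_flag:
        # An explicit --config always wins.
        return Path(config_flag)
    candidate = Path(agent_dir) / CONFIG_NAME
    return candidate if candidate.is_file() else None
```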
Domain configuration
The domains field controls which domains are analyzed. Each entry can be a string (built-in reference), a map that extends a built-in with extra keywords (extends: builtin), or a fully custom domain with its own keyword list. Omit domains entirely to use all 18 built-in domains.
- String reference: selects a built-in domain by name (e.g. backend, security)
- Extend built-in: merges your keywords onto the built-in keyword list using extends: builtin
- Custom domain: defines a new domain with name and keywords fields
- Unknown string references are skipped with a stderr warning
- Duplicate domain names: last entry wins
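The resolution rules in the list above can be sketched in a few lines. The `BUILTINS` map is a two-entry stand-in with made-up keywords (the real lists live in DOMAINS.md), and `resolve_domains` is a hypothetical name; only the rules themselves come from this document.

```python
import sys

# Hypothetical stand-in for the built-in keyword lists.
BUILTINS = {
    "backend": ["api", "microservice"],
    "security": ["oauth", "encryption"],
}


def resolve_domains(entries: list) -> dict[str, list[str]]:
    """Turn `domains` config entries into a name -> keywords map."""
    resolved: dict[str, list[str]] = {}
    for entry in entries:
        if isinstance(entry, str):
            if entry not in BUILTINS:
                # Unknown string references are skipped with a warning on stderr.
                print(f"warning: unknown domain {entry!r}, skipping", file=sys.stderr)
                continue
            resolved[entry] = list(BUILTINS[entry])
        elif entry.get("extends") == "builtin":
            # Merge the extra keywords onto the built-in list.
            base = BUILTINS.get(entry["name"], [])
            resolved[entry["name"]] = list(base) + list(entry.get("keywords", []))
        else:
            # Fully custom domain: name plus its own keywords.
            resolved[entry["name"]] = list(entry.get("keywords", []))
        # Assigning by name means a later duplicate replaces an earlier one.
    return resolved


print(resolve_domains([
    "backend",
    {"name": "backend", "extends": "builtin", "keywords": ["axum"]},
    {"name": "payments", "keywords": ["stripe", "plaid"]},
]))
```

Because the last entry wins, listing backend as a plain string and then extending it, as the sample configuration does, leaves you with the extended keyword list.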
Creating a custom domain
Suppose your team has a payments agent but none of the 18 built-in domains cover payment processing. You can define a payments domain so that agent-evals knows to check for coverage, flag overlaps with other agents that mention Stripe, and generate boundary probes around payment topics.
    domains:
      - backend
      - frontend
      - security
      - name: payments
        keywords:
          - payment gateway
          - stripe
          - plaid
          - ach transfer
          - pci compliance
          - checkout flow
          - billing
          - subscription
          - refund
          - invoicing
Keywords are matched case-insensitively against each agent's system prompt. The more keywords that appear in a prompt, the higher that agent scores for the domain. A few guidelines for choosing good keywords:
- Be specific. Prefer stripe and ach transfer over generic terms like money or pay that could match unrelated prompts
- Include both concepts and tools. Mix domain concepts (checkout flow, pci compliance) with specific technologies (stripe, plaid) for broader coverage
- Use 5–15 keywords. Too few and the domain won't match reliably; too many dilutes the signal. The built-in domains average around 12 keywords each
Once defined, the custom domain works exactly like a built-in: it appears in overlap analysis, gap detection, and the terminal report. If no agent scores above 0.5 for payments, it shows up as a coverage gap.
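As a rough illustration of keyword matching and gap detection, the sketch below scores a prompt as the fraction of a domain's keywords it contains, ignoring case. That fraction is an assumption: the text above only says that more matching keywords give a higher score. The 0.5 cutoff comes from this document, and how a score of exactly 0.5 is treated is not specified.

```python
def domain_score(prompt: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the prompt, ignoring case (assumed formula)."""
    if not keywords:
        return 0.0
    text = prompt.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def is_coverage_gap(prompts: list[str], keywords: list[str], threshold: float = 0.5) -> bool:
    """A domain is a gap when the best-scoring agent falls below the threshold."""
    best = max((domain_score(p, keywords) for p in prompts), default=0.0)
    return best < threshold


payments = ["stripe", "plaid", "ach transfer", "refund"]
billing_agent = "You integrate Stripe and Plaid and process refund requests."
frontend_agent = "You are a frontend engineer working with React."

print(domain_score(billing_agent, payments))               # 3 of 4 keywords -> 0.75
print(is_coverage_gap([frontend_agent], payments))          # True: nobody covers payments
print(domain_score("Parse the JSON payload", ["pay"]))      # 1.0: "pay" matches "payload"
```

The last line shows why the guidelines favor specific keywords: with plain substring matching, a generic term like pay also fires on unrelated words.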
Recognized Domains
Domain detection uses keyword matching against agent system prompts and declared domains fields. These 18 categories are built in, and you can extend them or add your own via the configuration file. The full keyword lists, probe details, and customization reference are documented in DOMAINS.md.
Software Engineering
Non-Technical
Static Analysis
Static analysis reads your agent files, extracts what each agent claims to do, and checks whether your agents as a whole are consistent. It requires no API calls or credentials and runs in under a second. This is what the check command does, and it's the part you would typically run on every commit.
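Under the hood, it performs five analysis passes: domain extraction, pairwise overlap, conflict detection, coverage gap identification, and boundary and uncertainty language scoring.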
Live Probes
Static analysis tells you what agents claim to do, and live probes tell you what they actually do. For example, if your backend agent's system prompt says "defer to the security team on auth questions," live probes check whether it actually defers when asked about OAuth token rotation or whether it confidently answers anyway.
The test command generates targeted boundary questions for each agent, focusing on the edges of its declared expertise, and sends them through your LLM provider. Each probe runs once deterministically and then multiple times with randomness so the tool can measure both what the agent says and how consistently it says it.
Live probes make real API calls, so use --probe-budget to control spend.
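One simple way to turn repeated runs into a consistency number is majority agreement: classify each response as hedged or not, then report the share of runs that match the most common outcome. The sketch below does exactly that. Both the phrase list and the formula are hypothetical; they show the idea and are not the scoring that agent-evals applies.

```python
# Hypothetical phrase list; a real detector would need to be far more robust.
HEDGE_MARKERS = ("i'm not sure", "outside my", "defer to", "consult")


def hedges(response: str) -> bool:
    """True when the response contains any hedge marker, ignoring case."""
    text = response.lower()
    return any(marker in text for marker in HEDGE_MARKERS)


def consistency(responses: list[str]) -> float:
    """Share of runs that agree with the majority hedge/no-hedge outcome."""
    if not responses:
        return 0.0
    hedged = sum(hedges(r) for r in responses)
    return max(hedged, len(responses) - hedged) / len(responses)


runs = [
    "I'd defer to the security team on token rotation.",
    "That's outside my area; please consult the security team.",
    "Rotate tokens every 24 hours using a cron job.",
    "I would defer to the security team here.",
    "This is outside my scope.",
]
print(consistency(runs))  # 4 of 5 runs hedge -> 0.8
```

An agent that defers in four runs and answers confidently in the fifth has a boundary that holds most of the time, which is the kind of drift a single deterministic run cannot reveal.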
Scoring Reference
This section documents how each score is computed so you can interpret results and tune thresholds accordingly. The formulas are here for reference when a score looks unexpected and you want to understand what went into it.
Static scores (per agent)
Live probe scores (per agent)
Issue severity levels
| Severity | Category | Condition |
|---|---|---|
| ERROR | conflict | Opposing instructions detected between agents |
| WARNING | overlap | Overlap score exceeds max_overlap_score |
| WARNING | gap | Domain best score < 0.5 |
| INFO | boundary | Agent lacks boundary language |
| INFO | uncertainty | Agent lacks uncertainty guidance |
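The severity table maps directly onto a small rule set. The sketch below encodes each row as written, assuming the analysis results arrive as plain Python collections; the input shapes, the tuple format, and the name `classify_issues` are illustrative and do not reflect the report schema.

```python
def classify_issues(
    conflicts: list,
    overlaps: dict,
    domain_best: dict,
    no_boundary: list,
    no_uncertainty: list,
    max_overlap_score: float = 0.3,
    gap_threshold: float = 0.5,
) -> list[tuple]:
    """Return (severity, category, subject) tuples following the severity table."""
    issues = []
    for pair in conflicts:
        issues.append(("ERROR", "conflict", pair))
    for pair, score in overlaps.items():
        if score > max_overlap_score:
            issues.append(("WARNING", "overlap", pair))
    for domain, best in domain_best.items():
        if best < gap_threshold:
            issues.append(("WARNING", "gap", domain))
    for agent in no_boundary:
        issues.append(("INFO", "boundary", agent))
    for agent in no_uncertainty:
        issues.append(("INFO", "uncertainty", agent))
    return issues


for issue in classify_issues(
    conflicts=[("backend", "security")],
    overlaps={("backend", "data"): 0.42, ("frontend", "data"): 0.1},
    domain_best={"payments": 0.2, "backend": 0.9},
    no_boundary=["frontend"],
    no_uncertainty=[],
):
    print(issue)
```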
Providers
- anthropic: default model claude-sonnet-4-5-20250514. Set ANTHROPIC_API_KEY.
- openai: default model gpt-4o. Set OPENAI_API_KEY.
- openai-compatible: pass --base-url and --model.

    # Anthropic (default)
    agent-evals test ./agents/ --provider anthropic

    # OpenAI
    agent-evals test ./agents/ --provider openai --model gpt-4o

    # Ollama (local)
    agent-evals test ./agents/ \
      --provider openai-compatible \
      --model mistral:7b \
      --base-url http://localhost:11434/v1

    # Cerebras
    agent-evals test ./agents/ \
      --provider openai-compatible \
      --model llama3.1-70b \
      --base-url https://api.cerebras.ai/v1 \
      --api-key-env CEREBRAS_API_KEY
Output Formats
- terminal: ANSI-colored output with progress bars, emoji severity, and automatic paging. Default format.
- json: Machine-readable structured output. Ideal for CI artifact consumption and programmatic processing.
- markdown: PR-comment friendly tables and emoji status. Paste directly into GitHub/GitLab reviews.
    # Terminal (default, with pager)
    agent-evals check ./agents/

    # JSON to file
    agent-evals check ./agents/ --format json -o report.json

    # Markdown to file
    agent-evals test ./agents/ --format markdown -o EVAL.md

    # CI mode (auto JSON, no pager, exit code 1 on failure)
    agent-evals check ./agents/ --ci
CI Integration
The --ci flag forces JSON output, disables the pager, and exits with code 1 when scores fall below thresholds.
    name: Agent Evaluation
    on: [push, pull_request]

    jobs:
      evaluate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - uses: actions/setup-go@v5
            with:
              go-version: '1.22'

          - name: Install agent-evals
            run: go install github.com/thinkwright/agent-evals@latest

          # Static analysis only (no API key needed)
          - name: Static check
            run: agent-evals check ./agents/ --ci

          # Full test with live probes (optional)
          - name: Live boundary test
            run: agent-evals test ./agents/ --ci --provider anthropic
            env:
              ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Exit codes
| Code | Meaning |
|---|---|
| 0 | All checks passed, no threshold violations |
| 1 | Errors detected or CI thresholds not met (overall < 70%, boundary < 50%, or ERROR-level issues) |
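Read as a rule, the exit-code table says that any ERROR-level issue or any missed threshold fails the run. A minimal sketch of that decision, using only the two thresholds the table names (other configured thresholds may also participate; that is not stated here), with `exit_code` as a hypothetical helper:

```python
def exit_code(
    overall: float,
    boundary: float,
    severities: list[str],
    min_overall: float = 0.7,
    min_boundary: float = 0.5,
) -> int:
    """Return 1 on any ERROR-level issue or missed threshold, else 0."""
    if "ERROR" in severities:
        return 1
    if overall < min_overall or boundary < min_boundary:
        return 1
    return 0


print(exit_code(0.82, 0.60, ["WARNING", "INFO"]))  # 0: warnings alone do not fail
print(exit_code(0.82, 0.60, ["ERROR"]))            # 1: a conflict was detected
print(exit_code(0.65, 0.60, []))                   # 1: overall score below 70%
```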
Run check --ci on every push (free, instant) and reserve test --ci for PRs or nightly runs with a capped --probe-budget.
Examples
    # Static analysis of a single directory
    agent-evals check ./agents/

    # Recursive scan of nested directories
    agent-evals check -r ./plugins/

    # Without dedup (see every file individually)
    agent-evals check -r --no-dedup ./plugins/

    # CI mode, report written to a file
    agent-evals check ./agents/ --ci --output report.json

    # Live probes with Anthropic, Markdown report
    export ANTHROPIC_API_KEY=sk-ant-...
    agent-evals test ./agents/ \
      --provider anthropic \
      --model claude-sonnet-4-5-20250514 \
      --probe-budget 500 \
      --format markdown \
      -o test_report.md

    # Live probes against a local OpenAI-compatible endpoint
    agent-evals test ./agents/ \
      --provider openai-compatible \
      --model mistral:7b \
      --base-url http://localhost:11434/v1 \
      --probe-budget 200

    # Custom config in CI mode
    agent-evals test ./agents/ \
      --config ./strict-eval.yaml \
      --ci \
      --output results.json