Everything you need to evaluate, test, and maintain your LLM coding agent configurations. From first install to CI integration.
Working with a single coding agent means managing one system prompt. Working with five or fifteen means managing the gaps between them, and those gaps develop problems that individual agents don't have. Two agents might silently claim the same domain and give contradictory advice. An agent might confidently answer questions outside its scope because its prompt says "you are an expert" without saying where that expertise ends. An entire category of questions might fall between agents with no clear owner.
These problems tend to be invisible until a user runs into them. There's no built-in linter for system prompts and no test suite for whether an agent knows its own boundaries. agent-evals fills that gap. It reads your agent definitions in YAML, JSON, Markdown, or other formats and runs two kinds of analysis:
Static analysis catches structural issues with no API calls needed. It checks for scope overlap between agent pairs, coverage gaps where no agent owns a domain, missing boundary language, and contradictory instructions. It works like a linter for your agent definitions.
Live probes take things a step further by generating boundary questions tailored to each agent, focusing on topics the agent should hedge on or refuse, and sending them through your LLM provider. The tool then measures whether the agent actually hedges, how confident it claims to be, and whether its responses stay consistent across repeated runs. This is how you verify that your boundary definitions actually hold up at inference time.
The output is a report you can read in your terminal, pipe into CI as JSON, or paste into a PR as Markdown. Static checks are free and fast enough to run on every commit. Live probes are better suited to PRs or nightly runs, and budgets are configurable so you stay in control of cost.
```sh
# Go install
go install github.com/thinkwright/agent-evals@latest

# Or Homebrew
brew install thinkwright/tap/agent-evals
```
```sh
agent-evals check ./agents/
```
```sh
agent-evals check -r ./plugins/
```
```sh
export ANTHROPIC_API_KEY=sk-ant-...
agent-evals test ./agents/ --provider anthropic
```
- `check`: Reads agent definitions from disk, extracts domains, computes overlap, flags conflicts and gaps. No API calls, no credentials required.
- `test`: Everything in `check`, plus live boundary probes sent through your LLM provider. Measures real agent behavior.
Everything above is enough to get started. The sections below cover the full flag list, every supported agent format, configuration options, scoring formulas, and provider details. Skim what you need and skip the rest.
| Flag | Default | Description |
|---|---|---|
| `--ci` | false | CI mode: forces JSON output, disables pager, exits 1 on threshold failure |
| `--format` | terminal | Output format: `terminal`, `json`, `markdown` |
| `--config` | auto | Path to `agent-evals.yaml`. Auto-discovered in the agent directory if not specified |
| `-o, --output` | stdout | Write report to a file instead of stdout |
| `--no-pager` | false | Disable automatic paging for terminal output |
| `-r, --recursive` | false | Recursively scan nested directories for agent definitions. Deduplicates identical agents by content hash |
| `--no-dedup` | false | Disable content-hash deduplication (only with `--recursive`) |
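These flags compose; for example, a recursive scan of a plugin tree written out as a Markdown report (the paths here are illustrative):

```sh
agent-evals check -r ./plugins/ --format markdown -o plugins-report.md
```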
| Flag | Default | Description |
|---|---|---|
| `--provider` | anthropic | LLM provider: `anthropic`, `openai`, `openai-compatible` |
| `--model` | - | Model name for probes. Defaults to `claude-sonnet-4-5-20250514` (Anthropic) or `gpt-4o` (OpenAI) |
| `--base-url` | - | Base URL for `openai-compatible` providers (e.g. `http://localhost:11434/v1`) |
| `--api-key-env` | auto | Environment variable name for the API key. Defaults to `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` |
| `--probe-budget` | 500 | Maximum probes to generate. Actual API calls = budget × (1 + stochastic_runs) |
| `--stochastic-runs` | 5 | Number of stochastic repetitions per probe at temperature 0.7 |
| `--concurrency` | 3 | Maximum concurrent API calls |
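With these defaults, a full run can issue up to 500 × (1 + 5) = 3,000 API calls, with at most 3 in flight at once; lower `--probe-budget` or `--stochastic-runs` to cap spend.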
Point agent-evals at a directory of agent definitions. It auto-detects the format of each file.
- **YAML / JSON**: `system_prompt` or `instructions`, plus optional `skills`, `rules`, and `domains`. The ID is inferred from the filename if not set.
- **Markdown with frontmatter**: `name`, `skills`, `domain_tags` in the frontmatter; the Markdown body becomes the system prompt.
- **Agent directory**: `AGENT.md` as the system prompt, with optional `SKILLS.md` and `RULES.md`. Bullet-point lists are automatically extracted.

A minimal YAML agent looks like this:

```yaml
id: backend_api
name: Backend API Engineer
system_prompt: |
  You are a senior backend engineer specializing in
  Go microservices, PostgreSQL, and API design...
skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design
rules:
  - Always use connection pooling
  - Prefer gRPC for internal services
domains:
  - backend
  - databases
  - api_design
```
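For reference, the same agent expressed as Markdown with frontmatter might look like the sketch below (field names follow the format notes above; the exact layout in your repository may differ):

```markdown
---
name: Backend API Engineer
skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design
domain_tags:
  - backend
  - databases
  - api_design
---

You are a senior backend engineer specializing in Go microservices,
PostgreSQL, and API design...
```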
Create an agent-evals.yaml in your agents directory or pass it explicitly with --config. All fields are optional.
```yaml
# Domains to analyze (omit to use all 18 built-ins)
domains:
  - backend        # built-in reference
  - frontend
  - databases
  - security

  # Extend a built-in domain with extra keywords
  - name: backend
    extends: builtin
    keywords: [axum, actix-web, tokio]

  # Add a fully custom domain
  - name: payments
    keywords: [payment gateway, stripe, plaid, ach transfer]

# CI failure thresholds
thresholds:
  max_overlap_score: 0.3        # Max pairwise overlap (30%)
  min_overall_score: 0.7        # Min overall pass score (70%)
  min_boundary_score: 0.5       # Min boundary awareness (50%)
  min_calibration_score: 0.6    # Min calibration (60%)
  max_refusal_suppression: 0.2  # Max inappropriate refusals

# Live probe configuration
probes:
  budget: 200
  model: claude-sonnet-4-5-20250514
  provider: anthropic
  api_key_env: ANTHROPIC_API_KEY
  stochastic_runs: 5
  base_url: ""  # For openai-compatible only
```
If no `--config` flag is passed, agent-evals looks for `agent-evals.yaml` in the target directory. If the file isn't found there either, built-in defaults are used.
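To keep the config outside the agents directory, pass it explicitly (the path here is illustrative):

```sh
agent-evals check ./agents/ --config ./ci/agent-evals.yaml
```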
The `domains` field controls which domains are analyzed. Each entry can be a string (a built-in reference), a map that extends a built-in with extra keywords (`extends: builtin`), or a fully custom domain with its own keyword list. Omit `domains` entirely to use all 18 built-in domains.
Concretely, each entry is one of:

- a built-in referenced by name (e.g. `backend`, `security`)
- a built-in extended with extra keywords via `extends: builtin`
- a fully custom domain with its own `name` and `keywords` fields

Suppose your team has a payments agent but none of the 18 built-in domains cover payment processing. You can define a `payments` domain so that agent-evals knows to check for coverage, flag overlaps with other agents that mention Stripe, and generate boundary probes around payment topics.
```yaml
domains:
  - backend
  - frontend
  - security
  - name: payments
    keywords:
      - payment gateway
      - stripe
      - plaid
      - ach transfer
      - pci compliance
      - checkout flow
      - billing
      - subscription
      - refund
      - invoicing
```
Keywords are matched case-insensitively against each agent's system prompt. The more keywords that appear in a prompt, the higher that agent scores for the domain. A few guidelines for choosing good keywords:
- Prefer specific terms like `stripe` and `ach transfer` over generic terms like `money` or `pay` that could match unrelated prompts.
- Mix conceptual terms (`checkout flow`, `pci compliance`) with specific technologies (`stripe`, `plaid`) for broader coverage.

Once defined, the custom domain works exactly like a built-in: it appears in overlap analysis, gap detection, and the terminal report. If no agent scores above 0.5 for `payments`, it shows up as a coverage gap.
Domain detection uses keyword matching against agent system prompts and declared domains fields. These 18 categories are built in, and you can extend them or add your own via the configuration file. The full keyword lists, probe details, and customization reference are documented in DOMAINS.md.
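To make the matching concrete, here is a minimal Go sketch of one way keyword-based scoring could work, assuming the score is simply the fraction of a domain's keywords found in the prompt; this illustrates the idea rather than agent-evals' actual formula (see DOMAINS.md for that):

```go
package main

import (
	"fmt"
	"strings"
)

// domainScore is a simplified illustration of keyword-based domain scoring:
// the fraction of a domain's keywords that appear (case-insensitively) in the
// agent's system prompt. The real formula may weight matches differently.
func domainScore(systemPrompt string, keywords []string) float64 {
	prompt := strings.ToLower(systemPrompt)
	matched := 0
	for _, kw := range keywords {
		if strings.Contains(prompt, strings.ToLower(kw)) {
			matched++
		}
	}
	if len(keywords) == 0 {
		return 0
	}
	return float64(matched) / float64(len(keywords))
}

func main() {
	payments := []string{"payment gateway", "stripe", "plaid", "ach transfer", "pci compliance"}
	prompt := "You handle Stripe checkout flows and PCI compliance reviews."
	fmt.Printf("payments score: %.2f\n", domainScore(prompt, payments)) // prints 0.40
}
```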
Static analysis reads your agent files, extracts what each agent claims to do, and checks whether your agents as a whole are consistent. It requires no API calls or credentials and runs in under a second. This is what the check command does, and it's the part you would typically run on every commit.
Under the hood, it performs five analysis passes: conflict detection (opposing instructions between agents), pairwise overlap scoring, coverage-gap detection, boundary-language checks, and uncertainty-guidance checks. These map onto the issue categories in the severity table below.
Static analysis tells you what agents claim to do, and live probes tell you what they actually do. For example, if your backend agent's system prompt says "defer to the security team on auth questions," live probes check whether it actually defers when asked about OAuth token rotation or whether it confidently answers anyway.
The test command generates targeted boundary questions for each agent, focusing on the edges of its declared expertise, and sends them through your LLM provider. Each probe runs once deterministically and then multiple times with randomness so the tool can measure both what the agent says and how consistently it says it.
Live probes cost API calls; use `--probe-budget` to control spend.
This section documents how each score is computed so you can interpret results and tune thresholds accordingly. The formulas are here for reference when a score looks unexpected and you want to understand what went into it.
| Severity | Category | Condition |
|---|---|---|
| ERROR | conflict | Opposing instructions detected between agents |
| WARNING | overlap | Overlap score exceeds max_overlap_score |
| WARNING | gap | Domain best score < 0.5 |
| INFO | boundary | Agent lacks boundary language |
| INFO | uncertainty | Agent lacks uncertainty guidance |
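If a WARNING-level overlap is expected for your setup, the `thresholds` block in `agent-evals.yaml` is the lever to adjust; for example, raising the overlap ceiling (the value here is illustrative):

```yaml
thresholds:
  max_overlap_score: 0.45  # tolerate up to 45% pairwise overlap before warning
```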
agent-evals supports three provider settings:

- `anthropic`: defaults to `claude-sonnet-4-5-20250514`. Set `ANTHROPIC_API_KEY`.
- `openai`: defaults to `gpt-4o`. Set `OPENAI_API_KEY`.
- `openai-compatible`: any OpenAI-compatible endpoint; set `--base-url` and `--model`.

```sh
# Anthropic (default)
agent-evals test ./agents/ --provider anthropic

# OpenAI
agent-evals test ./agents/ --provider openai --model gpt-4o

# Ollama (local)
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1

# Cerebras
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model llama3.1-70b \
  --base-url https://api.cerebras.ai/v1 \
  --api-key-env CEREBRAS_API_KEY
```
Three output formats are available via `--format`:

- `terminal`: ANSI-colored output with progress bars, emoji severity, and automatic paging. The default.
- `json`: machine-readable structured output, ideal for CI artifact consumption and programmatic processing.
- `markdown`: PR-comment-friendly tables and emoji status. Paste directly into GitHub/GitLab reviews.
```sh
# Terminal (default, with pager)
agent-evals check ./agents/

# JSON to file
agent-evals check ./agents/ --format json -o report.json

# Markdown to file
agent-evals test ./agents/ --format markdown -o EVAL.md

# CI mode (auto JSON, no pager, exit code 1 on failure)
agent-evals check ./agents/ --ci
```
The --ci flag forces JSON output, disables the pager, and exits with code 1 when scores fall below thresholds.
```yaml
name: Agent Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Install agent-evals
        run: go install github.com/thinkwright/agent-evals@latest

      # Static analysis only (no API key needed)
      - name: Static check
        run: agent-evals check ./agents/ --ci

      # Full test with live probes (optional)
      - name: Live boundary test
        run: agent-evals test ./agents/ --ci --provider anthropic
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
| Code | Meaning |
|---|---|
| 0 | All checks passed, no threshold violations |
| 1 | Errors detected or CI thresholds not met (overall < 70%, boundary < 50%, or ERROR-level issues) |
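Because CI integration hinges on that exit code, any script can gate on it directly; a minimal sketch using the flags documented above:

```sh
# Fail the surrounding script on errors or threshold violations,
# keeping the JSON report as an artifact.
agent-evals check ./agents/ --ci --output report.json || {
  echo "agent-evals found errors or threshold violations"
  exit 1
}
```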
A good split: run `check --ci` on every push (free, instant) and reserve `test --ci` for PRs or nightly runs with a capped `--probe-budget`.
```sh
agent-evals check ./agents/
```
```sh
agent-evals check -r ./plugins/

# Without dedup (see every file individually)
agent-evals check -r --no-dedup ./plugins/
```
```sh
agent-evals check ./agents/ --ci --output report.json
```
```sh
export ANTHROPIC_API_KEY=sk-ant-...

agent-evals test ./agents/ \
  --provider anthropic \
  --model claude-sonnet-4-5-20250514 \
  --probe-budget 500 \
  --format markdown \
  -o test_report.md
```
```sh
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1 \
  --probe-budget 200
```
```sh
agent-evals test ./agents/ \
  --config ./strict-eval.yaml \
  --ci \
  --output results.json
```