Everything you need to evaluate, test, and maintain your LLM coding agent configurations. From first install to CI integration.
Working with a single LLM agent means managing one system prompt. Working with five or fifteen means managing a fleet, and fleets develop problems that individual agents don't have. Two agents might silently claim the same domain and give contradictory advice. An agent might confidently answer questions outside its scope because its prompt says "you are an expert" without saying where that expertise ends. An entire category of questions might fall between agents with no clear owner.
These problems tend to be invisible until a user runs into them. There's no built-in linter for system prompts and no test suite for whether an agent knows its own boundaries. agent-evals fills that gap. It reads your agent definitions in YAML, JSON, Markdown, or other formats and runs two kinds of analysis:
Static analysis catches structural issues with no API calls needed. It checks for scope overlap between agent pairs, coverage gaps where no agent owns a domain, missing boundary language, and contradictory instructions. It works like a linter for your agent fleet.
Live probes take things a step further by generating boundary questions tailored to each agent, focusing on topics the agent should hedge on or refuse, and sending them through your LLM provider. The tool then measures whether the agent actually hedges, how confident it claims to be, and whether its responses stay consistent across repeated runs. This is how you verify that your boundary definitions actually hold up at inference time.
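To make the static checks concrete, here is the kind of problem they surface. The two definitions below (illustrative, in the YAML format documented later) would be flagged for both overlap and conflict: each claims the databases domain, and their rules give opposing guidance on connection handling.

```yaml
# payments_api.yaml (illustrative)
id: payments_api
system_prompt: You are an expert in payment services and their PostgreSQL schemas.
rules:
  - Always use connection pooling
domains:
  - backend
  - databases

# reporting.yaml (illustrative)
id: reporting
system_prompt: You are an expert in analytics queries and database performance.
rules:
  - Open a dedicated connection per long-running report query
domains:
  - databases
```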
The output is a report you can read in your terminal, pipe into CI as JSON, or paste into a PR as Markdown. Static checks are free and fast enough to run on every commit. Live probes are better suited to PRs or nightly runs, and budgets are configurable so you stay in control of cost.
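Because the report goes to stdout unless you write it to a file, the JSON form pipes cleanly into whatever tooling you already use, for example:

```bash
agent-evals check ./agents/ --format json | jq .
```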
```bash
# Go install
go install github.com/thinkwright/agent-evals@latest

# Or Homebrew
brew install thinkwright/tap/agent-evals
```
```bash
agent-evals check ./agents/
```
```bash
export ANTHROPIC_API_KEY=sk-ant-...
agent-evals test ./agents/ --provider anthropic
```
- **`check`** — Reads agent definitions from disk, extracts domains, computes overlap, flags conflicts and gaps. No API calls, no credentials required.
- **`test`** — Everything in `check`, plus live boundary probes sent through your LLM provider. Measures real agent behavior.
Everything above is enough to get started. The sections below cover the full flag list, every supported agent format, configuration options, scoring formulas, and provider details. Skim what you need and skip the rest.
| Flag | Default | Description |
|---|---|---|
| `--ci` | false | CI mode: forces JSON output, disables the pager, exits 1 on threshold failure |
| `--format` | terminal | Output format: `terminal`, `json`, `markdown` |
| `--config` | auto | Path to `agent-evals.yaml`. Auto-discovered in the agent directory if not specified |
| `-o, --output` | stdout | Write the report to a file instead of stdout |
| `--no-pager` | false | Disable automatic paging for terminal output |
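These flags compose as you would expect; for instance, to run against an explicit config and capture JSON for later inspection (file names are illustrative):

```bash
agent-evals check ./agents/ --config ./agent-evals.yaml --format json -o check-report.json
```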
| Flag | Default | Description |
|---|---|---|
| `--provider` | anthropic | LLM provider: `anthropic`, `openai`, `openai-compatible` |
| `--model` | – | Model name for probes. Defaults to `claude-sonnet-4-5-20250514` (Anthropic) or `gpt-4o` (OpenAI) |
| `--base-url` | – | Base URL for `openai-compatible` providers (e.g. `http://localhost:11434/v1`) |
| `--api-key-env` | auto | Environment variable holding the API key. Defaults to `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` |
| `--probe-budget` | 500 | Maximum probes to generate. Actual API calls = budget × (1 + stochastic_runs) |
| `--stochastic-runs` | 5 | Number of stochastic repetitions per probe at temperature 0.7 |
| `--concurrency` | 3 | Maximum concurrent API calls |
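The probe budget is the main cost lever: actual API calls are budget × (1 + stochastic_runs), so the defaults above allow up to 500 × (1 + 5) = 3,000 calls per run. A tighter run might look like this (values are illustrative):

```bash
# 100 probes × (1 deterministic + 3 stochastic runs) = at most 400 API calls
agent-evals test ./agents/ \
  --provider anthropic \
  --probe-budget 100 \
  --stochastic-runs 3
```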
Point agent-evals at a directory of agent definitions. It auto-detects the format of each file.
- **YAML / JSON** — `system_prompt` or `instructions`, plus optional `skills`, `rules`, and `domains`. ID is inferred from the filename if not set.
- **Markdown with frontmatter** — `name`, `skills`, `domain_tags` in the frontmatter; the Markdown body becomes the system prompt.
- **Agent directory** — `AGENT.md` as the system prompt, with optional `SKILLS.md` and `RULES.md`. Bullet-point lists are automatically extracted.

```yaml
id: backend_api
name: Backend API Engineer
system_prompt: |
  You are a senior backend engineer specializing in Go microservices,
  PostgreSQL, and API design...
skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design
rules:
  - Always use connection pooling
  - Prefer gRPC for internal services
domains:
  - backend
  - databases
  - api_design
```
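For comparison, a minimal agent in the Markdown-frontmatter style might look like this sketch (field names come from the format description above; the values and body are illustrative):

```markdown
---
name: Frontend Engineer
skills:
  - React
  - CSS architecture
domain_tags:
  - frontend
---

You are a senior frontend engineer focused on React applications.
Defer to the backend agent on database and API implementation questions.
```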
Create an agent-evals.yaml in your agents directory or pass it explicitly with --config. All fields are optional.
```yaml
# Domains your fleet should cover
domains:
  - backend
  - frontend
  - devops
  - databases
  - security
  - testing

# CI failure thresholds
thresholds:
  max_overlap_score: 0.3        # Max pairwise overlap (30%)
  min_overall_score: 0.7        # Min overall pass score (70%)
  min_boundary_score: 0.5       # Min boundary awareness (50%)
  min_calibration_score: 0.6    # Min calibration (60%)
  max_refusal_suppression: 0.2  # Max inappropriate refusals

# Live probe configuration
probes:
  budget: 200
  model: claude-sonnet-4-5-20250514
  provider: anthropic
  api_key_env: ANTHROPIC_API_KEY
  stochastic_runs: 5
  base_url: ""  # For openai-compatible only
```
If no `--config` flag is passed, agent-evals looks for `agent-evals.yaml` in the target directory. If it isn't found, built-in defaults are used.
Domain detection uses keyword matching against agent system prompts and declared `domains` fields. Nineteen domain categories are built in; the full keyword lists and probe details are documented in DOMAINS.md.
Static analysis reads your agent files, extracts what each agent claims to do, and checks whether the fleet as a whole is consistent. It requires no API calls or credentials and runs in under a second. This is what the check command does, and it's the part you would typically run on every commit.
Under the hood, it performs five analysis passes, one per issue category in the severity table below:

- **Overlap** — computes pairwise scope overlap between agents and flags pairs above the configured threshold.
- **Conflict** — detects opposing instructions between agents.
- **Gap** — checks whether every configured domain has an agent that clearly owns it.
- **Boundary** — flags agents whose prompts never say where their expertise ends.
- **Uncertainty** — flags agents that give no guidance on handling questions they are unsure about.
Static analysis tells you what agents claim to do, and live probes tell you what they actually do. For example, if your backend agent's system prompt says "defer to the security team on auth questions," live probes check whether it actually defers when asked about OAuth token rotation or whether it confidently answers anyway.
The test command generates targeted boundary questions for each agent, focusing on the edges of its declared expertise, and sends them through your LLM provider. Each probe runs once deterministically and then multiple times with randomness so the tool can measure both what the agent says and how consistently it says it.
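Probes only have something to verify if the definition actually states its boundaries. A sketch of the kind of rules they exercise, in the YAML format shown earlier (wording is illustrative):

```yaml
id: backend_api
rules:
  - Defer to the security agent on authentication and token-rotation questions
  - If asked about frontend frameworks, say that is outside your scope
```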
Use `--probe-budget` to control spend.
This section documents how each score is computed so you can interpret results and tune thresholds accordingly. The formulas are here for reference when a score looks unexpected and you want to understand what went into it.
| Severity | Category | Condition |
|---|---|---|
| ERROR | conflict | Opposing instructions detected between agents |
| WARNING | overlap | Overlap score exceeds max_overlap_score |
| WARNING | gap | Domain best score < 0.5 |
| INFO | boundary | Agent lacks boundary language |
| INFO | uncertainty | Agent lacks uncertainty guidance |
- **anthropic** (default) — defaults to `claude-sonnet-4-5-20250514`. Set `ANTHROPIC_API_KEY`.
- **openai** — defaults to `gpt-4o`. Set `OPENAI_API_KEY`.
- **openai-compatible** — requires `--base-url` and `--model`.

```bash
# Anthropic (default)
agent-evals test ./agents/ --provider anthropic

# OpenAI
agent-evals test ./agents/ --provider openai --model gpt-4o

# Ollama (local)
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1

# Cerebras
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model llama3.1-70b \
  --base-url https://api.cerebras.ai/v1 \
  --api-key-env CEREBRAS_API_KEY
```
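The same provider settings can live in `agent-evals.yaml` instead of flags, using the `probes` keys shown in the configuration section (the Ollama values here are illustrative):

```yaml
probes:
  provider: openai-compatible
  model: mistral:7b
  base_url: http://localhost:11434/v1
```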
- **terminal** — ANSI-colored output with progress bars, emoji severity, and automatic paging. The default format.
- **json** — Machine-readable structured output. Ideal for CI artifact consumption and programmatic processing.
- **markdown** — PR-comment-friendly tables and emoji status. Paste directly into GitHub/GitLab reviews.
```bash
# Terminal (default, with pager)
agent-evals check ./agents/

# JSON to file
agent-evals check ./agents/ --format json -o report.json

# Markdown to file
agent-evals test ./agents/ --format markdown -o EVAL.md

# CI mode (auto JSON, no pager, exit code 1 on failure)
agent-evals check ./agents/ --ci
```
The --ci flag forces JSON output, disables the pager, and exits with code 1 when scores fall below thresholds.
```yaml
name: Agent Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Install agent-evals
        run: go install github.com/thinkwright/agent-evals@latest

      # Static analysis only (no API key needed)
      - name: Static check
        run: agent-evals check ./agents/ --ci

      # Full test with live probes (optional)
      - name: Live boundary test
        run: agent-evals test ./agents/ --ci --provider anthropic
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
| Code | Meaning |
|---|---|
| 0 | All checks passed, no threshold violations |
| 1 | Errors detected or CI thresholds not met (overall < 70%, boundary < 50%, or ERROR-level issues) |
Run `check --ci` on every push (free, instant) and reserve `test --ci` for PRs or nightly runs with a capped `--probe-budget`.
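Because `--ci` maps failures to a non-zero exit code, the static check also drops into local scripts; a pre-push hook might look like this sketch (paths are illustrative):

```bash
#!/bin/sh
# .git/hooks/pre-push (illustrative): block the push if the fleet check fails
agent-evals check ./agents/ --ci --output /tmp/agent-evals-report.json || {
  echo "agent-evals found fleet issues; see /tmp/agent-evals-report.json"
  exit 1
}
```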
```bash
agent-evals check ./agents/
```
```bash
agent-evals check ./agents/ --ci --output report.json
```
```bash
export ANTHROPIC_API_KEY=sk-ant-...

agent-evals test ./agents/ \
  --provider anthropic \
  --model claude-sonnet-4-5-20250514 \
  --probe-budget 500 \
  --format markdown \
  -o test_report.md
```
```bash
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1 \
  --probe-budget 200
```
```bash
agent-evals test ./agents/ \
  --config ./strict-eval.yaml \
  --ci \
  --output results.json
```