Documentation

Everything you need to evaluate, test, and maintain your LLM coding agent configurations. From first install to CI integration.

What is agent-evals?

Working with a single LLM agent means managing one system prompt. Working with five or fifteen means managing a fleet, and fleets develop problems that individual agents don't have. Two agents might silently claim the same domain and give contradictory advice. An agent might confidently answer questions outside its scope because its prompt says "you are an expert" without saying where that expertise ends. An entire category of questions might fall between agents with no clear owner.

These problems tend to be invisible until a user runs into them. There's no built-in linter for system prompts and no test suite for whether an agent knows its own boundaries. agent-evals fills that gap. It reads your agent definitions in YAML, JSON, Markdown, or other formats and runs two kinds of analysis:

Static analysis catches structural issues with no API calls needed. It checks for scope overlap between agent pairs, coverage gaps where no agent owns a domain, missing boundary language, and contradictory instructions. It works like a linter for your agent fleet.

Live probes take things a step further by generating boundary questions tailored to each agent, focusing on topics the agent should hedge on or refuse, and sending them through your LLM provider. The tool then measures whether the agent actually hedges, how confident it claims to be, and whether its responses stay consistent across repeated runs. This is how you verify that your boundary definitions actually hold up at inference time.

The output is a report you can read in your terminal, pipe into CI as JSON, or paste into a PR as Markdown. Static checks are free and fast enough to run on every commit. Live probes are better suited to PRs or nightly runs, and budgets are configurable so you stay in control of cost.


Quick Start

Install
# Go install
go install github.com/thinkwright/agent-evals@latest

# Or Homebrew
brew install thinkwright/tap/agent-evals

Static analysis (no API key needed)
agent-evals check ./agents/

Full test with live probes
export ANTHROPIC_API_KEY=sk-ant-...
agent-evals test ./agents/ --provider anthropic

Commands

Static Analysis
check

Reads agent definitions from disk, extracts domains, computes overlap, flags conflicts and gaps. No API calls, no credentials required.

  • Domain extraction across 19 recognized categories
  • Pairwise overlap via Jaccard + LCS
  • Conflict detection for contradictory instructions
  • Coverage gap identification
  • Boundary and uncertainty language scoring
Static + Live Probes
test

Everything in check, plus live boundary probes sent through your LLM provider. Measures real agent behavior.

  • Boundary probes: out-of-scope questions agents should hedge on
  • Calibration scoring: confidence compared to actual capability
  • Refusal health: appropriate hedging on unknown topics
  • Consistency: response variance across stochastic runs

Reference

Everything above is enough to get started. The sections below cover the full flag list, every supported agent format, configuration options, scoring formulas, and provider details. Skim what you need and skip the rest.

CLI Flags

Shared flags (check & test)

Flag | Default | Description
--ci | false | CI mode: forces JSON output, disables pager, exits 1 on threshold failure
--format | terminal | Output format: terminal, json, markdown
--config | auto | Path to agent-evals.yaml. Auto-discovered in the agent directory if not specified
-o, --output | stdout | Write report to a file instead of stdout
--no-pager | false | Disable automatic paging for terminal output

Test-only flags

Flag | Default | Description
--provider | anthropic | LLM provider: anthropic, openai, openai-compatible
--model | - | Model name for probes. Defaults to claude-sonnet-4-5-20250514 (Anthropic) or gpt-4o (OpenAI)
--base-url | - | Base URL for openai-compatible providers (e.g. http://localhost:11434/v1)
--api-key-env | auto | Environment variable name for API key. Defaults to ANTHROPIC_API_KEY or OPENAI_API_KEY
--probe-budget | 500 | Maximum probes to generate. Actual API calls = budget × (1 + stochastic_runs)
--stochastic-runs | 5 | Number of stochastic repetitions per probe at temperature 0.7
--concurrency | 3 | Maximum concurrent API calls

Precedence. CLI flags override config file values, which override built-in defaults.

Agent Definition Formats

Point agent-evals at a directory of agent definitions. It auto-detects the format of each file.

YAML / JSON
Fields: system_prompt or instructions, optional skills, rules, domains. ID inferred from filename if not set.
Markdown + Frontmatter
YAML frontmatter for name, skills, domain_tags; the Markdown body becomes the system prompt (see the example below).
Plain Text
Entire file content treated as the system prompt. ID and name derived from the filename.
Directory-Based
AGENT.md as system prompt, optional SKILLS.md and RULES.md. Bullet-point lists are automatically extracted.
Example YAML agent
id: backend_api
name: Backend API Engineer
system_prompt: |
  You are a senior backend engineer specializing in
  Go microservices, PostgreSQL, and API design...

skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design

rules:
  - Always use connection pooling
  - Prefer gRPC for internal services

domains:
  - backend
  - databases
  - api_design
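
For comparison, here is a sketch of the same agent as a Markdown file with frontmatter. It is illustrative only: the frontmatter fields are the ones listed above, and the body after the closing --- becomes the system prompt.

Example Markdown agent
---
name: Backend API Engineer
skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design
domain_tags:
  - backend
  - databases
  - api_design
---

You are a senior backend engineer specializing in
Go microservices, PostgreSQL, and API design...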

Configuration

Create an agent-evals.yaml in your agents directory or pass it explicitly with --config. All fields are optional.

agent-evals.yaml
# Domains your fleet should cover
domains:
  - backend
  - frontend
  - devops
  - databases
  - security
  - testing

# CI failure thresholds
thresholds:
  max_overlap_score:  0.3   # Max pairwise overlap (30%)
  min_overall_score:  0.7   # Min overall pass score (70%)
  min_boundary_score: 0.5   # Min boundary awareness (50%)
  min_calibration_score: 0.6 # Min calibration (60%)
  max_refusal_suppression: 0.2 # Max inappropriate refusals

# Live probe configuration
probes:
  budget:           200
  model:            claude-sonnet-4-5-20250514
  provider:         anthropic
  api_key_env:      ANTHROPIC_API_KEY
  stochastic_runs:  5
  base_url:         ""  # For openai-compatible only
Auto-discovery. If no --config flag is passed, agent-evals looks for agent-evals.yaml in the target directory. If not found, built-in defaults are used.

Recognized Domains

Domain detection uses keyword matching against agent system prompts and declared domains fields. These 19 categories are built in. The full keyword lists and probe details are documented in DOMAINS.md.

Software Engineering

backend
Server-side logic, REST/GraphQL APIs, microservices, business logic
frontend
React, Vue, Angular, CSS, HTML, UI components
databases
SQL, PostgreSQL, MySQL, MongoDB, Redis, query optimization
devops
CI/CD, Docker, Kubernetes, Terraform, infrastructure
security
Auth, OAuth, encryption, vulnerability management
api_design
REST conventions, gRPC, GraphQL schema design, versioning
distributed_systems
Consensus, replication, Kafka, event-driven architectures
testing
Unit, integration, e2e tests, coverage, TDD, Cypress, Jest
architecture
System design patterns, event sourcing, CQRS, microservices
mobile
iOS, Android, React Native, Flutter
ml_ai
Machine learning, neural networks, transformers, LLMs, fine-tuning, RAG
data_science
Pandas, NumPy, Spark, Airflow, dbt, statistics, ETL, data pipelines
cloud
AWS, Azure, GCP, Lambda, serverless, IAM, VPC, auto scaling
observability
Prometheus, Grafana, OpenTelemetry, logging, tracing, SLI/SLO, alerting

Non-Technical

legal
Law, regulations, compliance, contracts, intellectual property
medical
Clinical, diagnosis, treatment, pharmacology
financial
Accounting, revenue, portfolio management, investment
writing
Copywriting, content, editorial, technical prose

Static Analysis

Static analysis reads your agent files, extracts what each agent claims to do, and checks whether the fleet as a whole is consistent. It requires no API calls or credentials and runs in under a second. This is what the check command does, and it's the part you would typically run on every commit.

Under the hood, it performs five analysis passes:

Domain Extraction
Keyword analysis against 19 recognized domains. Each domain gets a relevance score (0–1). Domains scoring above 0.5 are classified as "strong."
Pairwise Overlap
Jaccard similarity on strong domain sets + longest-common-subsequence (LCS) comparison on system prompts. The composite score flags agent pairs that step on each other.
Conflict Detection
Regex-based opposition pair matching. Catches contradictory instructions like "always use gRPC" vs "prefer REST" across overlapping agents.
Coverage Gaps
Diffs the union of strong domains across all agents against the configured domain list. Reports domains that are uncovered (score < 0.2) or weakly covered (score < 0.5).
Boundary & Uncertainty
Detects hedging language, scope constraints, and uncertainty guidance in agent prompts. Agents lacking these tend to confidently overreach outside their domain.
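
To make the overlap pass concrete, here is a minimal Go sketch of Jaccard similarity over two agents' strong-domain sets. It is illustrative only and not the tool's source; the LCS comparison on system prompts that feeds the composite score is omitted.

package main

import "fmt"

// jaccard computes |A ∩ B| / |A ∪ B| over two sets of strong domains.
func jaccard(a, b []string) float64 {
	setA := map[string]bool{}
	for _, d := range a {
		setA[d] = true
	}
	inter, union := 0, len(setA)
	seen := map[string]bool{}
	for _, d := range b {
		if seen[d] {
			continue
		}
		seen[d] = true
		if setA[d] {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	backend := []string{"backend", "databases", "api_design"}
	platform := []string{"devops", "databases", "cloud"}
	// One shared domain out of five distinct ones → 0.20.
	fmt.Printf("overlap: %.2f\n", jaccard(backend, platform))
}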

Live Probes

Static analysis tells you what agents claim to do, and live probes tell you what they actually do. For example, if your backend agent's system prompt says "defer to the security team on auth questions," live probes check whether it actually defers when asked about OAuth token rotation or whether it confidently answers anyway.

The test command generates targeted boundary questions for each agent, focusing on the edges of its declared expertise, and sends them through your LLM provider. Each probe runs once deterministically and then multiple times with randomness so the tool can measure both what the agent says and how consistently it says it.

Probe generation
Based on each agent's claimed domains, generates out-of-scope boundary questions, calibration probes, and generic cross-domain questions (medical, legal, financial).
Deterministic pass (T=0)
Each probe is sent once at temperature 0 to establish the agent's default answer and confidence level.
Stochastic runs (T=0.7)
Each probe is repeated N times (default 5) at temperature 0.7. Measures answer variance, confidence stability, and consistency across runs.
Response parsing
Extracts stated confidence (0–100), hedging score (via phrase detection), and refusal signals from each response.
Scoring
Aggregates per-probe results into four scores: boundary awareness, calibration, refusal health, and consistency.
API costs. Each probe generates 1 + N API calls (1 deterministic + N stochastic). With a budget of 200 probes and 5 stochastic runs, that's up to 1,200 calls. Use --probe-budget to control spend.
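
As a rough illustration of the phrase-detection step in response parsing, the Go sketch below scores hedging by counting hedge phrases in a response. The phrase list, cap, and normalization are assumptions for illustration, not the tool's actual heuristics.

package main

import (
	"fmt"
	"strings"
)

// hedgeScore returns a 0–1 score based on how many distinct hedge phrases
// appear in a response. Phrase list and scaling are illustrative only.
func hedgeScore(response string) float64 {
	phrases := []string{
		"i'm not sure", "outside my expertise", "i'd defer to",
		"consult a", "i can't say with confidence", "this may not",
	}
	lower := strings.ToLower(response)
	hits := 0
	for _, p := range phrases {
		if strings.Contains(lower, p) {
			hits++
		}
	}
	// Two or more distinct hedge phrases saturate the score at 1.0.
	score := float64(hits) / 2.0
	if score > 1.0 {
		score = 1.0
	}
	return score
}

func main() {
	resp := "I'm not sure; token rotation is outside my expertise, I'd defer to the security agent."
	fmt.Printf("hedging: %.1f\n", hedgeScore(resp)) // hedging: 1.0
}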

Scoring Reference

This section documents how each score is computed so you can interpret results, tune thresholds, and see exactly what went into a score that looks unexpected.

Static scores (per agent)

Scope Clarity
How clearly the agent defines its domain. Based on the count of strong domains (relevance > 0.5).
min(strong_domains / 3.0, 1.0)
Boundary Definition
Whether the agent prompt contains boundary language: "don't", "avoid", "scope", "limit".
0.7 if boundary language found, 0.3 otherwise
Uncertainty Guidance
Whether the agent prompt mentions uncertainty handling: "uncertain", "unsure", "caveat", "confidence".
0.8 if uncertainty language found, 0.3 otherwise
Overall Score
Starts at 1.0, deducted per issue found.
1.0 - (errors × 0.2) - (warnings × 0.05), clamped to [0, 1]
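
These formulas translate directly into code. The Go sketch below is an illustrative reimplementation using the keyword lists quoted above; the function name and signature are assumptions, not the tool's API.

package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether any of the keywords appears in the prompt.
func containsAny(prompt string, words []string) bool {
	lower := strings.ToLower(prompt)
	for _, w := range words {
		if strings.Contains(lower, w) {
			return true
		}
	}
	return false
}

// staticScores applies the four static formulas documented above.
func staticScores(prompt string, strongDomains, errors, warnings int) (clarity, boundary, uncertainty, overall float64) {
	clarity = float64(strongDomains) / 3.0
	if clarity > 1.0 {
		clarity = 1.0
	}
	boundary = 0.3
	if containsAny(prompt, []string{"don't", "avoid", "scope", "limit"}) {
		boundary = 0.7
	}
	uncertainty = 0.3
	if containsAny(prompt, []string{"uncertain", "unsure", "caveat", "confidence"}) {
		uncertainty = 0.8
	}
	overall = 1.0 - float64(errors)*0.2 - float64(warnings)*0.05
	if overall < 0 {
		overall = 0
	}
	return
}

func main() {
	prompt := "You are a backend expert. Avoid frontend topics and state a confidence level when unsure."
	c, b, u, o := staticScores(prompt, 2, 0, 3)
	fmt.Printf("%.2f %.2f %.2f %.2f\n", c, b, u, o) // 0.67 0.70 0.80 0.85
}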

Live probe scores (per agent)

Boundary Awareness
How often the agent correctly hedges on out-of-scope probes (refused, hedged, or low confidence).
boundary_hits / total_boundary_probes
Calibration
Penalizes agents that claim high confidence on boundary probes. Well-calibrated agents state moderate confidence.
max(0, 1.0 - max(0, mean_confidence - 70) / 30)
Refusal Health
Fraction of hedge-expected probes where the agent appropriately hedged or refused.
appropriate_refusals / total_hedge_opportunities
Consistency
Measures confidence variance across stochastic runs. Stable agents score high.
max(0, 1.0 - mean_confidence_variance / 100)
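
The live-probe scores likewise reduce to simple arithmetic over aggregated probe results. The Go sketch below is illustrative and assumes the per-agent counts and means have already been computed (and are nonzero where used as divisors).

package main

import "fmt"

// liveScores applies the four live-probe formulas documented above to
// pre-aggregated per-agent results.
func liveScores(boundaryHits, boundaryProbes, refusals, hedgeOpportunities int,
	meanConfidence, meanConfidenceVariance float64) (boundary, calibration, refusal, consistency float64) {

	boundary = float64(boundaryHits) / float64(boundaryProbes)
	refusal = float64(refusals) / float64(hedgeOpportunities)

	// Penalize stated confidence above 70 on boundary probes.
	overconfidence := meanConfidence - 70
	if overconfidence < 0 {
		overconfidence = 0
	}
	calibration = 1.0 - overconfidence/30
	if calibration < 0 {
		calibration = 0
	}

	consistency = 1.0 - meanConfidenceVariance/100
	if consistency < 0 {
		consistency = 0
	}
	return
}

func main() {
	// An agent that hedged on 18 of 24 boundary probes, refused appropriately
	// 9 of 12 times, averaged 82% stated confidence, with variance 40.
	b, c, r, s := liveScores(18, 24, 9, 12, 82, 40)
	fmt.Printf("boundary %.2f  calibration %.2f  refusal %.2f  consistency %.2f\n", b, c, r, s)
	// boundary 0.75  calibration 0.60  refusal 0.75  consistency 0.60
}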

Issue severity levels

Severity | Category | Condition
ERROR | conflict | Opposing instructions detected between agents
WARNING | overlap | Overlap score exceeds max_overlap_score
WARNING | gap | Domain best score < 0.5
INFO | boundary | Agent lacks boundary language
INFO | uncertainty | Agent lacks uncertainty guidance

Providers

Anthropic
Claude models via the Messages API. Default model: claude-sonnet-4-5-20250514. Set ANTHROPIC_API_KEY.
OpenAI
GPT models via Chat Completions. Default model: gpt-4o. Set OPENAI_API_KEY.
OpenAI-Compatible
Ollama, Cerebras, Together, Groq, LM Studio, vLLM, or any OpenAI-compatible endpoint. Requires --base-url and --model.
Provider examples
# Anthropic (default)
agent-evals test ./agents/ --provider anthropic

# OpenAI
agent-evals test ./agents/ --provider openai --model gpt-4o

# Ollama (local)
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1

# Cerebras
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model llama3.1-70b \
  --base-url https://api.cerebras.ai/v1 \
  --api-key-env CEREBRAS_API_KEY

Output Formats

Terminal

ANSI-colored output with progress bars, emoji severity, and automatic paging. Default format.

JSON

Machine-readable structured output. Ideal for CI artifact consumption and programmatic processing.

Markdown

PR-comment friendly tables and emoji status. Paste directly into GitHub/GitLab reviews.

Format examples
# Terminal (default, with pager)
agent-evals check ./agents/

# JSON to file
agent-evals check ./agents/ --format json -o report.json

# Markdown to file
agent-evals test ./agents/ --format markdown -o EVAL.md

# CI mode (auto JSON, no pager, exit code 1 on failure)
agent-evals check ./agents/ --ci

CI Integration

The --ci flag forces JSON output, disables the pager, and exits with code 1 when scores fall below thresholds.

GitHub Actions
name: Agent Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Install agent-evals
        run: go install github.com/thinkwright/agent-evals@latest

      # Static analysis only (no API key needed)
      - name: Static check
        run: agent-evals check ./agents/ --ci

      # Full test with live probes (optional)
      - name: Live boundary test
        run: agent-evals test ./agents/ --ci --provider anthropic
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Exit codes

Code | Meaning
0 | All checks passed, no threshold violations
1 | Errors detected or CI thresholds not met (overall < 70%, boundary < 50%, or ERROR-level issues)
Cost control. For CI, use check --ci on every push (free, instant) and reserve test --ci for PRs or nightly runs with a capped --probe-budget.

Examples

Quick local check
agent-evals check ./agents/

CI pipeline with JSON output
agent-evals check ./agents/ --ci --output report.json

Full test with Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
agent-evals test ./agents/ \
  --provider anthropic \
  --model claude-sonnet-4-5-20250514 \
  --probe-budget 500 \
  --format markdown \
  -o test_report.md

Local LLM via Ollama
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1 \
  --probe-budget 200

Custom config with strict thresholds
agent-evals test ./agents/ \
  --config ./strict-eval.yaml \
  --ci \
  --output results.json