Documentation

Everything you need to evaluate, test, and maintain your LLM coding agent configurations. From first install to CI integration.

What is agent-evals?

Working with a single LLM agent means managing one system prompt. Working with five or fifteen means managing a fleet, and fleets develop problems that individual agents don't have. Two agents might silently claim the same domain and give contradictory advice. An agent might confidently answer questions outside its scope because its prompt says "you are an expert" without saying where that expertise ends. An entire category of questions might fall between agents with no clear owner.

These problems tend to be invisible until a user runs into them. There's no built-in linter for system prompts and no test suite for whether an agent knows its own boundaries. agent-evals fills that gap. It reads your agent definitions in YAML, JSON, Markdown, or other formats and runs two kinds of analysis:

Static analysis catches structural issues with no API calls needed. It checks for scope overlap between agent pairs, coverage gaps where no agent owns a domain, missing boundary language, and contradictory instructions. It works like a linter for your agent fleet.

Live probes take things a step further by generating boundary questions tailored to each agent, focusing on topics the agent should hedge on or refuse, and sending them through your LLM provider. The tool then measures whether the agent actually hedges, how confident it claims to be, and whether its responses stay consistent across repeated runs. This is how you verify that your boundary definitions actually hold up at inference time.

The output is a report you can read in your terminal, pipe into CI as JSON, or paste into a PR as Markdown. Static checks are free and fast enough to run on every commit. Live probes are better suited to PRs or nightly runs, and budgets are configurable so you stay in control of cost.


Quick Start

Install
# Go install
go install github.com/thinkwright/agent-evals@latest

# Or Homebrew
brew install thinkwright/tap/agent-evals

Static analysis (no API key needed)
agent-evals check ./agents/

Full test with live probes
export ANTHROPIC_API_KEY=sk-ant-...
agent-evals test ./agents/ --provider anthropic

Commands

Static Analysis
check

Reads agent definitions from disk, extracts domains, computes overlap, flags conflicts and gaps. No API calls, no credentials required.

  • Domain extraction across 19 recognized categories
  • Pairwise overlap via Jaccard + LCS
  • Conflict detection for contradictory instructions
  • Coverage gap identification
  • Boundary and uncertainty language scoring
Static + Live Probes
test

Everything in check, plus live boundary probes sent through your LLM provider. Measures real agent behavior.

  • Boundary probes: out-of-scope questions agents should hedge on
  • Calibration scoring: confidence compared to actual capability
  • Refusal health: appropriate hedging on unknown topics
  • Consistency: response variance across stochastic runs

Reference

Everything above is enough to get started. The sections below cover the full flag list, every supported agent format, configuration options, scoring formulas, and provider details. Skim what you need and skip the rest.

CLI Flags

Shared flags (check & test)

Flag | Default | Description
--ci | false | CI mode: forces JSON output, disables pager, exits 1 on threshold failure
--format | terminal | Output format: terminal, json, markdown
--config | auto | Path to agent-evals.yaml. Auto-discovered in the agent directory if not specified
-o, --output | stdout | Write report to a file instead of stdout
--no-pager | false | Disable automatic paging for terminal output

Test-only flags

Flag | Default | Description
--provider | anthropic | LLM provider: anthropic, openai, openai-compatible
--model | - | Model name for probes. Defaults to claude-sonnet-4-5-20250514 (Anthropic) or gpt-4o (OpenAI)
--base-url | - | Base URL for openai-compatible providers (e.g. http://localhost:11434/v1)
--api-key-env | auto | Environment variable name for API key. Defaults to ANTHROPIC_API_KEY or OPENAI_API_KEY
--probe-budget | 500 | Maximum probes to generate. Actual API calls = budget × (1 + stochastic_runs)
--stochastic-runs | 5 | Number of stochastic repetitions per probe at temperature 0.7
--concurrency | 3 | Maximum concurrent API calls

Precedence. CLI flags override config file values, which override built-in defaults.

Agent Definition Formats

Point agent-evals at a directory of agent definitions. It auto-detects the format of each file.

YAML / JSON
Fields: system_prompt or instructions, optional skills, rules, domains. ID inferred from filename if not set.
Markdown + Frontmatter
YAML frontmatter for name, skills, domain_tags; the Markdown body becomes the system prompt (see the example below).
Plain Text
Entire file content treated as the system prompt. ID and name derived from the filename.
Directory-Based
AGENT.md as system prompt, optional SKILLS.md and RULES.md. Bullet-point lists are automatically extracted.
Example YAML agent
id: backend_api
name: Backend API Engineer
system_prompt: |
  You are a senior backend engineer specializing in
  Go microservices, PostgreSQL, and API design...

skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design

rules:
  - Always use connection pooling
  - Prefer gRPC for internal services

domains:
  - backend
  - databases
  - api_design
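
For comparison, here is a sketch of the same agent as a Markdown file with frontmatter. It is illustrative only: the frontmatter fields are the ones listed above, and the body after the closing --- becomes the system prompt.

Example Markdown agent
---
name: Backend API Engineer
skills:
  - PostgreSQL optimization
  - Go microservices
  - gRPC and REST API design
domain_tags:
  - backend
  - databases
  - api_design
---

You are a senior backend engineer specializing in
Go microservices, PostgreSQL, and API design...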

Configuration

Create an agent-evals.yaml in your agents directory or pass it explicitly with --config. All fields are optional.

agent-evals.yaml
# Domains your fleet should cover
domains:
  - backend
  - frontend
  - devops
  - databases
  - security
  - testing

# CI failure thresholds
thresholds:
  max_overlap_score:  0.3   # Max pairwise overlap (30%)
  min_overall_score:  0.7   # Min overall pass score (70%)
  min_boundary_score: 0.5   # Min boundary awareness (50%)
  min_calibration_score: 0.6 # Min calibration (60%)
  max_refusal_suppression: 0.2 # Max inappropriate refusals

# Live probe configuration
probes:
  budget:           200
  model:            claude-sonnet-4-5-20250514
  provider:         anthropic
  api_key_env:      ANTHROPIC_API_KEY
  stochastic_runs:  5
  base_url:         ""  # For openai-compatible only
Auto-discovery. If no --config flag is passed, agent-evals looks for agent-evals.yaml in the target directory. If not found, built-in defaults are used.

Recognized Domains

Domain detection uses keyword matching against agent system prompts and declared domains fields. These 19 categories are built in. The full keyword lists and probe details are documented in DOMAINS.md.

Software Engineering

backend
Server-side logic, REST/GraphQL APIs, microservices, business logic
frontend
React, Vue, Angular, CSS, HTML, UI components
databases
SQL, PostgreSQL, MySQL, MongoDB, Redis, query optimization
devops
CI/CD, Docker, Kubernetes, Terraform, infrastructure
security
Auth, OAuth, encryption, vulnerability management
api_design
REST conventions, gRPC, GraphQL schema design, versioning
distributed_systems
Consensus, replication, Kafka, event-driven architectures
testing
Unit, integration, e2e tests, coverage, TDD, Cypress, Jest
architecture
System design patterns, event sourcing, CQRS, microservices
mobile
iOS, Android, React Native, Flutter
ml_ai
Machine learning, neural networks, transformers, LLMs, fine-tuning, RAG
data_science
Pandas, NumPy, Spark, Airflow, dbt, statistics, ETL, data pipelines
cloud
AWS, Azure, GCP, Lambda, serverless, IAM, VPC, auto scaling
observability
Prometheus, Grafana, OpenTelemetry, logging, tracing, SLI/SLO, alerting

Non-Technical

legal
Law, regulations, compliance, contracts, intellectual property
medical
Clinical, diagnosis, treatment, pharmacology
financial
Accounting, revenue, portfolio management, investment
writing
Copywriting, content, editorial, technical prose

Static Analysis

Static analysis reads your agent files, extracts what each agent claims to do, and checks whether the fleet as a whole is consistent. It requires no API calls or credentials and runs in under a second. This is what the check command does, and it's the part you would typically run on every commit.

Under the hood, it performs five analysis passes:

Domain Extraction
Keyword analysis against 19 recognized domains. Each domain gets a relevance score (0–1). Domains scoring above 0.5 are classified as "strong."
Pairwise Overlap
Jaccard similarity on strong domain sets + longest-common-subsequence (LCS) comparison on system prompts. The composite score flags agent pairs that step on each other.
Conflict Detection
Regex-based opposition pair matching. Catches contradictory instructions like "always use gRPC" vs "prefer REST" across overlapping agents.
Coverage Gaps
Diffs the union of strong domains across all agents against the configured domain list. Reports domains that are uncovered (score < 0.2) or weakly covered (score < 0.5).
Boundary & Uncertainty
Detects hedging language, scope constraints, and uncertainty guidance in agent prompts. Agents lacking these tend to confidently overreach outside their domain.
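
To make the overlap pass concrete, here is a minimal Go sketch of Jaccard similarity over two agents' strong-domain sets. It is illustrative only and not the tool's source; the LCS comparison on system prompts that feeds the composite score is omitted.

package main

import "fmt"

// jaccard computes |A ∩ B| / |A ∪ B| over two sets of strong domains.
func jaccard(a, b []string) float64 {
	setA := map[string]bool{}
	for _, d := range a {
		setA[d] = true
	}
	inter, union := 0, len(setA)
	seen := map[string]bool{}
	for _, d := range b {
		if seen[d] {
			continue
		}
		seen[d] = true
		if setA[d] {
			inter++
		} else {
			union++
		}
	}
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	backend := []string{"backend", "databases", "api_design"}
	platform := []string{"devops", "databases", "cloud"}
	// One shared domain out of five distinct ones → 0.20.
	fmt.Printf("overlap: %.2f\n", jaccard(backend, platform))
}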

Live Probes

Static analysis tells you what agents claim to do, and live probes tell you what they actually do. For example, if your backend agent's system prompt says "defer to the security team on auth questions," live probes check whether it actually defers when asked about OAuth token rotation or whether it confidently answers anyway.

The test command generates targeted boundary questions for each agent, focusing on the edges of its declared expertise, and sends them through your LLM provider. Each probe runs once deterministically and then multiple times with randomness so the tool can measure both what the agent says and how consistently it says it.

Probe generation
Based on each agent's claimed domains, generates out-of-scope boundary questions, calibration probes, and generic cross-domain questions (medical, legal, financial).
Deterministic pass (T=0)
Each probe is sent once at temperature 0 to establish the agent's default answer and confidence level.
Stochastic runs (T=0.7)
Each probe is repeated N times (default 5) at temperature 0.7. Measures answer variance, confidence stability, and consistency across runs.
Response parsing
Extracts stated confidence (0–100), hedging score (via phrase detection), and refusal signals from each response.
Scoring
Aggregates per-probe results into four scores: boundary awareness, calibration, refusal health, and consistency.
API costs. Each probe generates 1 + N API calls (1 deterministic + N stochastic). With a budget of 200 probes and 5 stochastic runs, that's up to 1,200 calls. Use --probe-budget to control spend.
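
As a rough illustration of the phrase-detection step in response parsing, the Go sketch below scores hedging by counting hedge phrases in a response. The phrase list, cap, and normalization are assumptions for illustration, not the tool's actual heuristics.

package main

import (
	"fmt"
	"strings"
)

// hedgeScore returns a 0–1 score based on how many distinct hedge phrases
// appear in a response. Phrase list and scaling are illustrative only.
func hedgeScore(response string) float64 {
	phrases := []string{
		"i'm not sure", "outside my expertise", "i'd defer to",
		"consult a", "i can't say with confidence", "this may not",
	}
	lower := strings.ToLower(response)
	hits := 0
	for _, p := range phrases {
		if strings.Contains(lower, p) {
			hits++
		}
	}
	// Two or more distinct hedge phrases saturate the score at 1.0.
	score := float64(hits) / 2.0
	if score > 1.0 {
		score = 1.0
	}
	return score
}

func main() {
	resp := "I'm not sure; token rotation is outside my expertise, I'd defer to the security agent."
	fmt.Printf("hedging: %.1f\n", hedgeScore(resp)) // hedging: 1.0
}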

Scoring Reference

This section documents how each score is computed so you can interpret results, tune thresholds, and see exactly what went into a score that looks unexpected.

Static scores (per agent)

Scope Clarity
How clearly the agent defines its domain. Based on the count of strong domains (relevance > 0.5).
min(strong_domains / 3.0, 1.0)
Boundary Definition
Whether the agent prompt contains boundary language: "don't", "avoid", "scope", "limit".
0.7 if boundary language found, 0.3 otherwise
Uncertainty Guidance
Whether the agent prompt mentions uncertainty handling: "uncertain", "unsure", "caveat", "confidence".
0.8 if uncertainty language found, 0.3 otherwise
Overall Score
Starts at 1.0, deducted per issue found.
1.0 - (errors × 0.2) - (warnings × 0.05), clamped to [0, 1]
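
These formulas translate directly into code. The Go sketch below is an illustrative reimplementation using the keyword lists quoted above; the function name and signature are assumptions, not the tool's API.

package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether any of the keywords appears in the prompt.
func containsAny(prompt string, words []string) bool {
	lower := strings.ToLower(prompt)
	for _, w := range words {
		if strings.Contains(lower, w) {
			return true
		}
	}
	return false
}

// staticScores applies the four static formulas documented above.
func staticScores(prompt string, strongDomains, errors, warnings int) (clarity, boundary, uncertainty, overall float64) {
	clarity = float64(strongDomains) / 3.0
	if clarity > 1.0 {
		clarity = 1.0
	}
	boundary = 0.3
	if containsAny(prompt, []string{"don't", "avoid", "scope", "limit"}) {
		boundary = 0.7
	}
	uncertainty = 0.3
	if containsAny(prompt, []string{"uncertain", "unsure", "caveat", "confidence"}) {
		uncertainty = 0.8
	}
	overall = 1.0 - float64(errors)*0.2 - float64(warnings)*0.05
	if overall < 0 {
		overall = 0
	}
	return
}

func main() {
	prompt := "You are a backend expert. Avoid frontend topics and state a confidence level when unsure."
	c, b, u, o := staticScores(prompt, 2, 0, 3)
	fmt.Printf("%.2f %.2f %.2f %.2f\n", c, b, u, o) // 0.67 0.70 0.80 0.85
}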

Live probe scores (per agent)

Boundary Awareness
How often the agent correctly hedges on out-of-scope probes (refused, hedged, or low confidence).
boundary_hits / total_boundary_probes
Calibration
Penalizes agents that claim high confidence on boundary probes. Well-calibrated agents state moderate confidence.
max(0, 1.0 - max(0, mean_confidence - 70) / 30)
Refusal Health
Fraction of hedge-expected probes where the agent appropriately hedged or refused.
appropriate_refusals / total_hedge_opportunities
Consistency
Measures confidence variance across stochastic runs. Stable agents score high.
max(0, 1.0 - mean_confidence_variance / 100)
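
The live-probe scores likewise reduce to simple arithmetic over aggregated probe results. The Go sketch below is illustrative and assumes the per-agent counts and means have already been computed (and are nonzero where used as divisors).

package main

import "fmt"

// liveScores applies the four live-probe formulas documented above to
// pre-aggregated per-agent results.
func liveScores(boundaryHits, boundaryProbes, refusals, hedgeOpportunities int,
	meanConfidence, meanConfidenceVariance float64) (boundary, calibration, refusal, consistency float64) {

	boundary = float64(boundaryHits) / float64(boundaryProbes)
	refusal = float64(refusals) / float64(hedgeOpportunities)

	// Penalize stated confidence above 70 on boundary probes.
	overconfidence := meanConfidence - 70
	if overconfidence < 0 {
		overconfidence = 0
	}
	calibration = 1.0 - overconfidence/30
	if calibration < 0 {
		calibration = 0
	}

	consistency = 1.0 - meanConfidenceVariance/100
	if consistency < 0 {
		consistency = 0
	}
	return
}

func main() {
	// An agent that hedged on 18 of 24 boundary probes, refused appropriately
	// 9 of 12 times, averaged 82% stated confidence, with variance 40.
	b, c, r, s := liveScores(18, 24, 9, 12, 82, 40)
	fmt.Printf("boundary %.2f  calibration %.2f  refusal %.2f  consistency %.2f\n", b, c, r, s)
	// boundary 0.75  calibration 0.60  refusal 0.75  consistency 0.60
}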

Issue severity levels

Severity | Category | Condition
ERROR | conflict | Opposing instructions detected between agents
WARNING | overlap | Overlap score exceeds max_overlap_score
WARNING | gap | Domain best score < 0.5
INFO | boundary | Agent lacks boundary language
INFO | uncertainty | Agent lacks uncertainty guidance

Providers

Anthropic
Claude models via the Messages API. Default model: claude-sonnet-4-5-20250514. Set ANTHROPIC_API_KEY.
OpenAI
GPT models via Chat Completions. Default model: gpt-4o. Set OPENAI_API_KEY.
OpenAI-Compatible
Ollama, Cerebras, Together, Groq, LM Studio, vLLM, or any OpenAI-compatible endpoint. Requires --base-url and --model.
Provider examples
# Anthropic (default)
agent-evals test ./agents/ --provider anthropic

# OpenAI
agent-evals test ./agents/ --provider openai --model gpt-4o

# Ollama (local)
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1

# Cerebras
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model llama3.1-70b \
  --base-url https://api.cerebras.ai/v1 \
  --api-key-env CEREBRAS_API_KEY

Output Formats

Terminal

ANSI-colored output with progress bars, emoji severity, and automatic paging. Default format.

JSON

Machine-readable structured output. Ideal for CI artifact consumption and programmatic processing.

Markdown

PR-comment friendly tables and emoji status. Paste directly into GitHub/GitLab reviews.

Format examples
# Terminal (default, with pager)
agent-evals check ./agents/

# JSON to file
agent-evals check ./agents/ --format json -o report.json

# Markdown to file
agent-evals test ./agents/ --format markdown -o EVAL.md

# CI mode (auto JSON, no pager, exit code 1 on failure)
agent-evals check ./agents/ --ci

CI Integration

The --ci flag forces JSON output, disables the pager, and exits with code 1 when scores fall below thresholds.

GitHub Actions
name: Agent Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Install agent-evals
        run: go install github.com/thinkwright/agent-evals@latest

      # Static analysis only (no API key needed)
      - name: Static check
        run: agent-evals check ./agents/ --ci

      # Full test with live probes (optional)
      - name: Live boundary test
        run: agent-evals test ./agents/ --ci --provider anthropic
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Exit codes

Code | Meaning
0 | All checks passed, no threshold violations
1 | Errors detected or CI thresholds not met (overall < 70%, boundary < 50%, or ERROR-level issues)
Cost control. For CI, use check --ci on every push (free, instant) and reserve test --ci for PRs or nightly runs with a capped --probe-budget.

Examples

Quick local check
agent-evals check ./agents/

CI pipeline with JSON output
agent-evals check ./agents/ --ci --output report.json

Full test with Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
agent-evals test ./agents/ \
  --provider anthropic \
  --model claude-sonnet-4-5-20250514 \
  --probe-budget 500 \
  --format markdown \
  -o test_report.md

Local LLM via Ollama
agent-evals test ./agents/ \
  --provider openai-compatible \
  --model mistral:7b \
  --base-url http://localhost:11434/v1 \
  --probe-budget 200

Custom config with strict thresholds
agent-evals test ./agents/ \
  --config ./strict-eval.yaml \
  --ci \
  --output results.json