Live Probe Scoring
Live probes are scored on four dimensions:
Boundary awareness
How often the agent correctly hedges on out-of-scope questions
Calibration
Whether stated confidence matches demonstrated accuracy
Refusal health
Whether the agent says "I don't know" when it should
Consistency
How stable responses are across repeated stochastic runs
Agent Definition Formats
YAML / JSON
Fields: system_prompt or instructions, plus optional skills, rules, claimed_domains (see the sketch after this list)
Markdown with Frontmatter
YAML frontmatter for metadata, body becomes the system prompt. Supports name, skills, domain_tags
Plain Text
Entire file content treated as the system prompt. ID and name derived from filename
Directory-Based
AGENT.md + optional SKILLS.md and RULES.md. Lists extracted from markdown bullet points
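For illustration, here is a sketch of the first two formats. The field names come from the list above, but the exact schema is an assumption worth checking against your agent-evals version. A minimal YAML definition:
system_prompt: |
  You review backend services and database schemas. Hedge or refuse
  questions outside backend, databases, and security.
skills:
  - code review
claimed_domains:
  - backend
  - databases
And a similar agent as Markdown with frontmatter, where the body becomes the system prompt:
---
name: backend-reviewer
skills:
  - code review
domain_tags:
  - backend
  - databases
---
You review backend services and database schemas. Hedge or refuse questions outside backend and databases.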
Static Analysis Categories
Domain Extraction
Keyword analysis across 19 recognized domains: backend, frontend, devops, databases, security, cloud, observability, data science, and more
Pairwise Overlap
Jaccard similarity on domain sets plus LCS-based prompt comparison. A composite score flags agent pairs that step on each other (worked example below)
Conflict Detection
Regex-based opposition pair matching that catches contradictory instructions like "always use gRPC" vs "prefer REST"
Coverage Gaps
Diffs the union of claimed domains against recognized categories. Reports uncovered and weakly-covered areas
Boundary & Uncertainty
Detects hedging language, scope constraints, and uncertainty guidance in agent prompts. Agents without these confidently overreach
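To make the overlap math concrete: if one agent's prompt maps to the domains backend, databases, and security, and another's maps to backend, devops, and databases, the two share 2 of 4 distinct domains, a Jaccard similarity of 0.5 on the domain-set component alone, before LCS prompt similarity is folded into the composite score.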
Default CI Thresholds
max_overlap_score 0.3 (30%)
min_calibration_score 0.6 (60%)
min_boundary_score 0.5 (50%)
max_refusal_suppression 0.2 (20%)
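These defaults can be overridden in agent-evals.yaml. A sketch, assuming the threshold keys sit at the top level of the file:
max_overlap_score: 0.3
min_calibration_score: 0.6
min_boundary_score: 0.5
max_refusal_suppression: 0.2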
Install
Install agent-evals
Go Install
go install github.com/thinkwright/agent-evals@latest
Then run:
agent-evals check ./agents/
Homebrew
brew install thinkwright/tap/agent-evals
Binary Download
curl -fsSL https://github.com/thinkwright/agent-evals/releases/latest/download/agent-evals_$(uname -s)_$(uname -m).tar.gz | tar xz
Downloads the pre-built binary for your platform.
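Assuming the archive unpacks a single agent-evals binary, move it onto your PATH, for example:
sudo mv agent-evals /usr/local/bin/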

Agents Overlap

Claude Code
Cline
Cursor
Augment
Windsurf
Copilot
Aider
Custom YAML / JSON / Markdown

Detect scope overlap between agents.
Score boundary awareness and calibration.
Find coverage gaps across your fleet.
Test with live LLM probes or static analysis alone.

Scope Overlap
Pairwise Jaccard similarity on domain sets plus LCS-based prompt comparison. Detects contradictory instructions between overlapping agents.
Boundary Awareness
Scores agents on hedging language, uncertainty guidance, and explicit scope constraints. Agents without boundaries confidently answer outside their domain.
Coverage Gaps
Diffs the union of claimed domains against 19 recognized categories. Surfaces uncovered and weakly-covered areas in your agent fleet.
Live Probes
Generates boundary questions, sends them through your LLM provider, and measures calibration, refusal health, and stochastic consistency.

Two Modes

Static Analysis
agent-evals check

No API calls, no credentials. Reads agent definitions from disk, extracts domains via keyword analysis, computes pairwise overlap using Jaccard similarity and LCS-based prompt comparison, and flags conflicts and coverage gaps.

  • Domain extraction from system prompts across 19 recognized domains
  • Pairwise overlap scoring using Jaccard similarity on domain sets and LCS-based prompt similarity
  • Conflict detection via regex opposition pair matching
  • Coverage gap identification for uncovered and weakly-covered domains
Static + Live Probes
agent-evals test

Everything in check, plus live boundary probes. Generates out-of-scope questions for each agent, sends them through your LLM provider, and scores refusal health and calibration.

  • Boundary probes: out-of-scope questions that well-configured agents should hedge or refuse
  • Calibration scoring: measures whether confidence levels match actual capability
  • Refusal health: tracks appropriate hedging on questions agents shouldn't answer
  • Consistency: runs each probe multiple times at temperature 0.7 to measure response variance
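For example, to run live probes locally against Anthropic before wiring them into CI:
export ANTHROPIC_API_KEY=...
agent-evals test ./agents/ --provider anthropic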

How It Works

Point at your agent definitions
YAML, JSON, Markdown with frontmatter, plain text, or directory-based layouts. Auto-detects format, extracts system prompts, tool definitions, and routing rules.
Static analysis runs instantly
Domain extraction, pairwise overlap computation, conflict detection, gap analysis, and per-agent scoring. No API calls, no waiting.
Live probes test real behavior
With agent-evals test, boundary probes are sent through your LLM provider. One deterministic pass at temperature 0, then stochastic runs at 0.7 to measure response variance.
Get your report
Terminal output with color-coded scores, JSON for CI pipelines, or Markdown for PR comments. Use --ci for machine-friendly defaults.
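Because --ci emits JSON, the report can also be captured for downstream tooling, for example:
agent-evals check ./agents/ --ci > report.json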

Any Provider

Live probes support Anthropic, OpenAI, and any OpenAI-compatible endpoint. Run against hosted models or locally via Ollama.

Anthropic
Claude models via the Messages API. Set ANTHROPIC_API_KEY and go.
OpenAI
GPT models via Chat Completions. Use --provider openai.
OpenAI-Compatible
Ollama, Cerebras, Together, Groq, or any service with an OpenAI-compatible API. Use --provider openai-compatible --base-url http://localhost:11434/v1.
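For example, a run against a local Ollama endpoint (depending on your setup, you may also need to configure a specific model):
agent-evals test ./agents/ --provider openai-compatible --base-url http://localhost:11434/v1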

CI Integration

--ci outputs JSON, disables the pager, and exits with code 1 when scores fall below configurable thresholds.

# GitHub Actions
- name: Evaluate agents
  run: agent-evals check ./agents/ --ci

# With live probes
- name: Test agent boundaries
  run: agent-evals test ./agents/ --ci --provider anthropic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Configurable thresholds. Set min_overall_score and min_boundary_score in your agent-evals.yaml to control when CI fails. Defaults: 70% overall, 50% boundary.