Live Probe Scoring
Live probes are scored on four dimensions:
Boundary awareness
How often the agent correctly hedges on out-of-scope questions
Calibration
Whether stated confidence matches demonstrated accuracy
Refusal health
Whether the agent says "I don't know" when it should
Consistency
How stable responses are across repeated stochastic runs
Agent Definition Formats
YAML / JSON
Fields: system_prompt or instructions, plus optional skills, rules, claimed_domains (see the sketch after this list)
Markdown with Frontmatter
YAML frontmatter for metadata, body becomes the system prompt. Supports name, skills, domain_tags
Plain Text
Entire file content treated as the system prompt. ID and name derived from filename
Directory-Based
AGENT.md + optional SKILLS.md and RULES.md. Lists extracted from markdown bullet points
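For illustration, here is a sketch of the first two formats. The field names come from the list above, but the exact schema is an assumption worth checking against your agent-evals version. A minimal YAML definition:
system_prompt: |
  You review backend services and database schemas. Hedge or refuse
  questions outside backend, databases, and security.
skills:
  - code review
claimed_domains:
  - backend
  - databases
And a similar agent as Markdown with frontmatter, where the body becomes the system prompt:
---
name: backend-reviewer
skills:
  - code review
domain_tags:
  - backend
  - databases
---
You review backend services and database schemas. Hedge or refuse questions outside backend and databases.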
Static Analysis Categories
Domain Extraction
Keyword analysis across 19 recognized domains: backend, frontend, devops, databases, security, cloud, observability, data science, and more
Pairwise Overlap
Jaccard similarity on domain sets plus LCS-based prompt comparison. A composite score flags agent pairs that step on each other (worked example below)
Conflict Detection
Regex-based opposition pair matching that catches contradictory instructions like "always use gRPC" vs "prefer REST"
Coverage Gaps
Diffs the union of claimed domains against recognized categories. Reports uncovered and weakly-covered areas
Boundary & Uncertainty
Detects hedging language, scope constraints, and uncertainty guidance in agent prompts. Agents without these confidently overreach
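To make the overlap math concrete: if one agent's prompt maps to the domains backend, databases, and security, and another's maps to backend, devops, and databases, the two share 2 of 4 distinct domains, a Jaccard similarity of 0.5 on the domain-set component alone, before LCS prompt similarity is folded into the composite score.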
Default CI Thresholds
max_overlap_score 0.3 (30%)
min_calibration_score 0.6 (60%)
min_boundary_score 0.5 (50%)
max_refusal_suppression 0.2 (20%)
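These defaults can be overridden in agent-evals.yaml. A sketch, assuming the threshold keys sit at the top level of the file:
max_overlap_score: 0.3
min_calibration_score: 0.6
min_boundary_score: 0.5
max_refusal_suppression: 0.2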
Install
Install agent-evals
Go Install
go install github.com/thinkwright/agent-evals@latest
Then run:
agent-evals check ./agents/
Homebrew
brew install thinkwright/tap/agent-evals
Binary Download
curl -fsSL https://github.com/thinkwright/agent-evals/releases/latest/download/agent-evals_$(uname -s)_$(uname -m).tar.gz | tar xz
Downloads the pre-built binary for your platform.
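Assuming the archive unpacks a single agent-evals binary, move it onto your PATH, for example:
sudo mv agent-evals /usr/local/bin/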

Agents Overlap

Claude Code
Cline
Cursor
Augment
Windsurf
Copilot
Aider
Custom YAML / JSON / Markdown

Detect scope overlap between agents.
Score boundary awareness and calibration.
Find coverage gaps across your fleet.
Test with live LLM probes or static analysis alone.

Scope Overlap
Pairwise Jaccard similarity on domain sets plus LCS-based prompt comparison. Detects contradictory instructions between overlapping agents.
Boundary Awareness
Scores agents on hedging language, uncertainty guidance, and explicit scope constraints. Agents without boundaries confidently answer outside their domain.
Coverage Gaps
Diffs the union of claimed domains against 19 recognized categories. Surfaces uncovered and weakly-covered areas in your agent fleet.
Live Probes
Generates boundary questions, sends them through your LLM provider, and measures calibration, refusal health, and stochastic consistency.

Two Modes

Static Analysis
agent-evals check

No API calls, no credentials. Reads agent definitions from disk, extracts domains via keyword analysis, computes pairwise overlap using Jaccard similarity and LCS-based prompt comparison, and flags conflicts and coverage gaps.

  • Domain extraction from system prompts across 19 recognized domains
  • Pairwise overlap scoring using Jaccard similarity on domain sets and LCS-based prompt similarity
  • Conflict detection via regex opposition pair matching
  • Coverage gap identification for uncovered and weakly-covered domains
Static + Live Probes
agent-evals test

Everything in check, plus live boundary probes. Generates out-of-scope questions for each agent, sends them through your LLM provider, and scores refusal health and calibration.

  • Boundary probes: out-of-scope questions that well-configured agents should hedge or refuse
  • Calibration scoring: measures whether confidence levels match actual capability
  • Refusal health: tracks appropriate hedging on questions agents shouldn't answer
  • Consistency: runs each probe multiple times at temperature 0.7 to measure response variance
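For example, to run live probes locally against Anthropic before wiring them into CI:
export ANTHROPIC_API_KEY=...
agent-evals test ./agents/ --provider anthropic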

How It Works

Point at your agent definitions
YAML, JSON, Markdown with frontmatter, plain text, or directory-based layouts. Auto-detects format, extracts system prompts, tool definitions, and routing rules.
Static analysis runs instantly
Domain extraction, pairwise overlap computation, conflict detection, gap analysis, and per-agent scoring. No API calls, no waiting.
Live probes test real behavior
With agent-evals test, boundary probes are sent through your LLM provider. One deterministic pass at temperature 0, then stochastic runs at 0.7 to measure response variance.
Get your report
Terminal output with color-coded scores, JSON for CI pipelines, or Markdown for PR comments. Use --ci for machine-friendly defaults.
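Because --ci emits JSON, the report can also be captured for downstream tooling, for example:
agent-evals check ./agents/ --ci > report.json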

Any Provider

Live probes support Anthropic, OpenAI, and any OpenAI-compatible endpoint. Run against hosted models or locally via Ollama.

Anthropic
Claude models via the Messages API. Set ANTHROPIC_API_KEY and go.
OpenAI
GPT models via Chat Completions. Use --provider openai.
OpenAI-Compatible
Ollama, Cerebras, Together, Groq, or any service with an OpenAI-compatible API. Use --provider openai-compatible --base-url http://localhost:11434/v1.
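For example, a run against a local Ollama endpoint (depending on your setup, you may also need to configure a specific model):
agent-evals test ./agents/ --provider openai-compatible --base-url http://localhost:11434/v1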

CI Integration

--ci outputs JSON, disables the pager, and exits with code 1 when scores fall below configurable thresholds.

# GitHub Actions
- name: Evaluate agents
  run: agent-evals check ./agents/ --ci

# With live probes
- name: Test agent boundaries
  run: agent-evals test ./agents/ --ci --provider anthropic
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Configurable thresholds. Set min_overall_score and min_boundary_score in your agent-evals.yaml to control when CI fails. Defaults: 70% overall, 50% boundary.