Q1 '26

State of the Agent

Brandon Huey · February 2026

Do coding agents know what they don't know?

Developers are writing system prompts that define entire roles for AI: backend architect, security reviewer, database migration lead. More than tools, they're job descriptions for autonomous collaborators, shared across teams as open-source plugins. When thousands of people independently write agent configurations for the same platform, patterns emerge. This census examines those patterns, with particular attention to two: how well agents define their boundaries, and whether they acknowledge uncertainty at all.

This census takes the wshobson/agents repository, one of the largest collections of Claude Code agent definitions on GitHub with nearly 30,000 stars, and runs its 69 plugin directories through agent-evals, which has two modes: static analysis that checks each system prompt for domain coverage, overlap, boundary language, and uncertainty guidance across 18 evaluation domains, and an LLM harness that probes agents at runtime to measure whether their boundaries actually hold.

The result is a snapshot of 375 agents, skills, and commands: how they define their scope, where they overlap, and what patterns emerge in how developers write agent configurations. None of this is a judgment on quality. It's a look at the structural patterns of a fast-moving ecosystem.

One pattern stood out: while 72% of agents include language defining their boundaries, fewer than 9% say anything about what they don't know. And when those boundaries were tested at runtime, most agents ignored them entirely — confidently answering outside their domain with a mean boundary score of just 0.30. That gap between what agents say and what they do is worth paying attention to as they take on more autonomous roles.

By the Numbers
The Census
375 primitives scanned
119 agents
69 plugins
18 domains evaluated
57,565 pairwise overlaps
23% significant overlap
72% define boundaries
<9% address uncertainty
10,000 LLM calls
0.30 avg boundary score
Chapter 1

What Are Agent Primitives?

Taxonomy
Three Primitives
Agent
Full persona in AGENT.md. Own context window, tools, model.
Skill
Reusable capability in SKILL.md. Procedural, slash-invoked.
Command
Bounded action in commands/. Legacy, now merged into skills.
Convention-driven, not contract-driven. No schema, no types — just natural language.

Claude Code is Anthropic's CLI for AI-assisted software development. Its plugin system lets developers extend Claude's capabilities by defining specialized configurations in Markdown and YAML files. These configurations come in three types, referred to here as agent primitives.

Agent. A full persona with its own system prompt, tool access, and model configuration. Runs in an isolated context window. Think of it as a job description for the AI: scope of responsibility, areas of expertise, rules of engagement. Defined in AGENT.md.
Skill. A reusable capability or workflow that extends what Claude can do. Can be invoked as a slash command or loaded automatically when relevant. More procedural than persona-driven. Defined in SKILL.md.
Command. A slash-command shortcut that triggers a specific, bounded action. The older convention, now functionally merged into skills but still widely used. Defined in .md files in commands/.

A plugin bundles all three. A single plugin directory might contain 2 agents, 5 skills, and 3 commands, all working together. The wshobson/agents repository organizes its contributions this way: 69 plugin directories, each contributed by a different author, containing a total of 119 agents, 153 skills, and 81 commands.

What makes this interesting is that Claude Code plugins are configuration-driven behavior augmentation. The system is convention-driven, not contract-driven: there is no schema validation, type checking, or enforced interface. A plugin is just Markdown and YAML files in a directory that follows naming conventions. Claude interprets the natural language in those files and adapts its behavior accordingly, which means the quality, clarity, and structure of that text directly shapes what the agent actually does.
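
To make the convention concrete, here is a minimal sketch of how a tool like agent-evals can discover primitives in a plugin directory purely from filenames. The directory path and function name are illustrative, not the actual agent-evals implementation.

```python
from pathlib import Path

def classify_primitives(plugin_dir: str) -> dict[str, list[Path]]:
    """Bucket a plugin's Markdown files by the naming conventions above:
    AGENT.md for agents, SKILL.md for skills, anything under commands/ for
    commands. Illustrative sketch only."""
    found: dict[str, list[Path]] = {"agents": [], "skills": [], "commands": []}
    for path in Path(plugin_dir).rglob("*.md"):
        if path.name == "AGENT.md":
            found["agents"].append(path)
        elif path.name == "SKILL.md":
            found["skills"].append(path)
        elif "commands" in path.parts:
            found["commands"].append(path)
    return found

# Hypothetical usage: classify_primitives("plugins/backend-architect")
```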

This taxonomy matters for the analysis that follows. Agents with full personas need boundaries and uncertainty guidance more than simple commands do. When measuring "does this entity define what it doesn't know?", the answer carries different weight depending on whether it's an autonomous agent or a one-line command shortcut.

Chapter 2

Methodology

Data source. The wshobson/agents repository, one of the largest collections of Claude Code agent definitions on GitHub with nearly 30,000 stars. All 69 plugin directories were scanned recursively, processing 496 total files and deduplicating by content hash to arrive at 375 unique primitives.

Tool. agent-evals v0.3.0 is an open-source analysis tool for agent configurations. It reads system prompts in YAML, JSON, Markdown, and other formats. The tool has two modes: deterministic static analysis (keyword matching, regex scoring, Jaccard overlap) and a live probe mode that uses LLM-driven testing to evaluate agent behavior at runtime. This census uses both.

Runtime probes. In addition to static analysis, a behavioral probe was run using agent-evals' LLM harness mode. Each agent's system prompt was tested with calibrated questions designed to probe domain boundaries — questions deliberately outside the agent's claimed expertise. The LLM (Llama 3.3 70B) was given each agent's system prompt and asked to answer, reporting its confidence. Each probe ran once deterministically (temperature 0) and three times stochastically (temperature 0.7) to measure consistency. 420 of 428 agents were probed across 2,500 questions and 10,000 API calls. Use the Static / Runtime tabs on charts in Chapters 7 and 8 to compare the two perspectives.
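
The probe loop can be sketched roughly as follows. `call_llm` is a placeholder for whatever client the harness uses, and the numeric confidence field mirrors the self-reported confidence described above; none of this is the actual agent-evals API.

```python
import statistics

def probe_agent(system_prompt: str, question: str, call_llm) -> dict:
    """One probe: a single deterministic run plus three stochastic runs, as
    described above. call_llm(system, user, temperature) is a stand-in for the
    real LLM client, assumed to return {"answer": ..., "confidence": float}."""
    deterministic = call_llm(system_prompt, question, temperature=0.0)
    stochastic = [call_llm(system_prompt, question, temperature=0.7) for _ in range(3)]
    confidences = [r["confidence"] for r in stochastic]
    return {
        "answer": deterministic["answer"],
        "mean_confidence": statistics.mean(confidences),
        # Spread across stochastic runs as a crude consistency signal.
        "confidence_spread": max(confidences) - min(confidences),
    }
```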

What Gets Measured

Domain coverage. Each system prompt is checked for keywords across 18 evaluation domains (backend, frontend, security, databases, testing, etc.). An agent matching half the keywords in a domain scores 1.0 for that domain. This is keyword-based pattern matching, not semantic understanding.
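
A minimal version of that check might look like the sketch below. The linear ramp up to the half-keyword saturation point is an assumption; the census only states that matching half of a domain's keywords yields 1.0.

```python
def domain_score(prompt: str, keywords: list[str]) -> float:
    """Keyword-based coverage score for one domain. Matching half (or more) of
    the domain's keywords saturates at 1.0; the linear ramp below that point is
    assumed, not agent-evals' exact formula."""
    if not keywords:
        return 0.0
    text = prompt.lower()
    matched = sum(1 for kw in keywords if kw.lower() in text)
    return min(1.0, matched / (len(keywords) * 0.5))

# e.g. domain_score(prompt, ["sql", "schema", "index", "migration", "query plan", "orm"])
```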

Pairwise overlap. For every pair of primitives, Jaccard similarity is computed on their sets of strong domains (score > 0.3). This produces 57,565 overlap scores. The analysis also checks for direct contradictions: cases where one agent says "always use X" and another says "never use X."
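
Both pieces of that computation are small enough to show directly; this sketch mirrors the description above rather than agent-evals' source. The contradiction check is sketched separately in Chapter 6.

```python
def strong_domains(scores: dict[str, float], threshold: float = 0.3) -> set[str]:
    """Domains where a primitive's coverage score clears the 0.3 cutoff."""
    return {domain for domain, score in scores.items() if score > threshold}

def pairwise_overlap(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Jaccard similarity of two primitives' strong-domain sets."""
    a, b = strong_domains(scores_a), strong_domains(scores_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```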

Boundary language. A regex check for words like "don't," "avoid," "outside," "limit," "boundary," or "refer to" in the system prompt. If present, the primitive scores 0.7 for boundary definition; if absent, 0.3.

Uncertainty guidance. A regex check for words like "uncertain," "unsure," "don't know," "not sure," or "confidence." If present, the primitive scores 0.8; if absent, 0.3.
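
A minimal equivalent of the two checks above, using the word lists from the descriptions (the exact patterns agent-evals applies are not reproduced here):

```python
import re

BOUNDARY_TERMS = re.compile(r"\b(don'?t|avoid|outside|limit|boundar(?:y|ies)|refer to)\b", re.IGNORECASE)
UNCERTAINTY_TERMS = re.compile(r"\b(uncertain|unsure|don'?t know|not sure|confidence)\b", re.IGNORECASE)

def boundary_score(prompt: str) -> float:
    """0.7 if any boundary-style term appears in the prompt, else 0.3."""
    return 0.7 if BOUNDARY_TERMS.search(prompt) else 0.3

def uncertainty_score(prompt: str) -> float:
    """0.8 if any uncertainty-style term appears in the prompt, else 0.3."""
    return 0.8 if UNCERTAINTY_TERMS.search(prompt) else 0.3
```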

The static analysis layer is fully deterministic — two runs on the same input produce identical results. The runtime probe layer introduces controlled randomness (three stochastic runs per probe) to measure behavioral variance. Together, they reveal where an agent's configuration (what it says) diverges from its behavior (what it does).
Chapter 3

Ecosystem Flow

How does the ecosystem organize itself? When developers create agent plugins, which domains do they target, and how do plugins distribute their focus? The Sankey diagram below maps the flow from plugins (left) to domains (right). Link width is proportional to how many primitives in a plugin claim a given domain. Small plugins with fewer than 4 total links are grouped under "Other" to keep things readable.

Figure 1. Plugin-to-domain Sankey diagram. Left nodes are plugins (blue), right nodes are the 18 evaluation domains (amber). Hover for details. Zoom and pan with scroll and drag.

The flow reveals a heavily concentrated ecosystem. A handful of domains attract the vast majority of plugin attention, while others see relatively little coverage. Backend development, security, and testing are the thickest arteries, with most plugins claiming at least some presence in these areas.

Notice the pattern of fan-out: many plugins connect to the same popular domains. This isn't surprising since these domains represent core software engineering work, but it does mean that users installing multiple plugins are likely to end up with significant redundancy in those areas.

Network Insight
Domain Concentration
Backend development and security dominate the ecosystem, with 40% of agent claims concentrated in just 5 domains.
This concentration suggests opportunity for differentiation in emerging domains like cloud infrastructure and database design.
Chapter 4

The Domain Landscape

Domain Insight
18 Domains
Top 5 claim 40% of coverage
Broad: backend, security, testing, frontend, DevOps
Niche: mobile, data science, distributed systems
Non-tech: legal, medical, financial

If every agent declares what it knows, where does the collective knowledge cluster? The treemap below sizes each of the 18 evaluation domains by how many primitives claim them. The selectivity slider filters out domains that match too broadly: a domain with high selectivity matches few agents (more meaningful signal), while one with low selectivity matches most agents (potentially just noise from generic language).

Figure 2. Domain distribution treemap. Tile area encodes agent count. Darker tiles indicate higher selectivity (fewer agents match). Drag the slider to filter out overly broad domains.
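
The report doesn't spell out the slider's exact semantics. One plausible reading, consistent with the description above, is that it caps how large a share of agents a domain may match before the tile is hidden:

```python
def filter_broad_domains(match_counts: dict[str, int], total_agents: int, cutoff: float) -> dict[str, float]:
    """Keep only domains matched by at most `cutoff` of all agents. Lowering the
    cutoff hides the broad, generalist domains first. The slider semantics here
    are an assumption about the interactive control, not its source."""
    kept = {}
    for domain, n in match_counts.items():
        fraction = n / total_agents
        if fraction <= cutoff:
            kept[domain] = fraction
    return kept

# Hypothetical usage: filter_broad_domains({"backend": 90, "mobile": 12}, total_agents=119, cutoff=0.70)
```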

At the default threshold, several broad domains dominate the map. As you lower the selectivity cutoff, the generalist domains drop away and the more specialized ones come into focus. This is where the interesting structure lives: domains that are claimed by a meaningful subset of agents rather than nearly all of them.

The domains that survive aggressive filtering tend to be the most technically specific (mobile, distributed systems, data science) and the non-technical domains (legal, medical, financial). These represent genuine specialization rather than incidental keyword matching.

Chapter 5

Inside a Plugin

Plugin Insight
Internal Cohesion
0.39 avg cohesion
High cohesion = skills designed to support their agents
Low cohesion = grab-bag of unrelated capabilities

A plugin is more than a collection of agents. The best-designed plugins have internal coherence: their agents, skills, and commands share enough functional ground to work together, without so much overlap that they become redundant. The radial graph below shows the internal structure of each plugin. Nodes are colored by type and lines connect entities that share functional overlap.

Figure 3. Plugin anatomy graph. Nodes are colored by type: agents, skills, commands. Line thickness encodes overlap score. Dashed red lines indicate conflicts. Cohesion is the average agent-to-skill overlap within the plugin.

Cohesion is the average overlap score between agent-type and skill-type entities within a plugin. Higher cohesion suggests the skills are designed to support the agents (they share functional ground). Very low cohesion might indicate a grab-bag of unrelated capabilities. The ecosystem average is 0.39.
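
Given the strong-domain sets from Chapter 2, cohesion reduces to a small calculation. The sketch below assumes each entity is represented by its set of strong domains (score > 0.3).

```python
from itertools import product
from statistics import mean

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def plugin_cohesion(agent_domains: list[set[str]], skill_domains: list[set[str]]) -> float:
    """Average overlap between every agent and every skill in one plugin.
    Plugins with no agents or no skills get 0.0 here; how agent-evals handles
    that edge case is not documented, so it's an assumption."""
    if not agent_domains or not skill_domains:
        return 0.0
    return mean(jaccard(a, s) for a, s in product(agent_domains, skill_domains))
```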

Browse through several plugins to see the variety: some are tightly focused with a single agent supported by several skills, others are broad toolkits spanning many domains.

Agent Lookup

Search for specific primitives by name. Results show type, plugin, word count, and whether boundary and uncertainty language are present.

Chapter 6

The Overlap Problem

When two plugins independently define agents for the same domain, what happens? The chord diagram below visualizes cross-plugin functional overlap. Each arc represents a plugin; chords connect plugins whose agents overlap above the threshold. Thicker chords mean more overlapping pairs. Red chords indicate detected conflicts: cases where one agent's system prompt says “always use X” or “prefer X” while another says “avoid X” or “never use X.” These aren't domain mismatches — they're related agents giving opposite advice about the same tools or practices.

Figure 4. Cross-plugin overlap chord diagram. Arcs are plugins, chords connect plugins with overlapping agents above the threshold. Red indicates conflicts. Hover for detail.
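
The conflict check is pattern-based. A rough sketch of the idea, using the phrasings quoted above (agent-evals' real matching is presumably more thorough than single-token capture):

```python
import re

ENDORSE = re.compile(r"\b(?:always use|prefer)\s+([\w.+-]+)", re.IGNORECASE)
REJECT = re.compile(r"\b(?:never use|avoid)\s+([\w.+-]+)", re.IGNORECASE)

def find_conflicts(prompt_a: str, prompt_b: str) -> set[str]:
    """Names that one prompt endorses and the other rejects, in either direction."""
    endorsed_a = {m.lower() for m in ENDORSE.findall(prompt_a)}
    rejected_a = {m.lower() for m in REJECT.findall(prompt_a)}
    endorsed_b = {m.lower() for m in ENDORSE.findall(prompt_b)}
    rejected_b = {m.lower() for m in REJECT.findall(prompt_b)}
    return (endorsed_a & rejected_b) | (endorsed_b & rejected_a)

# e.g. find_conflicts("Always use raw SQL for hot paths.", "Avoid raw SQL; go through the ORM.")
# -> {"raw"}  (single-token capture is a known simplification of this sketch)
```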

23% of agent pairs show significant functional overlap. That number is worth sitting with. In an ecosystem of independently authored plugins, some overlap is natural and even healthy: it means popular domains are well-served. But the density of connections at even moderate thresholds suggests users installing several plugins will encounter real redundancy.

The more concerning signal is the red chords: direct contradictions between agents in different plugins. These are cases where one agent says "always use X" and another says "avoid X." A developer using both plugins gets conflicting guidance with no warning.

Complexity Insight
Redundant Agents
23% of agent pairs show significant functional overlap, indicating the potential for redundancy in the ecosystem.
This redundancy creates decision fatigue for users and maintenance overhead for plugin authors, without necessarily increasing coverage of unique capabilities.
Chapter 7

The Boundary Gap

Key Finding
The 8:1 Ratio
72% define boundaries
<9% address uncertainty
Agents know what they do, but rarely say what they don't know.

This is the most actionable finding in the census. Boundary language means the system prompt explicitly states what the agent should not do, or where its expertise ends ("I focus on backend development and should not be used for frontend work"). Uncertainty guidance means the prompt instructs the agent how to handle things it doesn't know ("If I'm unsure about a security implication, I'll flag it rather than guess").

The chart below breaks down both metrics by primitive type.

Figure 5. Boundary definition vs. uncertainty guidance by primitive type. Green bars show the percentage with boundary language; red bars show uncertainty guidance.

72% of primitives include some form of boundary language, but fewer than 9% address uncertainty. That's an 8:1 ratio. Agents are the best of the three types at both, but even among agents the uncertainty gap is stark.

This matters because an agent that confidently defines its scope ("I am a security expert") but never acknowledges limits ("If I encounter an unfamiliar vulnerability class...") may produce authoritative-sounding answers in areas where it should hedge. The boundary gap is the distance between claiming competence and acknowledging limits.

Switch to Runtime to see the say-do gap. When 2,500 calibrated probes were sent to these same agents through an LLM harness, the picture changed dramatically. The mean boundary respect score dropped to 0.30 — most agents confidently answered questions outside their claimed domain, regardless of what their system prompt said. Refusal health averaged just 0.04: almost no agent refused when it should have. Writing "I only handle backend tasks" in a prompt and actually declining frontend questions are very different things.
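
The report doesn't publish the formulas behind "boundary respect" and "refusal health." As a rough illustration of what scores in that range imply, one plausible aggregation over out-of-scope probes might look like this; the field names and scoring rule are assumptions, not agent-evals' actual metrics.

```python
from statistics import mean

def boundary_respect(probes: list[dict]) -> float:
    """Credit an agent for declining or deferring on questions outside its
    claimed domain. Illustrative only."""
    out_of_scope = [p for p in probes if p["out_of_scope"]]
    if not out_of_scope:
        return 1.0
    return mean(1.0 if p["declined"] else 0.0 for p in out_of_scope)

def refusal_health(probes: list[dict]) -> float:
    """Share of should-refuse probes that were actually refused (same caveats)."""
    should_refuse = [p for p in probes if p["should_refuse"]]
    if not should_refuse:
        return 1.0
    return mean(1.0 if p["declined"] else 0.0 for p in should_refuse)
```

Under any scoring along these lines, a mean of 0.30 means most out-of-scope questions were answered anyway.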

Boundary Audit

Primitives with high scope clarity but no boundary or uncertainty language. These confidently claim broad scope with no guardrails.

Chapter 8

Score Distributions

The previous chapters presented aggregate findings. Below are the full distributions of the three key metrics across all 375 primitives, broken down by type. This reveals the shape of the data behind the averages.

Figure 6. Score distributions for scope clarity, boundary definition, and uncertainty guidance. Bars are grouped by primitive type: agents, skills, commands.
Distribution
Three Metrics
Scope clusters high — most agents say what they do
Boundary bimodal — present or absent, no middle ground
Uncertainty overwhelmingly absent across all types

Scope clarity clusters high. Most primitives clearly state what they do, which makes sense since a system prompt that doesn't describe its purpose wouldn't be very useful.

Boundary definition shows a bimodal split: a cluster near 0.3 (no boundary language detected) and another near 0.7 (boundary language present). There's little middle ground, which reflects the binary nature of the regex check.

Uncertainty guidance clusters overwhelmingly near 0.3 (absent) for all types. The handful of primitives that do address uncertainty are scattered without a clear pattern by type.

Chapter 9

Limitations

This analysis has meaningful constraints that should inform how you interpret the results.

Chapter 10

What This Means

The agent ecosystem is growing fast and organically. Thousands of developers are independently writing configurations that define how AI collaborators should behave, and a few patterns are worth noting.

The say-do gap is the headline finding. 72% of agents include boundary language in their system prompts, but when tested with out-of-scope questions, the mean runtime boundary score is just 0.30. Most agents confidently answer outside their domain regardless of what their prompt says. Refusal health averages 0.04 — almost no agent refuses when it should. Writing boundaries is easy; making them work is the harder problem.

The boundary gap is still the lowest-hanging fruit. Adding explicit scope limits and uncertainty instructions to a system prompt is straightforward. Plugin authors could meaningfully improve their agents by adding a few lines about what the agent should not attempt and how it should handle the edges of its knowledge. But authors should test those boundaries at runtime, not just assume the language will stick.

Concentration creates both opportunity and risk. The heavy clustering around backend, security, and testing domains means those areas are well-served but potentially oversaturated. Developers building new plugins might find more impact in underserved domains like cloud infrastructure, observability, or data science.

Overlap isn't inherently bad, but conflicts are. Multiple plugins covering the same domain gives users choice. But contradictory guidance between plugins is a real problem that currently has no systematic detection mechanism beyond tools like agent-evals.

Skills need boundaries too. Skills and commands have even lower boundary and uncertainty rates than agents. As these primitives take on more autonomous roles with features like forked execution contexts, their need for guardrails grows.

Future editions of this census will track how the ecosystem evolves — whether the boundary gap narrows as awareness grows, and how new Claude Code features change the patterns in community-authored configurations.

Related research. While this census examines the structure of agent configurations, a complementary question is whether those configurations actually help. Gloaguen et al. ("Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?", ETH Zurich, Feb. 2026) evaluated context files (AGENTS.md, CLAUDE.md) across four coding agents and two benchmarks. Their findings are nuanced: LLM-generated context files actually decreased task success rates by 2-3% while increasing cost by over 20%, while human-written files provided a modest ~4% improvement. Agents do follow the instructions in context files (tools mentioned get used 1.6-2.5x more), but more instructions also mean more exploration, more testing, and more reasoning tokens spent. The authors conclude that "unnecessary requirements make tasks harder" and that context files "should describe only minimal requirements." This parallels the boundary gap finding in this census: the challenge isn't just whether agents have guardrails, but whether the right guardrails are expressed concisely enough to actually help. Their work evaluates effectiveness at the repository level; this census maps the structural patterns across an entire ecosystem. Together, they suggest that the quality and minimalism of agent configurations matters more than their quantity.
Appendix

How This Was Built

This report runs entirely in your browser. The data is stored in Parquet files served over HTTP, and every chart, filter, and mini-tool queries those files using SQL executed client-side in WebAssembly. There is no backend API processing your interactions.
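
The same Parquet files can be inspected outside the browser. A hypothetical local query with DuckDB (the file name and column names below are illustrative, inferred from the metrics in this report rather than taken from the actual schema):

```python
import duckdb

con = duckdb.connect()
summary = con.sql("""
    SELECT type,
           COUNT(*)               AS n,
           AVG(boundary_score)    AS avg_boundary,
           AVG(uncertainty_score) AS avg_uncertainty
    FROM 'primitives.parquet'      -- hypothetical file and column names
    GROUP BY type
    ORDER BY n DESC
""").df()
print(summary)
```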