Stanford's Hazy Research lab measured a 5.3x improvement in "intelligence per watt" (IPW) from 2023 to 2025, yet global AI datacenter energy consumption doubled over the same period and annual AI infrastructure capex from Amazon, Google, Meta, and Microsoft reached $700 billion in 2026. Per-token efficiency is improving faster than in any comparable technology transition, while aggregate cognitive output, measured against the capital deployed to produce it, shows signs of declining. The disconnect, as Stanford's own IPW researchers have noted, suggests that the industry's dominant metric, tokens per watt, captures hardware performance without capturing whether the output is useful.
"Tell me how you will measure me, and I will tell you how I will behave. If you measure me in an illogical way, do not complain about illogical behaviour."Eli Goldratt, Theory of Constraints
Sequoia Capital quantified the financial expression of this measurement gap in mid-2024, estimating a $600 billion annual revenue shortfall between AI infrastructure spending and the revenue that infrastructure would need to generate to justify itself. J.P. Morgan subsequently calculated that earning a 10% return on the current AI buildout would require $650 billion in annual revenue, equivalent to a perpetual payment of roughly $35 per month from every iPhone user or $180 per month from every Netflix subscriber. These figures frame the question in terms of financial throughput (revenue). The deeper question is whether the physical throughput metric that drives infrastructure investment, tokens per watt, even correlates with the economic value being produced.
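A rough sanity check shows why the per-subscriber framing only works on a monthly basis. The subscriber counts below are commonly cited approximations used for illustration, not figures from the J.P. Morgan analysis itself:

```python
# Rough sanity check on J.P. Morgan's $650 billion annual revenue requirement.
# Subscriber counts are approximations used for illustration only.
required_annual_revenue = 650e9      # dollars per year

iphone_users = 1.5e9                 # assumed active iPhone users
netflix_subscribers = 300e6          # assumed Netflix subscribers

per_iphone_user_monthly = required_annual_revenue / iphone_users / 12
per_netflix_sub_monthly = required_annual_revenue / netflix_subscribers / 12

print(f"Per iPhone user:        ${per_iphone_user_monthly:.0f}/month")    # ~$36
print(f"Per Netflix subscriber: ${per_netflix_sub_monthly:.0f}/month")    # ~$181
```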
Every major industry that has matured through a capital-intensive buildout phase has eventually abandoned its initial throughput metric in favor of an outcome metric. Airlines replaced Available Seat Miles with Revenue per Available Seat Mile; telecom carriers abandoned circuit-switched minutes in favor of Average Revenue Per User; and manufacturing shifted from units-per-hour to value-stream efficiency. In each case, the throughput metric masked waste, rewarded overproduction, and directed capital toward capacity rather than capability. AI infrastructure is at the same inflection point, and the stakes, measured in both energy and capital, have never been higher for a technology this early in its deployment curve.
The gap between tokens generated and useful cognitive work delivered is not a single inefficiency but a stack of compounding losses, each invisible to the tokens-per-watt metric. OpenRouter now processes over 20 trillion tokens per week (a 12.7x year-over-year increase), and the average inference call consumes roughly 6,000 input tokens to produce 400 output tokens: a 15:1 ratio in which users see approximately 6.3% of all tokens processed. Early 2024 ratios were closer to 10:1, meaning the shift toward longer system prompts, tool definitions, and agent frameworks has worsened the ratio even as per-token costs have fallen. Programming tasks now account for more than 50% of all token consumption on OpenRouter, up from 11% at the start of 2025, and a single AI coding agent can consume up to 20 million tokens on a relatively small task.
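The visibility arithmetic follows directly from those averages; a short sketch:

```python
# Share of processed tokens the end user actually sees, using the average
# call shape cited above: ~6,000 input tokens for ~400 output tokens.
input_tokens = 6_000
output_tokens = 400

total_processed = input_tokens + output_tokens
visible_fraction = output_tokens / total_processed

print(f"Input:output ratio : {input_tokens / output_tokens:.0f}:1")   # 15:1
print(f"Visible to user    : {visible_fraction:.2%}")                 # 6.25%
```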
Reasoning models introduce a separate layer of overhead. Chain-of-thought "thinking" tokens, invisible to the end user, now represent the majority of token volume on major inference platforms. The NoWait group (2025) found that 27% to 51% of these thinking tokens can be removed without compromising output accuracy, and subsequent work has confirmed the scale of this waste: ThinkPrune (2026) demonstrated a 50% reduction in reasoning length with only a 2% accuracy drop on AIME24, while REFRAIN achieved 20% to 55% fewer tokens while maintaining or exceeding baseline accuracy. State-of-the-art reasoners routinely consume more than 15,000 tokens for math problems solvable with a few hundred, and DeepSeek-R1 generates roughly 4,000 thinking tokens for simple coding questions where GPT-4o uses 150. Reasoning models still account for more than 50% of all token usage on OpenRouter, which makes this waste category the single largest contributor to the gap between throughput and outcome.
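A hedged illustration of how those findings compound: the thinking-token share used below is an assumption, since the text above says only that such tokens are the majority of volume.

```python
# Illustrative only: if thinking tokens are ~60% of total volume (assumed)
# and 27%-51% of them are removable per the NoWait finding, how much of all
# token volume is pure reasoning overhead?
thinking_share = 0.60
removable_low, removable_high = 0.27, 0.51

waste_low = thinking_share * removable_low
waste_high = thinking_share * removable_high

print(f"Removable share of all tokens: {waste_low:.0%} to {waste_high:.0%}")
# -> roughly 16% to 31% of total token volume under this assumption
```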
Speculative decoding, now standard in production inference stacks including vLLM, SGLang, and TensorRT-LLM, introduces its own waste stream. Draft tokens that the verifier model rejects represent pure energy expenditure with zero output. Acceptance rates in production range from 0.60 to 0.85 depending on task type, with creative and open-ended generation at the low end (35-50% rejection) and structured code generation at the high end (15-25% rejection). Below an acceptance rate of 0.55, speculative decoding provides marginal or negative net benefit, consuming more energy than sequential generation would require.
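The break-even logic can be sketched with a toy energy model. All parameters below (draft size, draft-model cost, per-draft verification overhead) are assumptions chosen for illustration, not measurements from vLLM, SGLang, or TensorRT-LLM; under these assumptions the crossover happens near the acceptance threshold cited above.

```python
# Toy energy model for speculative decoding, illustrative parameters only.
# Baseline: one verifier step per generated token costs 1 energy unit.
# Each speculative round proposes K draft tokens and runs one verification
# pass; expected accepted tokens follow the standard geometric formula
# (including the verifier's bonus token).
K = 4               # draft tokens proposed per round (assumption)
DRAFT_COST = 0.10   # draft-model cost per token, relative to verifier (assumption)
VERIFY_EXTRA = 0.15 # marginal cost of scoring each extra draft position (assumption)

def energy_per_accepted_token(acceptance: float) -> float:
    expected_accepted = (1 - acceptance ** (K + 1)) / (1 - acceptance)
    round_cost = K * DRAFT_COST + 1.0 + K * VERIFY_EXTRA
    return round_cost / expected_accepted

for a in (0.40, 0.55, 0.70, 0.85):
    rel = energy_per_accepted_token(a)   # baseline sequential cost is 1.0
    print(f"acceptance={a:.2f}  energy vs. sequential: {rel:.2f}x")
# Under these assumed costs the crossover sits near the ~0.55 acceptance
# rate cited above; different draft/verify cost ratios move it.
```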
Agentic workflows compound all of these losses. A 30-tool agent consumes more than 21,000 tokens loading tool definitions before performing any work, and agentic systems require 5 to 30 times more tokens per task than standard chat interactions. One documented case involved an AI agent executing 47 iterations of the same failed database command, burning $30 in compute on a $0.50 problem. Recent research quantifies the opportunity: SupervisorAgent (2026) reduced multi-agent token costs by 35% and variance by 63%, while CodeAgents improved accuracy by 24.4% while cutting token volume by 72.2% on HotpotQA. AgentDiet achieved input token reductions of 39.9% to 59.7%. Agent-driven workflows now generate more than half of all output tokens on OpenRouter, which means the fastest-growing category of AI usage is also the one with the most room for efficiency gains.
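The fixed overhead alone is easy to estimate. The sketch below takes the 30-tool and 21,000-token figures from above and assumes a hypothetical 12-call agent loop that re-sends tool definitions on every call, a common if naive pattern:

```python
# Illustrative estimate of fixed agent overhead before any useful work.
tools = 30
tokens_per_tool_definition = 21_000 / tools     # ~700 tokens per tool, implied above

calls_per_task = 12                             # assumed agent loop length
overhead_tokens = tools * tokens_per_tool_definition * calls_per_task

print(f"Tool-definition tokens per call : {tools * tokens_per_tool_definition:,.0f}")
print(f"Overhead across one task        : {overhead_tokens:,.0f} tokens")
# ~252,000 tokens consumed before counting any prompt, reasoning, or output.
```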
A token has no intrinsic economic value. Its value is entirely a function of the task it contributes to completing, and most tokens contribute to nothing the end user requested. Defining "useful cognitive work" therefore requires stepping outside the inference pipeline and measuring what happens after the tokens are generated: did a customer inquiry get resolved, did a line of code survive review and reach production, did the medical image classification prove accurate, or did the legal document pass without requiring human rework?
The evidence on task-level economics is encouraging even as the aggregate efficiency picture is bleak. Klarna reported that its AI chatbot handles two-thirds of all customer inquiries, reducing cost per transaction from $0.32 to $0.19 and producing $40 million in verified cost savings during 2024, but the company subsequently began re-hiring human agents after concluding that customers need the option to speak with a person. JPMorgan Chase deploys 450 AI use cases in production (scaling toward 1,000), with AI-generated advertising copy delivering click-through rates up to 450% higher than human-written alternatives. GPT-4 completes contract reviews in under 5 minutes compared to 56 minutes for junior attorneys, a 99.97% cost reduction on a per-task basis for routine document review. Klarna's reversal is instructive: even when the throughput metrics showed clear savings, the outcome metric (customer satisfaction and resolution quality) told a different story.
These task-level gains are real, but they represent the numerator of a fraction whose denominator, total energy and capital deployed, is growing faster. LLM inference prices have fallen at a median rate of 50x per year since 2023 according to Epoch AI, with GPT-3.5-level performance dropping roughly 300x in cost from 2023 to 2026. OpenAI reached $25 billion in annualized revenue by early 2026, up from $3.7 billion in 2025, but still projects a $14 billion loss for the year and does not expect to break even until 2030. Anthropic grew faster (reaching $19 billion in annualized revenue by March 2026, roughly 10x year-over-year) but also operates at a loss. The per-task cost is falling; the aggregate cost of operating the infrastructure that delivers those tasks continues to outpace revenue at both of the industry's largest inference providers.
The airline industry's transition from throughput to outcome metrics took roughly a decade and was driven by a single regulatory event. Before the Airline Deregulation Act of 1978, carriers measured success in Available Seat Miles, a pure throughput metric that rewarded putting seats in the air regardless of whether anyone sat in them. Load factors hovered around 50%, meaning half of all airline capacity flew empty. After deregulation, the metric that mattered shifted to Revenue Passenger Miles and eventually to Revenue per Available Seat Mile (RASM), which captured not just whether seats were full but whether the revenue from filling them exceeded the cost of flying them. Bob Crandall, CEO of American Airlines, called the resulting yield management discipline "the single most important technical development in transportation management since we entered deregulation." Load factors climbed from 50% to over 85%, and inflation-adjusted yields fell 44.9% from 1978 to 2011 while airlines became more profitable, because they optimized for economic output rather than physical throughput.
Telecom experienced a parallel collapse. When mobile data revenue surpassed voice revenue in the United States in 2013, the industry's primary KPI, voice minutes, ceased to describe the business. Verizon formalized this in June 2012 by launching "Share Everything" plans that priced voice at zero and billed exclusively for data. Global mobile ARPU collapsed from $35 in 2000 to roughly $10 by 2025, yet the total economic value transacted over mobile networks grew by orders of magnitude, because the metric had shifted from minutes (throughput) to ecosystem value (outcome).
The Toyota Production System offers the manufacturing parallel. General Motors measured units produced per hour, which rewarded running assembly lines at maximum speed regardless of defect rates. Toyota, under Taiichi Ohno, measured the percentage of value-added work: what fraction of all activity is something a customer would pay for? The NUMMI joint venture demonstrated the consequence directly: at the same plant, using the same union workforce, Toyota's system required 19 labor-hours per car compared to GM's 31, with one-third the defect rate. GM had that outcome data from NUMMI inside its own organization for roughly 15 years before acting on it.
The power generation industry provides the most direct analogy to AI infrastructure. Utilities historically measured installed capacity in megawatts, a throughput metric that rewarded building generation assets regardless of when or whether they dispatched. After electricity deregulation, particularly Texas Senate Bill 7 (1999) and the creation of ERCOT's energy-only market, the metric that mattered became revenue per megawatt-hour at time of dispatch. California's failure to make this transition contributed to the 2000-2001 energy crisis: the state had 45 GW of installed capacity against 28 GW of peak demand, apparently abundant throughput, but the market design failed to ensure that dispatchable supply matched demand in real time. Measuring capacity rather than economic dispatch left the system exposed to exactly that mismatch, enabling an estimated $40 to $45 billion in damage through market manipulation.
| Dimension | Tokens Per Watt (Throughput) | Cognitive Yield (Outcome) |
|---|---|---|
| Unit of measure | Tokens generated per watt-hour | Tasks completed or decisions improved per watt-hour |
| What it rewards | Faster inference, denser hardware, longer context | Fewer tokens per useful outcome, lower rework |
| Waste visibility | None; hallucinations, retries, and rejected tokens all count as output | Explicit; only verified, accepted output counts |
| Capital allocation signal | Build more capacity | Improve utilization and task-completion rate first |
| Historical parallel | ASM (airlines), voice minutes (telecom), units/hour (manufacturing) | RASM (airlines), ARPU (telecom), value-stream efficiency (manufacturing) |
| GPU utilization sensitivity | Indifferent; idle GPUs are invisible | Central; idle capacity directly dilutes yield |
| Reasoning overhead | Positive; more thinking tokens = more throughput | Negative unless thinking tokens improve task accuracy |
| Agent workflow treatment | Positive; agents generate enormous token volume | Neutral to negative; loop waste and tool-loading overhead are visible costs |
| Current adoption | Industry standard for infrastructure valuation | Experimental (Stanford IPW, Hugging Face AI Energy Score, PrismML) |
| Trade-off | Simple, measurable, comparable across vendors; hides economic waste | Reveals true efficiency; requires task-level instrumentation that most deployments lack |
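To make the contrast concrete, a minimal worked example with invented numbers (nothing below comes from a real deployment):

```python
# Hypothetical deployment, invented numbers: the same workload scored by the
# throughput column vs. the outcome column of the table above.
tokens_generated = 500_000_000       # input + output + thinking tokens
energy_kwh = 1_200                   # metered inference energy
tasks_verified_complete = 26_000     # resolved / merged / accepted without rework

tokens_per_kwh = tokens_generated / energy_kwh
cognitive_yield = tasks_verified_complete / energy_kwh

print(f"Throughput metric : {tokens_per_kwh:,.0f} tokens/kWh")
print(f"Outcome metric    : {cognitive_yield:,.1f} verified tasks/kWh")
# Retries, loops, and discarded thinking tokens inflate the first number
# and dilute the second, which is the divergence the table describes.
```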
Stanford's Intelligence Per Watt metric, Hugging Face's AI Energy Score initiative, and PrismML's intelligence density metric demonstrate that outcome-oriented measurement is technically feasible across hardware, model, and deployment layers. The 5.3x IPW improvement from 2023 to 2025 proves that efficiency gains are real and accelerating, and Google reported a 44x reduction in total emissions per median Gemini Apps prompt over a single year of development. The fundamental obstacle is instrumentation: measuring cognitive yield requires knowing whether a task was completed successfully, which demands integration between the inference layer and the business process layer. Only 6% of organizations qualify as "AI high performers" with more than 5% of EBIT attributable to AI, according to McKinsey's 2026 survey, and fewer than one in five track KPIs for generative AI solutions at all. Klarna and JPMorgan Chase represent exceptions precisely because they built their AI systems around business outcomes from the outset.
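What that integration might look like at its simplest is sketched below. The field names and schema are hypothetical; the only substantive point is that computing yield requires a task-level join between inference telemetry and business-outcome records.

```python
# Hypothetical instrumentation sketch: joining inference-layer records to
# business-outcome records so yield can be computed per task. Field names
# are invented; a real system would map them onto its own tracing and
# ticketing/review schemas.
from dataclasses import dataclass

@dataclass
class InferenceRecord:
    task_id: str
    tokens_in: int
    tokens_out: int
    energy_wh: float              # metered or estimated per-request energy

@dataclass
class OutcomeRecord:
    task_id: str
    completed: bool               # e.g. ticket resolved, code merged, doc accepted
    rework_required: bool

def cognitive_yield(inferences, outcomes):
    """Verified completions per kWh across tasks that have both records."""
    outcome_by_id = {o.task_id: o for o in outcomes}
    energy_kwh = sum(r.energy_wh for r in inferences) / 1000
    task_ids = {r.task_id for r in inferences}
    completed = sum(
        1 for tid in task_ids
        if (o := outcome_by_id.get(tid)) and o.completed and not o.rework_required
    )
    return completed / energy_kwh if energy_kwh else 0.0
```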
The $700 billion in 2026 AI capex from Amazon, Google, Meta, and Microsoft is allocated on the basis of projected token demand, GPU throughput benchmarks, and competitive positioning rather than useful cognitive work per unit of energy. Gartner projects that more than 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs and unclear business value, which is the predictable consequence of investing on throughput assumptions without outcome validation. J.P. Morgan estimates that annual data center funding needs will surge from $700 billion in 2026 to more than $1.4 trillion by 2030, yet the revenue required to justify even the current spend ($650 billion annually for a 10% return) has no clear path to materialization.