Stanford's Hazy Research lab measured a 5.3x improvement in "intelligence per watt" (IPW) from 2023 to 2025, yet global AI datacenter energy consumption doubled over the same period and annual AI infrastructure capex from Amazon, Google, Meta, and Microsoft reached $700 billion in 2026. Per-token efficiency is improving faster than in any comparable technology transition, while aggregate cognitive output, measured against the capital deployed to produce it, shows signs of declining. The disconnect, as Stanford's own IPW researchers have noted, suggests that the industry's dominant metric, tokens per watt, captures hardware performance without capturing whether the output is useful.
"Tell me how you will measure me, and I will tell you how I will behave. If you measure me in an illogical way, do not complain about illogical behaviour."Eli Goldratt, Theory of Constraints
Sequoia Capital quantified the financial expression of this measurement gap in mid-2024, estimating a $600 billion annual revenue shortfall between AI infrastructure spending and the revenue that infrastructure would need to generate to justify itself. J.P. Morgan subsequently calculated that earning a 10% return on the current AI buildout would require $650 billion in annual revenue, equivalent to a perpetual payment of roughly $35 per month from every iPhone user or $180 per month from every Netflix subscriber. These figures frame the question in terms of financial throughput (revenue). The deeper question is whether the physical throughput metric that drives infrastructure investment, tokens per watt, even correlates with the economic value being produced.
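A rough sanity check shows why the per-subscriber framing only works on a monthly basis. The subscriber counts below are commonly cited approximations used for illustration, not figures from the J.P. Morgan analysis itself:

```python
# Rough sanity check on J.P. Morgan's $650 billion annual revenue requirement.
# Subscriber counts are approximations used for illustration only.
required_annual_revenue = 650e9      # dollars per year

iphone_users = 1.5e9                 # assumed active iPhone users
netflix_subscribers = 300e6          # assumed Netflix subscribers

per_iphone_user_monthly = required_annual_revenue / iphone_users / 12
per_netflix_sub_monthly = required_annual_revenue / netflix_subscribers / 12

print(f"Per iPhone user:        ${per_iphone_user_monthly:.0f}/month")    # ~$36
print(f"Per Netflix subscriber: ${per_netflix_sub_monthly:.0f}/month")    # ~$181
```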
Every major industry that has matured through a capital-intensive buildout phase has eventually abandoned its initial throughput metric in favor of an outcome metric. Airlines replaced Available Seat Miles with Revenue per Available Seat Mile; telecom carriers abandoned circuit-switched minutes in favor of Average Revenue Per User; and manufacturing shifted from units-per-hour to value-stream efficiency. In each case, the throughput metric masked waste, rewarded overproduction, and directed capital toward capacity rather than capability. AI infrastructure is at the same inflection point, and the stakes, measured in both energy and capital, have never been higher for a technology this early in its deployment curve.
The gap between tokens generated and useful cognitive work delivered is not a single inefficiency but a stack of compounding losses, each invisible to the tokens-per-watt metric. OpenRouter now processes over 20 trillion tokens per week (a 12.7x year-over-year increase), and the average inference call consumes roughly 6,000 input tokens to produce 400 output tokens: a 15:1 ratio in which users see approximately 6.3% of all tokens processed. Early 2024 ratios were closer to 10:1, meaning the shift toward longer system prompts, tool definitions, and agent frameworks has worsened the ratio even as per-token costs have fallen. Programming tasks now account for more than 50% of all token consumption on OpenRouter, up from 11% at the start of 2025, and a single AI coding agent can consume up to 20 million tokens on a relatively small task.
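The visibility arithmetic follows directly from those averages; a short sketch:

```python
# Share of processed tokens the end user actually sees, using the average
# call shape cited above: ~6,000 input tokens for ~400 output tokens.
input_tokens = 6_000
output_tokens = 400

total_processed = input_tokens + output_tokens
visible_fraction = output_tokens / total_processed

print(f"Input:output ratio : {input_tokens / output_tokens:.0f}:1")   # 15:1
print(f"Visible to user    : {visible_fraction:.2%}")                 # 6.25%
```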
Reasoning models introduce a separate layer of overhead. Chain-of-thought "thinking" tokens, invisible to the end user, now represent the majority of token volume on major inference platforms. The NoWait group (2025) found that 27% to 51% of these thinking tokens can be removed without compromising output accuracy, and subsequent work has confirmed the scale of this waste: ThinkPrune (2026) demonstrated a 50% reduction in reasoning length with only a 2% accuracy drop on AIME24, while REFRAIN achieved 20% to 55% fewer tokens while maintaining or exceeding baseline accuracy. State-of-the-art reasoners routinely consume more than 15,000 tokens for math problems solvable with a few hundred, and DeepSeek-R1 generates roughly 4,000 thinking tokens for simple coding questions where GPT-4o uses 150. Reasoning models still account for more than 50% of all token usage on OpenRouter, which makes this waste category the single largest contributor to the gap between throughput and outcome.
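A hedged illustration of how those findings compound: the thinking-token share used below is an assumption, since the text above says only that such tokens are the majority of volume.

```python
# Illustrative only: if thinking tokens are ~60% of total volume (assumed)
# and 27%-51% of them are removable per the NoWait finding, how much of all
# token volume is pure reasoning overhead?
thinking_share = 0.60
removable_low, removable_high = 0.27, 0.51

waste_low = thinking_share * removable_low
waste_high = thinking_share * removable_high

print(f"Removable share of all tokens: {waste_low:.0%} to {waste_high:.0%}")
# -> roughly 16% to 31% of total token volume under this assumption
```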
Speculative decoding, now standard in production inference stacks including vLLM, SGLang, and TensorRT-LLM, introduces its own waste stream. Draft tokens that the verifier model rejects represent pure energy expenditure with zero output. Acceptance rates in production range from 0.60 to 0.85 depending on task type, with creative and open-ended generation at the low end (35-50% rejection) and structured code generation at the high end (15-25% rejection). Below an acceptance rate of 0.55, speculative decoding provides marginal or negative net benefit, consuming more energy than sequential generation would require.
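The break-even logic can be sketched with a toy energy model. All parameters below (draft size, draft-model cost, per-draft verification overhead) are assumptions chosen for illustration, not measurements from vLLM, SGLang, or TensorRT-LLM; under these assumptions the crossover happens near the acceptance threshold cited above.

```python
# Toy energy model for speculative decoding, illustrative parameters only.
# Baseline: one verifier step per generated token costs 1 energy unit.
# Each speculative round proposes K draft tokens and runs one verification
# pass; expected accepted tokens follow the standard geometric formula
# (including the verifier's bonus token).
K = 4               # draft tokens proposed per round (assumption)
DRAFT_COST = 0.10   # draft-model cost per token, relative to verifier (assumption)
VERIFY_EXTRA = 0.15 # marginal cost of scoring each extra draft position (assumption)

def energy_per_accepted_token(acceptance: float) -> float:
    expected_accepted = (1 - acceptance ** (K + 1)) / (1 - acceptance)
    round_cost = K * DRAFT_COST + 1.0 + K * VERIFY_EXTRA
    return round_cost / expected_accepted

for a in (0.40, 0.55, 0.70, 0.85):
    rel = energy_per_accepted_token(a)   # baseline sequential cost is 1.0
    print(f"acceptance={a:.2f}  energy vs. sequential: {rel:.2f}x")
# Under these assumed costs the crossover sits near the ~0.55 acceptance
# rate cited above; different draft/verify cost ratios move it.
```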
Agentic workflows compound all of these losses. A 30-tool agent consumes more than 21,000 tokens loading tool definitions before performing any work, and agentic systems require 5 to 30 times more tokens per task than standard chat interactions. One documented case involved an AI agent executing 47 iterations of the same failed database command, burning $30 in compute on a $0.50 problem. Recent research quantifies the opportunity: SupervisorAgent (2026) reduced multi-agent token costs by 35% and variance by 63%, while CodeAgents improved accuracy by 24.4% while cutting token volume by 72.2% on HotpotQA. AgentDiet achieved input token reductions of 39.9% to 59.7%. Agent-driven workflows now generate more than half of all output tokens on OpenRouter, which means the fastest-growing category of AI usage is also the one with the most room for efficiency gains.
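The fixed overhead alone is easy to estimate. The sketch below takes the 30-tool and 21,000-token figures from above and assumes a hypothetical 12-call agent loop that re-sends tool definitions on every call, a common if naive pattern:

```python
# Illustrative estimate of fixed agent overhead before any useful work.
tools = 30
tokens_per_tool_definition = 21_000 / tools     # ~700 tokens per tool, implied above

calls_per_task = 12                             # assumed agent loop length
overhead_tokens = tools * tokens_per_tool_definition * calls_per_task

print(f"Tool-definition tokens per call : {tools * tokens_per_tool_definition:,.0f}")
print(f"Overhead across one task        : {overhead_tokens:,.0f} tokens")
# ~252,000 tokens consumed before counting any prompt, reasoning, or output.
```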
A token has no intrinsic economic value. Its value is entirely a function of the task it contributes to completing, and most tokens contribute to nothing the end user requested. Defining "useful cognitive work" therefore requires stepping outside the inference pipeline and measuring what happens after the tokens are generated: did a customer inquiry get resolved, did a line of code survive review and reach production, did the medical image classification prove accurate, or did the legal document pass without requiring human rework?
The evidence on task-level economics is encouraging even as the aggregate efficiency picture is bleak. Klarna reported that its AI chatbot handles two-thirds of all customer inquiries, reducing cost per transaction from $0.32 to $0.19 and producing $40 million in verified cost savings during 2024, but the company subsequently began re-hiring human agents after concluding that customers need the option to speak with a person. JPMorgan Chase deploys 450 AI use cases in production (scaling toward 1,000), with AI-generated advertising copy delivering click-through rates up to 450% higher than human-written alternatives. GPT-4 completes contract reviews in under 5 minutes compared to 56 minutes for junior attorneys, a 99.97% cost reduction on a per-task basis for routine document review. Klarna's reversal is instructive: even when the throughput metrics showed clear savings, the outcome metric (customer satisfaction and resolution quality) told a different story.
These task-level gains are real, but they represent the numerator of a fraction whose denominator, total energy and capital deployed, is growing faster. LLM inference prices have fallen at a median rate of 50x per year since 2023 according to Epoch AI, with GPT-3.5-level performance dropping roughly 300x in cost from 2023 to 2026. OpenAI reached $25 billion in annualized revenue by early 2026, up from $3.7 billion in 2025, but still projects a $14 billion loss for the year and does not expect to break even until 2030. Anthropic grew faster (reaching $19 billion in annualized revenue by March 2026, roughly 10x year-over-year) but also operates at a loss. The per-task cost is falling; the aggregate cost of operating the infrastructure that delivers those tasks continues to outpace revenue at both of the industry's largest inference providers.
The airline industry's transition from throughput to outcome metrics took roughly a decade and was driven by a single regulatory event. Before the Airline Deregulation Act of 1978, carriers measured success in Available Seat Miles, a pure throughput metric that rewarded putting seats in the air regardless of whether anyone sat in them. Load factors hovered around 50%, meaning half of all airline capacity flew empty. After deregulation, the metric that mattered shifted to Revenue Passenger Miles and eventually to Revenue per Available Seat Mile (RASM), which captured not just whether seats were full but whether the revenue from filling them exceeded the cost of flying them. Bob Crandall, CEO of American Airlines, called the resulting yield management discipline "the single most important technical development in transportation management since we entered deregulation." Load factors climbed from 50% to over 85%, and inflation-adjusted yields fell 44.9% from 1978 to 2011 while airlines became more profitable, because they optimized for economic output rather than physical throughput.
Telecom experienced a parallel collapse. When mobile data revenue surpassed voice revenue in the United States in 2013, the industry's primary KPI, voice minutes, ceased to describe the business. Verizon formalized this in June 2012 by launching "Share Everything" plans that priced voice at zero and billed exclusively for data. Global mobile ARPU collapsed from $35 in 2000 to roughly $10 by 2025, yet the total economic value transacted over mobile networks grew by orders of magnitude, because the metric had shifted from minutes (throughput) to ecosystem value (outcome).
The Toyota Production System offers the manufacturing parallel. General Motors measured units produced per hour, which rewarded running assembly lines at maximum speed regardless of defect rates. Toyota, under Taiichi Ohno, measured the percentage of value-added work: what fraction of all activity is something a customer would pay for? The NUMMI joint venture demonstrated the consequence directly: at the same plant, using the same union workforce, Toyota's system required 19 labor-hours per car compared to GM's 31, with one-third the defect rate. GM had that outcome data from NUMMI inside its own organization for roughly 15 years before acting on it.
The power generation industry provides the most direct analogy to AI infrastructure. Utilities historically measured installed capacity in megawatts, a throughput metric that rewarded building generation assets regardless of when or whether they dispatched. After electricity deregulation, particularly Texas Senate Bill 7 (1999) and the creation of ERCOT's energy-only market, the metric that mattered became revenue per megawatt-hour at time of dispatch. California's failure to make this transition contributed to the 2000-2001 energy crisis: the state had 45 GW of installed capacity against 28 GW of peak demand, apparently abundant throughput, but the market design failed to ensure that dispatchable supply matched demand in real time. Measuring capacity rather than economic dispatch left the system exposed to exactly that mismatch, enabling an estimated $40 to $45 billion in damage through market manipulation.
| Dimension | Tokens Per Watt (Throughput) | Cognitive Yield (Outcome) |
|---|---|---|
| Unit of measure | Tokens generated per watt-hour | Tasks completed or decisions improved per watt-hour |
| What it rewards | Faster inference, denser hardware, longer context | Fewer tokens per useful outcome, lower rework |
| Waste visibility | None; hallucinations, retries, and rejected tokens all count as output | Explicit; only verified, accepted output counts |
| Capital allocation signal | Build more capacity | Improve utilization and task-completion rate first |
| Historical parallel | ASM (airlines), voice minutes (telecom), units/hour (manufacturing) | RASM (airlines), ARPU (telecom), value-stream efficiency (manufacturing) |
| GPU utilization sensitivity | Indifferent; idle GPUs are invisible | Central; idle capacity directly dilutes yield |
| Reasoning overhead | Positive; more thinking tokens = more throughput | Negative unless thinking tokens improve task accuracy |
| Agent workflow treatment | Positive; agents generate enormous token volume | Neutral to negative; loop waste and tool-loading overhead are visible costs |
| Current adoption | Industry standard for infrastructure valuation | Experimental (Stanford IPW, Hugging Face AI Energy Score, PrismML) |
| Trade-off | Simple, measurable, comparable across vendors; hides economic waste | Reveals true efficiency; requires task-level instrumentation that most deployments lack |
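To make the contrast concrete, a minimal worked example with invented numbers (nothing below comes from a real deployment):

```python
# Hypothetical deployment, invented numbers: the same workload scored by the
# throughput column vs. the outcome column of the table above.
tokens_generated = 500_000_000       # input + output + thinking tokens
energy_kwh = 1_200                   # metered inference energy
tasks_verified_complete = 26_000     # resolved / merged / accepted without rework

tokens_per_kwh = tokens_generated / energy_kwh
cognitive_yield = tasks_verified_complete / energy_kwh

print(f"Throughput metric : {tokens_per_kwh:,.0f} tokens/kWh")
print(f"Outcome metric    : {cognitive_yield:,.1f} verified tasks/kWh")
# Retries, loops, and discarded thinking tokens inflate the first number
# and dilute the second, which is the divergence the table describes.
```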
Stanford's Intelligence Per Watt metric, Hugging Face's AI Energy Score initiative, and PrismML's intelligence density metric demonstrate that outcome-oriented measurement is technically feasible across hardware, model, and deployment layers. The 5.3x IPW improvement from 2023 to 2025 proves that efficiency gains are real and accelerating, and Google reported a 44x reduction in total emissions per median Gemini Apps prompt over a single year of development. The fundamental obstacle is instrumentation: measuring cognitive yield requires knowing whether a task was completed successfully, which demands integration between the inference layer and the business process layer. Only 6% of organizations qualify as "AI high performers" with more than 5% of EBIT attributable to AI, according to McKinsey's 2026 survey, and fewer than one in five track KPIs for generative AI solutions at all. Klarna and JPMorgan Chase represent exceptions precisely because they built their AI systems around business outcomes from the outset.
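What that integration might look like at its simplest is sketched below. The field names and schema are hypothetical; the only substantive point is that computing yield requires a task-level join between inference telemetry and business-outcome records.

```python
# Hypothetical instrumentation sketch: joining inference-layer records to
# business-outcome records so yield can be computed per task. Field names
# are invented; a real system would map them onto its own tracing and
# ticketing/review schemas.
from dataclasses import dataclass

@dataclass
class InferenceRecord:
    task_id: str
    tokens_in: int
    tokens_out: int
    energy_wh: float              # metered or estimated per-request energy

@dataclass
class OutcomeRecord:
    task_id: str
    completed: bool               # e.g. ticket resolved, code merged, doc accepted
    rework_required: bool

def cognitive_yield(inferences, outcomes):
    """Verified completions per kWh across tasks that have both records."""
    outcome_by_id = {o.task_id: o for o in outcomes}
    energy_kwh = sum(r.energy_wh for r in inferences) / 1000
    task_ids = {r.task_id for r in inferences}
    completed = sum(
        1 for tid in task_ids
        if (o := outcome_by_id.get(tid)) and o.completed and not o.rework_required
    )
    return completed / energy_kwh if energy_kwh else 0.0
```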
The $700 billion in 2026 AI capex from Amazon, Google, Meta, and Microsoft is allocated on the basis of projected token demand, GPU throughput benchmarks, and competitive positioning rather than useful cognitive work per unit of energy. Gartner projects that more than 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs and unclear business value, which is the predictable consequence of investing on throughput assumptions without outcome validation. J.P. Morgan estimates that annual data center funding needs will surge from $700 billion in 2026 to more than $1.4 trillion by 2030, yet the revenue required to justify even the current spend ($650 billion annually for a 10% return) has no clear path to materialization.