Defensible Deep Research with Open-Weight Models

I've been building a custom harness for myself, like everyone else doing harness engineering right now. While working on mine, I became very interested in getting deep research right: effective, cheap, and trustworthy. Mostly I wanted to know: can the harness delegate reading and drafting while keeping final claims tied to sources and checked for accuracy?

Architecture first: The coordinator stays responsible for source selection, routing, and final judgment.

Cheap middle passes: Long reading and drafting move to lower-cost open-weight workers where the work is bounded.

Verification boundary: The check can catch unsupported claims when source text is present, but weak source material still has to be labeled.

Parts & Services

Reading and drafting move to open-weight workers; source tags, confidence tiers, and coordinator checks keep the output usable.

The first handoff

The first handoff keeps Gemma out of open-ended research. The coordinator searches, opens pages, and decides what belongs in the source set. The fetched material is flattened into a markdown source pack before it reaches the worker.

Search and select: The coordinator handles live web search, source choice, and the first judgment about relevance.

Flatten to markdown: Pages become a stable source pack with titles, URLs, source tags, and task boundaries.

Compress with Gemma: Gemma gets a bounded reading job: preserve figures, qualifiers, and source tags.

I set it up this way because web search and source judgment are where I want the stronger model. The lower-cost worker gets a source pack and a compression task, not a blank research assignment.
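
The flattening step can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `FetchedPage` type, the `build_source_pack` function, the tag format, and the example pages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FetchedPage:
    tag: str    # short source tag the worker must keep, e.g. "S1" (hypothetical format)
    title: str
    url: str
    text: str   # page text already extracted by the coordinator


def build_source_pack(task: str, pages: list[FetchedPage]) -> str:
    """Flatten fetched pages into one markdown document for the worker."""
    parts = [
        "# Source pack",
        "",
        f"Task: {task}",
        "Boundary: use only the sources below; keep figures, qualifiers, and source tags.",
    ]
    for page in pages:
        # One section per source so every note can point back to a tag.
        parts += ["", f"## [{page.tag}] {page.title}", f"URL: {page.url}", "", page.text.strip()]
    return "\n".join(parts) + "\n"


# Placeholder pages; real ones would come from the coordinator's fetch step.
pages = [
    FetchedPage("S1", "Example transformer article", "https://example.com/a", "Placeholder text A."),
    FetchedPage("S2", "Example queue article", "https://example.com/b", "Placeholder text B."),
]
pack = build_source_pack("Compress each source into tagged notes.", pages)
print(pack)
```

Putting the tag in the section heading means the worker sees it next to every passage it compresses, which makes dropped tags easier to spot later.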

A bit about the harness itself: I think of it as a terminal workbench with six practical pieces.

Coordinator: A frontier model owns the conversation, fetches sources, routes work, and makes the final call.

Worker registry: Named lower-cost models are available for bounded reading, compression, and drafting jobs.

Delegation tools: delegate_worker sends one brief; delegate_batch fans out independent briefs in parallel.

Prompt fragments: The run starts with instructions about available workers, when to delegate, and what must be verified.

Run log: Handoffs and outputs are preserved so I can trace mistakes back to source selection, compression, synthesis, or verification.

Learning layer: Reusable procedures are a design goal; no dynamic procedure builder is running in this article.
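
To make the registry, delegation tools, and run log concrete, here is a minimal sketch. Only the names `delegate_worker` and `delegate_batch` come from the harness; their signatures, the `Brief` type, the registry layout, and the stub workers are assumptions made for illustration. The stubs stand in for real model endpoints so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Brief:
    worker: str   # name in the worker registry
    task: str     # bounded instruction for the worker
    payload: str  # source pack or notes handed to the worker


def _stub_worker(name: str) -> Callable[[Brief], str]:
    # Hypothetical stand-in; a real entry would call a model endpoint.
    def run(brief: Brief) -> str:
        return f"[{name}] {brief.task}: {len(brief.payload)} chars read"
    return run


REGISTRY = {"gemma": _stub_worker("gemma"), "nemotron": _stub_worker("nemotron")}
RUN_LOG: list[dict] = []


def delegate_worker(brief: Brief) -> str:
    """Send one brief to one named worker and record the handoff."""
    output = REGISTRY[brief.worker](brief)
    RUN_LOG.append({"worker": brief.worker, "task": brief.task, "output": output})
    return output


def delegate_batch(briefs: list[Brief], max_workers: int = 4) -> list[str]:
    """Fan out independent briefs in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(delegate_worker, briefs))


notes = delegate_batch([
    Brief("gemma", "compress shard 1", "placeholder shard one"),
    Brief("gemma", "compress shard 2", "placeholder shard two"),
])
draft = delegate_worker(Brief("nemotron", "synthesize", "\n".join(notes)))
print(draft)
print(len(RUN_LOG), "handoffs logged")
```

Logging inside `delegate_worker` means batch and single handoffs land in the same log, which is what makes it possible to trace a mistake back to a specific stage.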

The research test case was a datacenter supply-chain briefing for 2026: transformers, power delivery, interconnection queues, 800 VDC architecture, packaging constraints, and weak signals around solid-state transformers.

The plumbing is straightforward: the coordinator fetches sources and decides what work can leave its own context, Gemma 4 31B-it compresses source shards in parallel, Nemotron 3 Ultra writes from those compressed notes, and the final pass returns to the coordinator for source-sensitive checks. The check is mostly mechanical: find the source sentence behind the number or claim, then keep, correct, or cut it.
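
The mechanical part of that check can be approximated with plain string matching. The sketch below is an assumption about how such a check might look, not the coordinator's actual procedure: it pulls the figures out of a claim and looks for a source sentence that contains all of them. The function names, the regular expression, and the source texts are hypothetical placeholders. A miss is marked "review" because deciding whether to correct or cut still needs judgment.

```python
import re

# Matches figures such as 2,100 or $1.8 or 45% or 2028.
NUMBER = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")


def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_support(claim: str, source_text: str) -> list[str]:
    """Return source sentences that contain every figure in the claim."""
    figures = NUMBER.findall(claim)
    if not figures:
        return []  # nothing numeric to anchor on; leave it for manual review
    return [s for s in split_sentences(source_text) if all(f in s for f in figures)]


def check_claim(claim: str, sources: dict[str, str]) -> dict:
    """Map a claim to supporting sentences per source tag."""
    hits = {tag: find_support(claim, text) for tag, text in sources.items()}
    hits = {tag: sentences for tag, sentences in hits.items() if sentences}
    return {"claim": claim, "verdict": "keep" if hits else "review", "support": hits}


# Placeholder source texts, invented for the example.
sources = {
    "S1": "Placeholder: the queue exceeded 2,100 GW. Timelines vary.",
    "S2": "Nothing relevant here.",
}
print(check_claim("The queue is above 2,100 GW.", sources))
print(check_claim("Capacity slips 45% by 2028.", sources))
```

Substring matching is deliberately crude: it can match a figure attached to the wrong subject, so a "keep" here only means a candidate sentence exists for the coordinator to read.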

Methodology & Confidence

The datacenter supply-chain report included more than the main analysis. It also included a method note explaining how sources moved through the chain and a confidence table separating corroborated claims from single-source or low-reliability ones. That helps readers see which parts of the analysis are strong and which parts need caution.

01 Fetch sources: The coordinator gathers live source material and decides how to split the reading.

02 Compress notes: Gemma preserves figures with subjects, source tags, and reliability hints.

03 Synthesize: Nemotron writes from supplied notes and is told to keep gaps and hedges visible.

04 Verify claims: The coordinator checks source-sensitive claims before the report is treated as usable.

The workflow does not make research deterministic. It adds friction where mistakes matter: figures carry source tags, weak evidence is labeled instead of smoothed over, and high-impact claims are checked before the report is treated as usable.
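
The four steps can be wired together as a short chain. This is a sketch under stated assumptions: `run_chain` and the three stub stages are hypothetical, the shards stand in for material the coordinator has already fetched (step 01), and the stubs replace the real Gemma, Nemotron, and coordinator calls.

```python
from typing import Callable


def run_chain(
    shards: list[str],
    compress: Callable[[str], str],
    synthesize: Callable[[list[str]], str],
    verify: Callable[[str, list[str]], str],
) -> dict:
    """Run steps 02 to 04 over source shards that were already fetched."""
    notes = [compress(shard) for shard in shards]  # 02: one bounded job per shard
    draft = synthesize(notes)                      # 03: write only from the notes
    report = verify(draft, shards)                 # 04: check against source text
    return {"notes": notes, "draft": draft, "report": report}


# Stub stages so the sketch runs without any model calls.
def stub_compress(shard: str) -> str:
    return shard.split(".")[0] + "."  # keep the first sentence only


def stub_synthesize(notes: list[str]) -> str:
    return "\n".join(notes)


def stub_verify(draft: str, shards: list[str]) -> str:
    joined = " ".join(shards)
    # Keep only draft lines that appear verbatim in the source text.
    return "\n".join(line for line in draft.split("\n") if line in joined)


result = run_chain(
    ["Placeholder fact A. Extra detail.", "Placeholder fact B. More detail."],
    stub_compress,
    stub_synthesize,
    stub_verify,
)
print(result["report"])
```

Passing the original shards into `verify`, rather than the compressed notes, reflects the point of the last step: the check goes back to source text, not to the worker's summary of it.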

The Skepticism Survives

The report did not just produce a conclusion. It preserved the parts that still needed caution: weak sources, unresolved claims, and risks that could change the answer. That is what made it useful. It gives readers something they can act on more carefully, not just something that sounded finished.

Over-build risk

If the transformer constraint is order priority rather than true capacity, the roughly $1.8B North American expansion could overshoot into a softer 2028 market.

Unknown conversion

The 2,100+ GW interconnection queue is not the same as energized capacity, and the source set did not quantify how much converts by 2028.

Market-size fragility

The SST growth story leaned on single-firm CAGR estimates and unverified cost-curve figures, so the report kept those claims in the low-reliability tier.

Example: Confidence Tiering from a Fresh Report

Tier	Claims Placed There
Well-corroborated	Transformer lead-time elongation, pricing resets, and demand growth; roughly $1.8B in North American plant investments with named sites and dates; the 800 VDC shift; interconnection queues above 2,100 GW with three-to-seven-year timelines; CoWoS and HBM trajectory.
Single-source / analyst	Omdia's 30-50% slip projection; Wood Mackenzie survey specifics and China share estimate; Rolls-Royce revenue-share claims; Epoch AI packaging-versus-logic assessment; vendor technical claims from Wolfspeed; DG Matrix pipeline and priority-access claims.
Low reliability	SST market-size figures with wide disagreement across market reports; SiC price-decline and system-cost figures from a single unverified source; delay framing where the Omdia figure was the safer anchor.
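
One way to express the tiering as a rule is sketched below. The rule is an assumption for illustration: the article does not state how tiers were assigned, and the `Claim` type, the `tier` function, the two-source threshold, and the placeholder source tags are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    sources: set[str] = field(default_factory=set)  # independent sources behind the claim
    unreliable: bool = False  # e.g. unverified source or wide disagreement


def tier(claim: Claim) -> str:
    """Assign a confidence tier; weak evidence is never promoted by source count."""
    if claim.unreliable or not claim.sources:
        return "Low reliability"
    if len(claim.sources) >= 2:
        return "Well-corroborated"
    return "Single-source / analyst"


# Source tags S1, S2, R1, R2 are placeholders, not the report's actual sources.
claims = [
    Claim("Interconnection queues above 2,100 GW", {"S1", "S2"}),
    Claim("Omdia's 30-50% slip projection", {"Omdia"}),
    Claim("SST market-size figures", {"R1", "R2"}, unreliable=True),
]
for claim in claims:
    print(f"{tier(claim)}\t{claim.text}")
```

Checking the reliability flag first is the design choice that matters: several thin market reports that disagree with each other should stay in the low tier rather than count as corroboration.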

Generated Artifact

A single-page report produced by the same research workflow. Excluding coordinator subscription time, the open-weight compression and synthesis passes cost roughly $0.008. The left capture shows the finished brief format; the right capture shows the method and confidence section that came back with the report.


Figure (left): Finished HTML report, showing the top of the generated datacenter supply-chain report.

Figure (right): Methodology and confidence tiers, from the methodology and confidence section of the generated report.

The Models in the Run

The model split was intentionally narrow. Gemma handled bounded compression jobs over fetched source shards. Nemotron handled the long-context synthesis from those notes. The coordinator stayed responsible for source selection, routing, and the final checks.

◆ Phase 1 · Compression

Gemma 4 31B-it FP8

Reads the fetched sources and compresses them into lean, source-tagged notes for the synthesizer.

Together · 256K ctx · $0.39 / $0.97 per 1M

multimodal · fast · front-end

◆ Phase 2 · Synthesis

Nemotron 3 Ultra 550B-A55B NVFP4

Holds the compressed notes in one context and writes the cited analysis.

Fireworks · 262K ctx · $0.60 / $2.40 per 1M

long-context · MoE · finalizer

Bracket · Orchestration

Coordinator

Fetches live sources, runs the chain, and audits quantitative claims. The harness can swap coordinator models; these runs used Opus 4.8 and GPT-5.5.

Subscription · web-capable

frontier · web · verify

Why the cost profile works: excluding coordinator subscription time, the open-weight compression and synthesis passes for the example report cost roughly $0.008. The expensive coordinator is mostly reserved for routing, source review, and verification; the long reading and drafting passes run on lower-cost endpoints.
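
The arithmetic behind a per-pass cost is simple enough to sketch. The prices below are the ones listed on the model cards above; reading each pair as input price then output price per 1M tokens is an assumption. The token counts are invented for illustration and are not the measured counts from the run, so the total is not meant to reproduce the $0.008 figure.

```python
# USD per 1M tokens as (input, output); the input/output reading is an assumption.
PRICES = {
    "gemma": (0.39, 0.97),
    "nemotron": (0.60, 2.40),
}


def pass_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one pass, given token counts and per-1M prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical token counts, chosen only to show the calculation.
compression = pass_cost("gemma", input_tokens=12_000, output_tokens=2_000)
synthesis = pass_cost("nemotron", input_tokens=2_000, output_tokens=1_000)
total = compression + synthesis
print(f"compression ${compression:.5f} + synthesis ${synthesis:.5f} = ${total:.5f}")
```

Because compression shrinks what the synthesizer has to read, most of the input tokens are billed at the cheaper endpoint's rate.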

For the datacenter report, I wanted a clean split between reading work and judgment work. The lower-cost passes could compress and draft from supplied material. The coordinator still had to decide what belonged in the source set and whether the final claims were strong enough to use.

3 confidence tiers in the report

6 risks and unknowns preserved

$0.008 open-weight passes for the artifact

What Stayed Labeled

The datacenter report mixed very different kinds of evidence. Some claims were backed by multiple sources: transformer lead-time elongation, named North American plant investments, the 800 VDC shift, interconnection queue scale, and packaging constraints. Other claims were thinner: analyst projections, vendor technical claims, and market-size estimates for technologies that are not yet deployed at scale.

What the Check Catches

What I actually trust is more limited: the workflow is useful when it keeps those differences visible. The verification pass does not turn a single analyst estimate into a settled fact. It ties the claim to its source, preserves the hedge, and keeps weak evidence out of the high-confidence tier.

Datacenter Report Checks

Pattern	What showed up	What the harness learned
Corroborated constraints	Transformer lead times, named North American plant investments, 800 VDC movement, interconnection queues above 2,100 GW, and packaging constraints had enough support to carry the main thesis.	Let well-sourced claims anchor the report, but keep the source tags close to the numbers.
Single-source claims	Omdia's 30-50% slip projection, Wood Mackenzie survey specifics, Rolls-Royce revenue-share claims, and vendor technical claims were useful but not strong enough to flatten into settled fact.	Use those claims as analyst or vendor evidence, and label them that way in the output.
Low-reliability signals	Solid-state transformer market-size figures, SiC price-decline estimates, and system-cost estimates disagreed across thin market-report sources.	Preserve the uncertainty instead of letting fragile numbers become the backbone of the argument.

That is the behavior I want from the workflow. It does not bury low-confidence figures, and it does not make the report sound more settled than the source set supports. For a topic like datacenter supply-chain risk, that is the difference between a useful briefing and a summary that only sounds complete.

Assessment

Can You Trust It?

YES: The Architecture

With the coordinator verifying, the chain can produce checked, cited analysis that readers can inspect against the source material. In this run, the important numeric claims trace back to fetched pages, which is the standard the rest of this site is held to.

MIXED: The Synthesizer

Nemotron 3 Ultra is useful for structure, argument, and prose, but its numbers stay provisional until the coordinator checks them against the source text. I treat it as a drafter, not as the authority on the facts.

NO: Unattended

Without the verification pass, I would not treat the raw open-weight output as publishable. The report is useful because the coordinator ties final claims to fetched sources and keeps weak claims labeled.

Key Concepts

Two-phase chain: Compression by Gemma followed by synthesis by Nemotron, bracketed by a coordinator that fetches and verifies.

Coordinator: The swappable, web-capable model that fetches sources, orchestrates the phases, verifies, and cites.

Compression front-end: The phase that reduces fetched pages to lean, source-tagged notes for the synthesizer.

Synthesis finalizer: The long-context phase that writes the cited analysis from the notes.

Confidence tier: A label that separates corroborated claims from single-source claims and low-reliability signals.

Risk register: The part of the report that keeps thesis-breaking risks and unresolved questions visible.

Verification pass: The coordinator's audit of quantitative claims in the synthesis against the fetched source text.

Conclusions

Lessons from the Workbench

A verifying coordinator makes low-cost open-weight synthesis usable for serious research. The architecture, not the synthesizer, is the product. A frontier model auditing quantitative claims against fetched sources can help a sub-cent open-weight chain produce reports that readers can inspect and challenge.

The useful output is the answer plus the caveats. The datacenter report mattered because it kept weak sources, unresolved claims, and thesis-breaking risks visible instead of smoothing them into one confident answer.

Grounding needs source quality, not just citations. A claim tied to a source is not automatically strong. The workflow works better when it distinguishes corroborated evidence, analyst estimates, vendor claims, and fragile market signals.

The economics are real when the worker jobs are bounded. The open-weight compression and synthesis passes for the example report cost roughly $0.008, because the expensive coordinator stayed focused on routing, source judgment, and verification.

Sources: Datacenter supply-chain report run, June 2026; fetched source pack and confidence table from the generated report, including Wood Mackenzie, Power Magazine, NVIDIA, Vertiv, Open Compute Project, Delta, Omdia, Epoch AI, and vendor materials cited in the report.