The public chatter on MiniMax M3 and Qwen3.7 Max, two open-weight coders, is that they are close: a near-tie on SWE-bench Verified, with Qwen3.7 Max reading slightly stronger on agentic and terminal work. To get a read on our own stack rather than someone else's leaderboard, both models drove the autonomous agent_loop of Thinkwright's custom harness, the same read/write/run-tests machinery a delegated coding worker uses, on four self-contained tasks, each graded by a hidden test the model never sees. Both models solved all four. The separation showed up not in correctness but in speed, cost, and how much work each spent getting there.

What the run showed

8 / 8 tasks solved (4 of 4 each)
1.6× MiniMax M3 average speed edge
−44% MiniMax M3 total cost vs Qwen

MiniMax M3 was faster on every task and cheaper on every task. Qwen3.7 Max reached the same answers with fewer tokens and fewer tool calls. The two findings sit together because they measure different things: Qwen ran a tighter trajectory, while MiniMax generated more but ran on faster, cheaper infrastructure. For a workflow that leans on open-weight models to do the coding under frontier review, the run favors MiniMax M3 as the default, with one honest qualifier on cost covered below.

MiniMax M3 · Fireworks

solved 4 / 4
avg latency 13.9 s
total tokens 45,437
total cost $0.0249
tool calls 22

Qwen3.7 Max · Together

solved 4 / 4
avg latency 21.8 s
total tokens 29,376
total cost $0.0443
tool calls 18

How the benchmark ran

Each model drove the harness's autonomous agent_loop: it reads files, writes files, and runs shell commands across many turns, then stops. Every task ran in a fresh isolated workspace seeded with starter code and a visible test_public.py. The grading test, grade.py, was copied in only after the model stopped, so the agent never saw the cases it was scored on. A run counts as solved only if that hidden grader exits clean. The four tasks were chosen for diversity across the skills the public benchmarks probe:

The four tasks

| Task | Type | What it exercises |
|---|---|---|
| lru_ttl | Implement from spec | An LRU cache with per-entry TTL: recency tracking, expiry, eviction edge cases. |
| stats_fix | Debug and fix | An off-by-one in a sliding-window function, located from a failing test (SWE-bench-style). |
| eventbus_feature | Feature add | A fire-once subscription added to a working pub/sub bus without breaking existing behavior. |
| csv_cli | Terminal tool | A CSV group-and-sum CLI, graded by running the program (terminal-bench-style). |
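
The isolation and hidden-grader flow described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `run_task`, the `agent` callable, and the directory layout are hypothetical names, and only `test_public.py` and `grade.py` come from the description.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def run_task(seed_dir: Path, grader: Path, agent: Callable[[Path], None]) -> bool:
    """Run one task in a fresh workspace and score it with a hidden grader.

    seed_dir holds the starter code and test_public.py. grader is the path
    to grade.py, kept outside the workspace until the agent has stopped.
    agent is a hypothetical stand-in for the harness's agent_loop.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        shutil.copytree(seed_dir, workspace)          # fresh, isolated copy of the seed
        agent(workspace)                              # agent reads, writes, runs tests
        shutil.copy(grader, workspace / "grade.py")   # grader is copied in only now
        result = subprocess.run(
            [sys.executable, "grade.py"],
            cwd=workspace,
            capture_output=True,
        )
        return result.returncode == 0                 # solved only on a clean exit
```

Keeping the grader outside the workspace until the agent returns is what guarantees, in this sketch, that the agent cannot read the cases it is scored on.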

Five numbers were captured per run: whether the hidden grader passed, wall-clock latency, total tokens, cost priced from each provider's published rate, and the number of tool calls the agent made. The two workers carry different prices, which matters for reading the cost result: MiniMax M3 bills $0.45 per million input tokens and $1.80 per million output tokens on Fireworks; Qwen3.7 Max bills $1.25 and $3.75 on Together. That is about 2.8 times more per input token and 2.1 times more per output token, or roughly 2.7 times more per token on the blended rates observed in this run.
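
To make the pricing arithmetic concrete, here is a small sketch of how a run's cost follows from its input and output token counts. The rates are the ones quoted above; the `Rate` type, the `run_cost` function, and the 9,000 / 1,000 token split are illustrative assumptions, since this write-up reports only total tokens per run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Price in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


# Rates quoted in the text above.
MINIMAX_M3 = Rate(input_per_mtok=0.45, output_per_mtok=1.80)
QWEN_MAX = Rate(input_per_mtok=1.25, output_per_mtok=3.75)


def run_cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run at the given per-million-token rates."""
    return (input_tokens * rate.input_per_mtok
            + output_tokens * rate.output_per_mtok) / 1_000_000


# Price ratios, Qwen over MiniMax, per token type.
input_ratio = QWEN_MAX.input_per_mtok / MINIMAX_M3.input_per_mtok
output_ratio = QWEN_MAX.output_per_mtok / MINIMAX_M3.output_per_mtok
print(f"input {input_ratio:.2f}x, output {output_ratio:.2f}x")

# Hypothetical split (not from the run): 9,000 input and 1,000 output tokens.
print(f"MiniMax M3:  ${run_cost(MINIMAX_M3, 9_000, 1_000):.4f}")
print(f"Qwen3.7 Max: ${run_cost(QWEN_MAX, 9_000, 1_000):.4f}")
```

Because the per-token gap is wider on input than on output, the effective price multiple between the two workers depends on each run's input/output mix.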

Per-run and per-worker

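Per run (all eight)

| Task | Worker | Solved | Latency | Tokens | Cost | Calls |
|---|---|---|---|---|---|---|
| csv_cli | MiniMax M3 | yes | 19.4 s | 13,974 | $0.0074 | 7 |
| csv_cli | Qwen3.7 Max | yes | 21.7 s | 8,273 | $0.0120 | 6 |
| eventbus_feature | MiniMax M3 | yes | 12.3 s | 10,375 | $0.0059 | 5 |
| eventbus_feature | Qwen3.7 Max | yes | 22.5 s | 6,788 | $0.0105 | 4 |
| lru_ttl | MiniMax M3 | yes | 16.5 s | 11,800 | $0.0067 | 5 |
| lru_ttl | Qwen3.7 Max | yes | 26.4 s | 8,213 | $0.0130 | 4 |
| stats_fix | MiniMax M3 | yes | 7.3 s | 9,288 | $0.0048 | 5 |
| stats_fix | Qwen3.7 Max | yes | 16.8 s | 6,102 | $0.0088 | 4 |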

Per worker (totals and averages)

| Metric | MiniMax M3 | Qwen3.7 Max |
|---|---|---|
| Solved | 4 / 4 | 4 / 4 |
| Avg latency | 13.9 s | 21.8 s |
| Total tokens | 45,437 | 29,376 |
| Total cost | $0.0249 | $0.0443 |
| Total tool calls | 22 | 18 |
| Blended $/Mtok | ~$0.55 | ~$1.51 |
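
The per-worker figures can be re-derived from the per-run rows, which is a useful sanity check on the table. The sketch below copies the eight rows as reported; the `summarize` helper is an illustrative name, not part of the harness. Since the per-run costs are rounded to four decimals, their sum can differ from the reported total in the last digit.

```python
# (task, worker, latency_s, tokens, cost_usd, tool_calls), copied from the per-run table.
RUNS = [
    ("csv_cli", "MiniMax M3", 19.4, 13_974, 0.0074, 7),
    ("csv_cli", "Qwen3.7 Max", 21.7, 8_273, 0.0120, 6),
    ("eventbus_feature", "MiniMax M3", 12.3, 10_375, 0.0059, 5),
    ("eventbus_feature", "Qwen3.7 Max", 22.5, 6_788, 0.0105, 4),
    ("lru_ttl", "MiniMax M3", 16.5, 11_800, 0.0067, 5),
    ("lru_ttl", "Qwen3.7 Max", 26.4, 8_213, 0.0130, 4),
    ("stats_fix", "MiniMax M3", 7.3, 9_288, 0.0048, 5),
    ("stats_fix", "Qwen3.7 Max", 16.8, 6_102, 0.0088, 4),
]


def summarize(worker: str) -> dict:
    """Aggregate one worker's runs into totals, averages, and a blended rate."""
    rows = [r for r in RUNS if r[1] == worker]
    tokens = sum(r[3] for r in rows)
    cost = sum(r[4] for r in rows)
    return {
        "avg_latency_s": sum(r[2] for r in rows) / len(rows),
        "total_tokens": tokens,
        "total_cost_usd": cost,
        "tool_calls": sum(r[5] for r in rows),
        # Blended rate: total dollars spent per million tokens of any kind.
        "blended_usd_per_mtok": cost / tokens * 1_000_000,
    }


for worker in ("MiniMax M3", "Qwen3.7 Max"):
    print(worker, summarize(worker))
```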

Four dimensions, four different stories

Correctness was a tie, and the tie is the least informative result. Both models solved all four tasks, so this run did not separate them on the dimension the public SWE-bench Verified numbers actually measure. The tasks were designed to be solvable, and both cleared them, which is a ceiling effect: to see a correctness gap, the benchmark needs harder or more failure-prone tasks. What the parity does confirm is that both models can drive the harness's autonomous agent_loop end to end, reading, editing, running tests, and fixing, without getting lost.

Speed favored MiniMax M3 consistently, not by luck of one task. It finished faster on all four, by margins ranging from 2.3 seconds on csv_cli to 10.2 seconds on eventbus_feature, averaging 13.9 seconds against 21.8. Because the advantage held across four independent tasks rather than resting on a single outlier, the direction is more trustworthy than a single A-vs-B timing would be, though it is still one sample per task and includes provider queueing and network time.
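
The per-task margins can be checked directly against the per-run latencies. A small sketch, with the latencies copied from the per-run table and `margins` as an illustrative helper name:

```python
# Latency in seconds per task, copied from the per-run table:
# (MiniMax M3, Qwen3.7 Max).
LATENCY_S = {
    "csv_cli": (19.4, 21.7),
    "eventbus_feature": (12.3, 22.5),
    "lru_ttl": (16.5, 26.4),
    "stats_fix": (7.3, 16.8),
}


def margins(latency: dict) -> dict:
    """Return {task: (seconds saved by MiniMax, Qwen/MiniMax latency ratio)}."""
    return {
        task: (round(qwen - minimax, 1), round(qwen / minimax, 2))
        for task, (minimax, qwen) in latency.items()
    }


for task, (saved, ratio) in margins(LATENCY_S).items():
    print(f"{task:18s} {saved:5.1f} s  {ratio:.2f}x")
```

On these figures the margin is smallest on csv_cli, and the per-task ratio varies considerably from task to task, which is one more reason to read the average as directional.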

Cost favored MiniMax M3 on every task, but the reason is mostly pricing. MiniMax used more tokens than Qwen on all four (45,437 against 29,376 in total) yet cost less ($0.0249 against $0.0443) because Fireworks' rates for it are 36% of Together's Qwen rates on input and 48% on output. Blended, MiniMax cost about $0.55 per million tokens to Qwen's $1.51. The honest reading is that this is a provider-and-price result as much as a model result: a token-for-token comparison would favor Qwen, which spent fewer of them.
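
One rough way to separate price from token usage is to reprice both workers' token totals at a single blended rate. The sketch below reuses the reported totals; `blended_rate` and `repriced` are illustrative names. It ignores that the two workers may have had different input/output mixes, so treat the result as a rough counterfactual rather than a measurement.

```python
# worker: (total_tokens, total_cost_usd), from the per-worker table.
TOTALS = {
    "MiniMax M3": (45_437, 0.0249),
    "Qwen3.7 Max": (29_376, 0.0443),
}


def blended_rate(tokens: int, cost_usd: float) -> float:
    """Observed USD per million tokens, input and output combined."""
    return cost_usd / tokens * 1_000_000


def repriced(tokens: int, usd_per_mtok: float) -> float:
    """What a token total would cost at a given blended rate."""
    return tokens * usd_per_mtok / 1_000_000


# Reprice both workers at MiniMax's observed blended rate.
minimax_rate = blended_rate(*TOTALS["MiniMax M3"])
for worker, (tokens, actual) in TOTALS.items():
    same_price = repriced(tokens, minimax_rate)
    print(f"{worker}: actual ${actual:.4f}, at MiniMax's blended rate ${same_price:.4f}")
```

At a common rate the cost gap reduces to the token gap, which is the sense in which a token-for-token comparison favors Qwen.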

Token and tool-call economy favored Qwen3.7 Max. It reached the same solutions with about 35% fewer tokens and one fewer tool call on every task (18 to 22 in total). That leaner trajectory is the dimension where the public "stronger on agentic and terminal" read shows up here: Qwen does more per turn and talks less. In this run it did not convert into faster or cheaper outcomes, but on tasks where context budget is tight or each tool round-trip is expensive, that economy is where Qwen would pull ahead.
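
The trajectory-economy claim can be broken out per task from the per-run table. A minimal sketch, with `savings` as an illustrative helper name:

```python
# (tokens, tool calls) per task, copied from the per-run table.
MINIMAX = {
    "csv_cli": (13_974, 7),
    "eventbus_feature": (10_375, 5),
    "lru_ttl": (11_800, 5),
    "stats_fix": (9_288, 5),
}
QWEN = {
    "csv_cli": (8_273, 6),
    "eventbus_feature": (6_788, 4),
    "lru_ttl": (8_213, 4),
    "stats_fix": (6_102, 4),
}


def savings(baseline: dict, other: dict) -> dict:
    """Per task: (percent fewer tokens, fewer tool calls) used by `other`."""
    out = {}
    for task, (base_tokens, base_calls) in baseline.items():
        other_tokens, other_calls = other[task]
        out[task] = (100 * (1 - other_tokens / base_tokens), base_calls - other_calls)
    return out


for task, (pct, calls) in savings(MINIMAX, QWEN).items():
    print(f"{task:18s} {pct:4.1f}% fewer tokens, {calls} fewer tool call(s)")
```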

Findings, with confidence

| Finding | Confidence |
|---|---|
| MiniMax M3 was faster on all four tasks (avg 13.9 s vs 21.8 s). | High |
| MiniMax M3 cost less on all four tasks, driven substantially by Fireworks vs Together pricing. | High |
| Qwen3.7 Max used fewer tokens and fewer tool calls on every task. | High |
| The speed advantage generalizes beyond this run. | Medium |
| The two models are equal on correctness for real work. | Low |
| Qwen's leaner trajectory means stronger agentic ability. | Low |

For Thinkwright's open-weight coding lane

YES: MiniMax M3 is the better default coding worker here

Equal correctness, faster on every task, and cheaper on every task. For a workflow that pushes implementation onto an open-weight worker under frontier review, lower latency and lower cost at equal pass rate is the combination that matters, and MiniMax M3 held all three.

MIXED: Qwen3.7 Max's case is real but narrow

Its fewer tokens and fewer tool calls are a genuine edge in trajectory economy, valuable when context is scarce or per-call cost dominates. It just did not translate into a faster or cheaper result on these tasks, and its higher per-token price worked against it.

NO: this run did not test the hard-correctness question

Both models passed every task, so the SWE-bench-style "which one is actually more correct on difficult fixes" question is untouched here. That separation needs harder tasks with real failure rates; treat the correctness parity as "both can drive the harness's autonomous agent_loop," not "both are equally capable."

What this benchmark is not

This is a small, single-run, in-house benchmark, directional rather than definitive. Four tasks, one run each, means no variance estimate, and latency in particular is a single sample that includes provider queueing. Every task was solved, so the result has a correctness ceiling and cannot rank the models on difficulty. The cost comparison is confounded by provider pricing rather than isolating model efficiency. The fixtures are small, self-contained Python problems run on Thinkwright's custom harness, not SWE-bench or Terminal-Bench, so the absolute numbers are not comparable to the published leaderboards; what they measure is relative behavior inside the loop these models actually run in for this stack. The real work I run through the harness is larger and mostly Rust, where compile times and longer edit-test-fix loops could reorder these results.
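
Since one run per task gives no variance estimate, a follow-up could repeat each task several times and report spread alongside the mean. A minimal sketch, assuming a hypothetical `time_run` callable that executes one task and returns its latency in seconds; nothing here reflects the harness's actual API.

```python
import statistics
from typing import Callable


def repeat_latency(time_run: Callable[[], float], repeats: int = 5) -> dict:
    """Run a task several times and summarize its latency distribution."""
    if repeats < 2:
        raise ValueError("need at least two runs to estimate spread")
    samples = [time_run() for _ in range(repeats)]
    return {
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),  # sample standard deviation
        "min_s": min(samples),
        "max_s": max(samples),
    }
```

With a spread attached to each mean, a margin between two workers can be judged against run-to-run noise such as provider queueing instead of being read off a single sample.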

Takeaways

1. Both open-weight models can drive the harness's autonomous agent_loop. Four of four each, reading, editing, running tests, and fixing without supervision. For the goal of leaning on open-weight workers for code generation, that baseline competence is the precondition, and it held.
2. MiniMax M3 is the stronger default on this evidence. Faster and cheaper on every task at equal pass rate. The cost edge leans on Fireworks pricing, and the latency figures include provider queueing and network time, so neither is a pure model-level result, but both favored MiniMax on every task.
3. Qwen3.7 Max's token thrift was offset by its price. It runs the leaner trajectory, fewer tokens and fewer calls, but its higher per-token rate erases that thrift in total cost here. Its advantage is real where context pressure or per-call latency is the binding constraint, not raw throughput.
4. The next benchmark needs harder tasks. A clean 8-for-8 means the bench did not test the correctness question the public numbers care about. Adding failure-prone tasks, multiple runs, and at least one Rust fixture would turn this directional read into something that can actually rank the two on capability.

Sources: Thinkwright custom harness agent_bench run, 2026-06-15 (4 tasks × minimax-m3, qwen-max via autonomous agent_loop; hidden-grader scoring). Worker pricing from the harness seed registry (MiniMax M3 on Fireworks $0.45/$1.80 per Mtok; Qwen3.7 Max on Together $1.25/$3.75 per Mtok). Public context: Artificial Analysis and Perplexity early reads on SWE-bench Verified, SWE-bench Pro, Terminal-Bench 2.0, and MCP-Atlas.