The public chatter on these two open-weight coders is that they are close: a near-tie on SWE-bench Verified, with Qwen3.7 Max reading slightly stronger on agentic and terminal work. To get a read on our own stack rather than someone else's leaderboard, we ran both models through Thinkwright's custom harness, where each drove the autonomous agent_loop (the same read/write/run-tests machinery a delegated coding worker uses) on four self-contained tasks, each graded by a hidden test the model never sees. Both models solved all four. The separation showed up not in correctness but in speed, cost, and how much work each spent getting there.
What the run showed
MiniMax M3 was faster on every task and cheaper on every task. Qwen3.7 Max reached the same answers with fewer tokens and fewer tool calls. The two findings sit together because they measure different things: Qwen ran a tighter trajectory, while MiniMax generated more but ran on faster, cheaper infrastructure. For a workflow that leans on open-weight models to do the coding under frontier review, the run favors MiniMax M3 as the default, with one honest qualifier on cost covered below.
Workers compared: MiniMax M3 (served on Fireworks) and Qwen3.7 Max (served on Together).
How the benchmark ran
Each model drove the harness's autonomous agent_loop: it reads files, writes files, and runs shell commands across many turns, then stops. Every task ran in a fresh isolated workspace seeded with starter code and a visible test_public.py. The grading test, grade.py, was copied in only after the model stopped, so the agent never saw the cases it was scored on. A run counts as solved only if that hidden grader exits clean. The four tasks were chosen for diversity across the skills the public benchmarks probe:
| Task | Type | What it exercises |
|---|---|---|
| lru_ttl | Implement from spec | An LRU cache with per-entry TTL: recency tracking, expiry, eviction edge cases. |
| stats_fix | Debug and fix | An off-by-one in a sliding-window function, located from a failing test (SWE-bench-style). |
| eventbus_feature | Feature add | A fire-once subscription added to a working pub/sub bus without breaking existing behavior. |
| csv_cli | Terminal tool | A CSV group-and-sum CLI, graded by running the program (terminal-bench-style). |
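The hidden-grader flow described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `run_task` function, the `agent` callable, and the assumption that each task directory holds the starter files alongside `grade.py` are all hypothetical, and "exits clean" is read here as exit status 0.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def run_task(task_dir: Path, agent: Callable[[Path], None]) -> bool:
    """Run one task in a fresh workspace and score it with the hidden grader.

    Hypothetical layout: task_dir contains the starter code, test_public.py,
    and grade.py. The agent is any callable that edits files in the workspace.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "ws"
        # Seed the workspace with everything except the grader.
        shutil.copytree(
            task_dir, workspace, ignore=shutil.ignore_patterns("grade.py")
        )
        # The agent reads, writes, and runs commands until it decides to stop.
        agent(workspace)
        # Only after the agent has stopped is the grader copied in.
        shutil.copy(task_dir / "grade.py", workspace / "grade.py")
        result = subprocess.run(
            [sys.executable, "grade.py"], cwd=workspace, capture_output=True
        )
        # Assumption: a clean exit means status code 0.
        return result.returncode == 0
```

Keeping the grader out of the workspace until the agent has stopped is what makes the score a measure of the solution rather than of test-fitting: the agent can only iterate against `test_public.py`.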
Five numbers were captured per run: whether the hidden grader passed, wall-clock latency, total tokens, cost priced from each provider's published rate, and the number of tool calls the agent made. The two workers carry different prices, which matters for reading the cost result: MiniMax M3 bills $0.45 per million input tokens and $1.80 per million output on Fireworks; Qwen3.7 Max bills $1.25 and $3.75 on Together, about 2.8 times the input rate and 2.1 times the output rate.
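As a small sketch of how a per-run cost follows from these rates, assuming cost is simply input and output tokens billed at the two published prices (the `run_cost` helper and the token split in the example are hypothetical; the write-up reports only total tokens per run):

```python
# Published per-million-token rates quoted in the text (USD).
RATES = {
    "MiniMax M3 (Fireworks)": {"input": 0.45, "output": 1.80},
    "Qwen3.7 Max (Together)": {"input": 1.25, "output": 3.75},
}


def run_cost(input_tokens: int, output_tokens: int, rates: dict) -> float:
    """Dollar cost of one run at per-million-token rates."""
    billed = input_tokens * rates["input"] + output_tokens * rates["output"]
    return billed / 1_000_000


mini, qwen = RATES.values()
input_ratio = qwen["input"] / mini["input"]      # about 2.78
output_ratio = qwen["output"] / mini["output"]   # about 2.08

# Hypothetical split, for illustration only: 10,000 input and 1,000 output tokens.
for name, rates in RATES.items():
    print(f"{name}: ${run_cost(10_000, 1_000, rates):.4f}")
```

Because agent runs are dominated by input tokens (the growing context is re-sent each turn), the effective price gap sits closer to the input ratio than to the output ratio.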
Per-run and per-worker
| Task | Worker | Solved | Latency | Tokens | Cost | Calls |
|---|---|---|---|---|---|---|
| csv_cli | MiniMax M3 | ✓ | 19.4s | 13,974 | $0.0074 | 7 |
| csv_cli | Qwen3.7 Max | ✓ | 21.7s | 8,273 | $0.0120 | 6 |
| eventbus_feature | MiniMax M3 | ✓ | 12.3s | 10,375 | $0.0059 | 5 |
| eventbus_feature | Qwen3.7 Max | ✓ | 22.5s | 6,788 | $0.0105 | 4 |
| lru_ttl | MiniMax M3 | ✓ | 16.5s | 11,800 | $0.0067 | 5 |
| lru_ttl | Qwen3.7 Max | ✓ | 26.4s | 8,213 | $0.0130 | 4 |
| stats_fix | MiniMax M3 | ✓ | 7.3s | 9,288 | $0.0048 | 5 |
| stats_fix | Qwen3.7 Max | ✓ | 16.8s | 6,102 | $0.0088 | 4 |
| Metric | MiniMax M3 | Qwen3.7 Max |
|---|---|---|
| Solved | 4 / 4 | 4 / 4 |
| Avg latency | 13.9 s | 21.8 s |
| Total tokens | 45,437 | 29,376 |
| Total cost | $0.0249 | $0.0443 |
| Total tool calls | 22 | 18 |
| Blended $/Mtok | ~$0.55 | ~$1.51 |
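The per-worker rows can be re-derived from the per-run table. The sketch below does that with the figures as printed; the `summarize` helper is illustrative rather than part of the harness.

```python
from collections import defaultdict

# (task, worker, latency_s, tokens, cost_usd, tool_calls), copied from the per-run table.
RUNS = [
    ("csv_cli", "MiniMax M3", 19.4, 13_974, 0.0074, 7),
    ("csv_cli", "Qwen3.7 Max", 21.7, 8_273, 0.0120, 6),
    ("eventbus_feature", "MiniMax M3", 12.3, 10_375, 0.0059, 5),
    ("eventbus_feature", "Qwen3.7 Max", 22.5, 6_788, 0.0105, 4),
    ("lru_ttl", "MiniMax M3", 16.5, 11_800, 0.0067, 5),
    ("lru_ttl", "Qwen3.7 Max", 26.4, 8_213, 0.0130, 4),
    ("stats_fix", "MiniMax M3", 7.3, 9_288, 0.0048, 5),
    ("stats_fix", "Qwen3.7 Max", 16.8, 6_102, 0.0088, 4),
]


def summarize(runs):
    """Aggregate per-run rows into per-worker totals and averages."""
    totals = defaultdict(
        lambda: {"n": 0, "latency": 0.0, "tokens": 0, "cost": 0.0, "calls": 0}
    )
    for _task, worker, latency, tokens, cost, calls in runs:
        t = totals[worker]
        t["n"] += 1
        t["latency"] += latency
        t["tokens"] += tokens
        t["cost"] += cost
        t["calls"] += calls
    return {
        worker: {
            "avg_latency_s": t["latency"] / t["n"],
            "total_tokens": t["tokens"],
            "total_cost_usd": t["cost"],
            "total_calls": t["calls"],
            # Blended rate: dollars spent per million tokens of any kind.
            "blended_usd_per_mtok": t["cost"] / t["tokens"] * 1_000_000,
        }
        for worker, t in totals.items()
    }


for worker, row in summarize(RUNS).items():
    print(worker, row)
```

The per-run costs are rounded to four decimals, so summing them gives $0.0248 for MiniMax M3 rather than the $0.0249 in the table; the difference is rounding, and the blended rates still land at roughly $0.55 and $1.51 per million tokens.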
Four dimensions, four different stories
Correctness was a tie, and the tie is the least informative result. Both models solved all four tasks, so this run did not separate them on the dimension the public SWE-bench Verified numbers actually measure. The tasks were designed to be solvable, and both cleared them, which is a ceiling effect: to see a correctness gap, the benchmark needs harder or more failure-prone tasks. What the parity does confirm is that both models can drive the harness's autonomous agent_loop end to end, reading, editing, running tests, and fixing, without getting lost.
Speed favored MiniMax M3 consistently, not by luck of one task. It finished faster on all four, by margins ranging from 2.3 seconds on csv_cli to 10.2 seconds on eventbus_feature, averaging 13.9 seconds against 21.8. Because the advantage held across four independent tasks rather than resting on a single outlier, the direction is more trustworthy than a single A-vs-B timing would be, though each task is still a single sample that includes provider queueing and network time.
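To put a number on "consistent but still a small sample", the per-task margins can be computed from the table and run through a two-sided sign test. The sign test is added here for illustration and was not part of the original run; it treats each task as a coin flip under the null hypothesis that neither model is faster.

```python
from math import comb

# (MiniMax M3, Qwen3.7 Max) latency in seconds, copied from the per-run table.
LATENCY_S = {
    "csv_cli": (19.4, 21.7),
    "eventbus_feature": (12.3, 22.5),
    "lru_ttl": (16.5, 26.4),
    "stats_fix": (7.3, 16.8),
}

# Positive margin means MiniMax M3 finished first.
margins = {
    task: round(qwen - mini, 1) for task, (mini, qwen) in LATENCY_S.items()
}


def sign_test_two_sided(wins: int, n: int) -> float:
    """Chance of a split at least this lopsided if each task were a fair coin flip."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)


wins = sum(margin > 0 for margin in margins.values())
p_value = sign_test_two_sided(wins, len(margins))

print(margins)   # margins of 2.3, 10.2, 9.9, and 9.5 seconds
print(p_value)   # 0.125 for a 4-of-4 sweep
```

A clean sweep of four tasks gives p = 0.125, which is why the direction is credible but the sample cannot by itself carry a strong statistical claim.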
Cost favored MiniMax M3 on every task, but the reason is mostly pricing. MiniMax used more tokens than Qwen on all four (45,437 against 29,376 in total) yet cost less ($0.0249 against $0.0443) because Fireworks charges about 36% of Together's input rate and 48% of its output rate. Blended, MiniMax cost about $0.55 per million tokens to Qwen's $1.51. The honest reading is that this is a provider-and-price result as much as a model result: at equal per-token prices the comparison would favor Qwen, which spent fewer tokens.
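One way to separate price from model behavior is to estimate Qwen's input/output split from its totals and reprice that run at the Fireworks rates. This is a rough sketch under two assumptions that the source does not confirm: that the reported cost is exactly input tokens plus output tokens at the published rates (no cached-token discounts), and that the rounded totals are precise enough to back the split out. The `split_tokens` helper is hypothetical.

```python
def split_tokens(total_tokens, total_cost_usd, in_rate, out_rate):
    """Solve  inp + out = total  and  inp*in_rate + out*out_rate = cost * 1e6."""
    out = (total_cost_usd * 1_000_000 - total_tokens * in_rate) / (out_rate - in_rate)
    return total_tokens - out, out


FIREWORKS = (0.45, 1.80)   # MiniMax M3 rates, USD per million tokens
TOGETHER = (1.25, 3.75)    # Qwen3.7 Max rates, USD per million tokens

# Estimated split of Qwen's 29,376 tokens, inferred from its $0.0443 total.
qwen_in, qwen_out = split_tokens(29_376, 0.0443, *TOGETHER)

# What the same Qwen trajectory would cost at MiniMax's Fireworks prices.
qwen_at_fireworks = (qwen_in * FIREWORKS[0] + qwen_out * FIREWORKS[1]) / 1_000_000

print(round(qwen_in), round(qwen_out))   # about 26344 input, 3032 output
print(f"${qwen_at_fireworks:.4f}")       # about $0.0173, against MiniMax's $0.0249
```

Under those assumptions, Qwen's leaner trajectory would have been the cheaper one at equal prices, which is the sense in which the cost result reflects the provider as much as the model.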
Token and tool-call economy favored Qwen3.7 Max. It reached the same solutions with about 35% fewer tokens and one fewer tool call on every task (18 to 22 in total). That leaner trajectory is the dimension where the public "stronger on agentic and terminal" read shows up here: Qwen does more per turn and talks less. In this run it did not convert into faster or cheaper outcomes, but on tasks where context budget is tight or each tool round-trip is expensive, that economy is where Qwen would pull ahead.
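The economy gap can be broken out per task from the per-run table. A short sketch (the `fraction_saved` helper is illustrative):

```python
# (MiniMax M3, Qwen3.7 Max) per task, copied from the per-run table.
TOKENS = {
    "csv_cli": (13_974, 8_273),
    "eventbus_feature": (10_375, 6_788),
    "lru_ttl": (11_800, 8_213),
    "stats_fix": (9_288, 6_102),
}
CALLS = {
    "csv_cli": (7, 6),
    "eventbus_feature": (5, 4),
    "lru_ttl": (5, 4),
    "stats_fix": (5, 4),
}


def fraction_saved(mini: int, qwen: int) -> float:
    """Share of MiniMax's usage that Qwen avoided on the same task."""
    return 1 - qwen / mini


per_task_saved = {task: fraction_saved(*pair) for task, pair in TOKENS.items()}
overall_saved = fraction_saved(
    sum(mini for mini, _ in TOKENS.values()),
    sum(qwen for _, qwen in TOKENS.values()),
)
call_gap = {task: mini - qwen for task, (mini, qwen) in CALLS.items()}

for task, saved in per_task_saved.items():
    print(f"{task}: {saved:.1%} fewer tokens, {call_gap[task]} fewer tool call(s)")
print(f"overall: {overall_saved:.1%} fewer tokens")
```

The token saving is not uniform: it is largest on csv_cli and smallest on lru_ttl, while the tool-call gap is exactly one on every task.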
| Finding | Confidence |
|---|---|
| MiniMax M3 was faster on all four tasks (avg 13.9s vs 21.8s). | High |
| MiniMax M3 cost less on all four tasks, driven substantially by Fireworks vs Together pricing. | High |
| Qwen3.7 Max used fewer tokens and fewer tool calls on every task. | High |
| The speed advantage generalizes beyond this run. | Medium |
| The two models are equal on correctness for real work. | Low |
| Qwen's leaner trajectory means stronger agentic ability. | Low |
For Thinkwright's open-weight coding lane
MiniMax M3 is the default: equal correctness, faster on every task, and cheaper on every task. For a workflow that pushes implementation onto an open-weight worker under frontier review, lower latency and lower cost at an equal pass rate is the combination that matters, and MiniMax M3 held all three.
Qwen3.7 Max's fewer tokens and fewer tool calls are a genuine edge in trajectory economy, valuable when context is scarce or per-call cost dominates. That edge just did not translate into a faster or cheaper result on these tasks, and Qwen's higher per-token price worked against it.
Both models passed every task, so the SWE-bench-style "which one is actually more correct on difficult fixes" question is untouched here. That separation needs harder tasks with real failure rates; treat the correctness parity as "both can drive the harness's autonomous agent_loop," not "both are equally capable."
What this benchmark is not
This is a small, single-run, in-house benchmark, directional rather than definitive. Four tasks, one run each, means no variance estimate, and latency in particular is a single sample that includes provider queueing. Every task was solved, so the result has a correctness ceiling and cannot rank the models on difficulty. The cost comparison is confounded by provider pricing rather than isolating model efficiency. The fixtures are small, self-contained Python problems run on Thinkwright's custom harness, not SWE-bench or Terminal-Bench, so the absolute numbers are not comparable to the published leaderboards; what they measure is relative behavior inside the loop these models actually run in for this stack. The real work I run through the harness is larger and mostly Rust, where compile times and longer edit-test-fix loops could reorder these results.
Takeaways
Both models can drive the harness's autonomous agent_loop. Four of four each, reading, editing, running tests, and fixing without supervision. For the goal of leaning on open-weight workers for code generation, that baseline competence is the precondition, and it held.
Source: agent_bench run, 2026-06-15 (4 tasks × minimax-m3, qwen-max via autonomous agent_loop; hidden-grader scoring). Worker pricing from the harness seed registry (MiniMax M3 on Fireworks $0.45/$1.80 per Mtok; Qwen3.7 Max on Together $1.25/$3.75 per Mtok). Public context: Artificial Analysis and Perplexity early reads on SWE-bench Verified, SWE-bench Pro, Terminal-Bench 2.0, and MCP-Atlas.