Thinkbench, our custom evaluation harness, drove both models through the same autonomous coding loop: read files, write files, run shell commands, and stop when the task was complete. The scored suite covered greenfield builds, bug fixes, feature additions, and repair-to-green tasks. Hidden graders ran only after each model stopped, using behavior checks with a fixed denominator. We also ran a separate ungraded track on how the models handle ambiguous instructions, with briefs for systems such as audit logs, schedulers, feature flags, and notification hubs. For those tasks, we tracked implementation choices, API shape, scope control, and failure semantics.
GLM was steadier. MiniMax was cheaper and faster.
Across the 60 scored tasks (three trials each, so 180 scored runs per model), GLM 5.2 finished with the stronger correctness profile: 92% full-pass and a 0.976 mean score. MiniMax M3 finished at 84% full-pass and a 0.961 mean score. The gap was real, but narrow enough that price and latency matter. MiniMax cost $6.67 for the scored runs against GLM's $18.47, and averaged 45 seconds per run against GLM's 80 seconds.
| Model | Full-pass | Mean score | Avg latency | Avg tokens | Total cost |
|---|---|---|---|---|---|
| GLM 5.2 | 165 / 180 (92%) | 0.976 | 80s | 82,443 | $18.47 |
| MiniMax M3 | 152 / 180 (84%) | 0.961 | 45s | 135,060 | $6.67 |
Both models could code. What seemed to matter more was where they started to need review: package shape, edge cases, API design, or judgment calls in a loose brief. On existing-code work, the models were almost indistinguishable: bug fixes, feature additions, and repair-to-green tasks all landed at 0.999 to 1.000 mean score. The hard part was building from an empty repo.
The task-level gaps were specific, not global.
On 54 of the 60 scored tasks, the two models' mean scores landed within 0.1 of each other. The six larger gaps were all greenfield builds. That matters. It means the benchmark did not show one model broadly failing and the other broadly succeeding. It showed a narrow difference in from-scratch packaging, API design, and edge-case discipline.
Where GLM looked stronger
ticketflow was the largest gap: GLM went 1.00 while MiniMax averaged 0.33. The issue was not that MiniMax could not reason through the assignment. Two trials used a package layout the grader could not import from the workspace root, so the hidden checks never reached the logic. GLM delivered the package shape consistently.
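To make the packaging failure concrete, here is a minimal sketch of an importability check from the workspace root. It is not the Thinkbench grader. The nested `src/` layout is an assumption chosen for illustration (the failing trials' actual layout is not recorded here), and `importable_from_root` and `make_workspace` are hypothetical helpers.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def importable_from_root(package: str, workspace_root: Path) -> bool:
    """Return True if `import <package>` would resolve with only the workspace root on the path."""
    spec = PathFinder.find_spec(package, [str(workspace_root)])
    return spec is not None


def make_workspace(nested_under_src: bool) -> Path:
    """Create a throwaway workspace holding a package in one of two hypothetical layouts."""
    root = Path(tempfile.mkdtemp())
    parent = root / "src" if nested_under_src else root
    package_dir = parent / "ticketflow"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    return root


# Flat layout: <root>/ticketflow/__init__.py is visible from the root.
flat_ok = importable_from_root("ticketflow", make_workspace(nested_under_src=False))
# Nested layout (assumed example): <root>/src/ticketflow/__init__.py is not.
nested_ok = importable_from_root("ticketflow", make_workspace(nested_under_src=True))
print(flat_ok, nested_ok)
```

A grader that imports from the root never reaches the logic in the second layout, no matter how correct that logic is.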
Where MiniMax pushed back
patchwise went the other way: MiniMax scored 1.00 while GLM averaged 0.62. GLM delivered code, but its failures were real implementation bugs: a name typo in one trial and trailing-newline diff handling in two others. MiniMax handled that fixture cleanly.
microapi was the second clear GLM win. MiniMax built a plausible framework, but missed a short-circuit middleware return in one variant and produced a bad route regex in another. migrato was a MiniMax win: it cleared the migration-contract checks where GLM lost points. The pattern I take from this: GLM was steadier at packaging and complete delivery; MiniMax could still beat it on individual hard builds.
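For readers who have not hit the short-circuit case before, this is a small sketch of what the behavior looks like in a middleware chain. It is not the microapi task code or either model's output; `compose`, `wrap`, `require_token`, and the dict-based request and response shapes are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]
Response = Dict[str, object]
Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]


def wrap(middleware: Middleware, next_handler: Handler) -> Handler:
    """Bind one middleware to the handler that follows it."""
    def handler(request: Request) -> Response:
        return middleware(request, next_handler)
    return handler


def compose(middlewares: List[Middleware], endpoint: Handler) -> Handler:
    """Build a single handler; the first middleware in the list runs first."""
    handler = endpoint
    for middleware in reversed(middlewares):
        handler = wrap(middleware, handler)
    return handler


def require_token(request: Request, call_next: Handler) -> Response:
    """Hypothetical auth middleware."""
    if request.get("token") != "secret":
        # Short-circuit: return a response without calling call_next,
        # so the endpoint and any later middleware never run.
        return {"status": 401, "body": "unauthorized"}
    return call_next(request)


endpoint_calls: List[str] = []


def endpoint(request: Request) -> Response:
    endpoint_calls.append(request["path"])
    return {"status": 200, "body": "ok"}


app = compose([require_token], endpoint)
denied = app({"path": "/items"})
allowed = app({"path": "/items", "token": "secret"})
print(denied["status"], allowed["status"], endpoint_calls)
```

The bug class is a framework that ignores the early return and calls the endpoint anyway, which in this sketch would show up as two entries in `endpoint_calls` instead of one.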
When the brief was vague, MiniMax filled in more system.
The ungraded tasks were different. They were deliberately vague, and I didn't score them because there was no single correct implementation. The point was to compare what each model decided to build when the prompt left room. In that phase, MiniMax M3 consistently added more production-shaped machinery. GLM 5.2 more often stayed closer to the plain reading of the brief.
On the audit-log brief, MiniMax added hash-chain verification, a query builder, an action decorator, and file permission hardening. GLM stayed flatter: hash chain, boolean verification, and direct query filters.
On the notification brief, MiniMax built a notifier with priority fallback and a hard failure when no channel worked. GLM collected per-channel send results and returned a report.
On the validation brief, MiniMax built conditional validation with a when(...) combinator. GLM explicitly declined that extra scope and kept the validator simpler.
On the scheduler brief, MiniMax chose a fixed one-second background tick. GLM used a heap and slept until the next scheduled event, which was more idle-efficient.
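The idle-efficiency point is easier to see in code. Below is a minimal sketch of the heap-and-sleep approach, not GLM's actual output. `HeapScheduler`, `run_until_empty`, and `FakeClock` are hypothetical names, and the clock and sleep functions are injected so the example runs instantly instead of waiting in real time.

```python
import heapq
import itertools
from typing import Callable, List, Optional, Tuple


class HeapScheduler:
    """Keeps pending events in a min-heap keyed by their run time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Callable[[], None]]] = []
        self._counter = itertools.count()  # tie-breaker so callbacks are never compared

    def schedule(self, run_at: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._heap, (run_at, next(self._counter), callback))

    def seconds_until_next(self, now: float) -> Optional[float]:
        """Time to sleep before the earliest event, or None if nothing is pending."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - now)

    def run_due(self, now: float) -> int:
        """Run every event whose time has arrived; return how many ran."""
        ran = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()
            ran += 1
        return ran


def run_until_empty(scheduler: HeapScheduler,
                    clock: Callable[[], float],
                    sleep: Callable[[float], None]) -> int:
    """Sleep exactly until the next event each time; return the number of wakeups."""
    wakeups = 0
    while True:
        delay = scheduler.seconds_until_next(clock())
        if delay is None:
            return wakeups
        sleep(delay)
        wakeups += 1
        scheduler.run_due(clock())


class FakeClock:
    """Test clock: sleeping just advances the time."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
scheduler = HeapScheduler()
fired: List[str] = []
scheduler.schedule(30.0, lambda: fired.append("b"))
scheduler.schedule(5.0, lambda: fired.append("a"))
wakeups = run_until_empty(scheduler, clock.time, clock.sleep)
print(fired, wakeups)
```

With events at 5 and 30 seconds, this loop wakes twice. A fixed one-second tick covering the same window would wake about thirty times, which is the trade-off: the tick is simpler to reason about, the heap does less work while idle.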
I'm comfortable calling MiniMax the more eager model in this set because that claim is backed by the artifacts, not by vibe. It repeatedly reached for locks, persistence, policy objects, fallback paths, decorators, and extensible strategy shapes. That can be useful. It can also be too much. GLM's restraint is not automatically better either; sometimes it missed a useful abstraction that MiniMax supplied.
| Task | Open decision | MiniMax M3 tendency | GLM 5.2 tendency |
|---|---|---|---|
| jobflow | Dependency model | Object references, DFS cycle detection, locks, CLI | String references, Kahn topo-sort, bare module entry |
| ratelimit | Algorithm extras | Token bucket plus variable request cost and reset | Token bucket with one-token consumption |
| featureflags | Targeting architecture | Nested strategy hierarchy | Inline rules inside one evaluator |
| docsearch | Indexing shape | TF-IDF with disk-backed JSON CLI | TF-IDF with lazy posting-list cache |
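As an illustration of the ratelimit row, here is a sketch of a token bucket where the extras are visible as optional surface area. It is not either model's code. The class name, the `cost` parameter, the `reset` method, and the explicit `now` timestamps are assumptions for the example; passing time in explicitly keeps it deterministic.

```python
class TokenBucket:
    """Token bucket with explicit timestamps so the behavior is reproducible."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full
        self._last = now

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self._last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self._last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Plain reading: cost stays 1. Extended reading: callers pass a variable cost."""
        self._refill(now)
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def reset(self, now: float) -> None:
        """Extra: restore the bucket to full capacity."""
        self.tokens = self.capacity
        self._last = now


bucket = TokenBucket(capacity=5, refill_per_second=1.0)
first = bucket.allow(now=0.0)            # one-token request
heavy = bucket.allow(now=0.0, cost=4)    # variable-cost request drains the rest
rejected = bucket.allow(now=0.0)         # bucket is empty
later = bucket.allow(now=2.0, cost=2)    # two seconds of refill covers a cost of 2
bucket.reset(now=2.0)
after_reset = bucket.allow(now=2.0, cost=5)
print(first, heavy, rejected, later, after_reset)
```

Dropping the `cost` argument and `reset` gives the one-token version. Neither shape is wrong for a brief that only says "rate limiter"; the difference is how much API the caller now has to understand.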
Enough method to trust the comparison.
Each model got the same task brief, and for non-greenfield tasks the same starter code. Each worked through a read/write/run loop, then stopped. Only after that did the hidden grader get copied into the workspace. A scored run used a fixed denominator, so an unimportable package scored zero instead of quietly shrinking the number of checks.
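A minimal sketch of the fixed-denominator idea, under stated assumptions: this is not the Thinkbench grader, `score_run` is a hypothetical function, and the check names are invented for the example. The only point is that the denominator is the list of expected checks, not the list of checks that happened to run.

```python
from statistics import mean
from typing import Iterable, Sequence


def score_run(expected_checks: Sequence[str], passed_checks: Iterable[str]) -> float:
    """Score one run against a fixed list of expected checks.

    Checks that never ran (for example, because the package could not be
    imported) simply do not appear in passed_checks, so they count as failures.
    """
    if not expected_checks:
        raise ValueError("a graded task needs at least one expected check")
    expected = set(expected_checks)
    passed = expected & set(passed_checks)
    return len(passed) / len(expected)


# Hypothetical check names for illustration only.
EXPECTED = ["create_ticket", "assign_ticket", "close_ticket", "list_open"]

clean_trial = score_run(EXPECTED, EXPECTED)                            # everything passed
partial_trial = score_run(EXPECTED, ["create_ticket", "list_open"])    # half passed
unimportable_trial = score_run(EXPECTED, [])                           # nothing could run

# One clean trial plus two unimportable trials averages to about a third.
task_mean = mean([clean_trial, unimportable_trial, unimportable_trial])
print(clean_trial, partial_trial, unimportable_trial, round(task_mean, 2))
```

A shrinking denominator would have divided by the number of checks that ran, which either errors out on an unimportable package or, worse, silently reports a high score on a handful of surviving checks.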
| Parameter | MiniMax M3 | GLM 5.2 |
|---|---|---|
| Provider | Fireworks AI | Fireworks AI |
| Serving path | Serverless endpoint | Serverless endpoint |
| API shape | OpenAI-compatible chat completions | OpenAI-compatible chat completions |
| Endpoint | https://api.fireworks.ai/inference/v1 | https://api.fireworks.ai/inference/v1 |
| Model ID | accounts/fireworks/models/minimax-m3 | accounts/fireworks/models/glm-5p2 |
| Service tier | priority | priority |
| Thinking mode | none | none |
| Trials per task | 3 | 3 |
| Input price | $0.45 / Mtok | $2.10 / Mtok |
| Cached input price | $0.09 / Mtok | $0.39 / Mtok |
| Output price | $1.80 / Mtok | $6.60 / Mtok |
| Cached input share | 73.36% | 73.06% |
| Avg scored run latency | 45.0s | 79.6s |
Those client settings matter for interpreting the cost numbers. Cost was cache-aware because autonomous coding loops resend conversation context; cached input was priced separately from uncached input and from output. The total production spend for building the benchmark and results was $48.88 across 605 metered runs. The scored results table near the top is narrower: it reports only the final scored runs.
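The cache-aware arithmetic can be sketched as follows. The per-million-token prices come from the parameter table; the token counts are hypothetical round numbers chosen only to show the calculation, and `Pricing` and `run_cost` are illustrative names, not the harness's actual accounting code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in dollars per million tokens."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


# Prices taken from the parameter table.
MINIMAX_M3 = Pricing(input_per_mtok=0.45, cached_input_per_mtok=0.09, output_per_mtok=1.80)
GLM_5P2 = Pricing(input_per_mtok=2.10, cached_input_per_mtok=0.39, output_per_mtok=6.60)


def run_cost(pricing: Pricing,
             uncached_input_tokens: int,
             cached_input_tokens: int,
             output_tokens: int) -> float:
    """Dollar cost of one run, pricing the three token classes separately."""
    total = (
        uncached_input_tokens * pricing.input_per_mtok
        + cached_input_tokens * pricing.cached_input_per_mtok
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


# Hypothetical run: 100,000 input tokens with 73% served from cache, 5,000 output tokens.
usage = dict(uncached_input_tokens=27_000, cached_input_tokens=73_000, output_tokens=5_000)

minimax_cost = run_cost(MINIMAX_M3, **usage)
glm_cost = run_cost(GLM_5P2, **usage)
# Same usage with every input token billed at the uncached rate, for contrast.
minimax_uncached = run_cost(MINIMAX_M3, 100_000, 0, 5_000)

print(f"MiniMax M3: ${minimax_cost:.4f}  GLM 5.2: ${glm_cost:.4f}")
print(f"MiniMax M3 without cache pricing: ${minimax_uncached:.4f}")
```

Identical token counts are a simplification: in the scored runs the two models used different numbers of tokens per run, so the real cost ratio depends on usage as well as price.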
Where I'd use each model.
GLM 5.2 is the safer pick when the task is a hard from-scratch build and package delivery matters. Its greenfield edge was the clearest difference in the scored set. It cost more and took longer, but it produced more full-pass runs where the task started from nothing.
MiniMax M3 is the value pick for a lot of worker traffic. It was much cheaper, faster, and effectively tied on existing-code tasks. If the work is a bug fix, feature addition, or repair-to-green loop under review, MiniMax looks strong enough to be the default worker.
I wouldn't make either one the top-level coordinator by default. The best shape is still a frontier coordinator or judge above them: GPT-5.5 or Claude Opus deciding what to delegate, checking the finished work, and rerunning narrow pieces when the answer looks wrong. These models make the worker layer much more serious, not the coordinator layer unnecessary.