MiniMax M3 vs GLM 5.2


Thinkbench, our custom evaluation harness, was used to drive both models through the same autonomous coding loop: read files, write files, run shell commands, and stop when the task was complete. The scored suite covered greenfield builds, bug fixes, feature additions, and repair-to-green tasks. Hidden graders ran after each model stopped, using fixed-denominator behavior checks. We also included a separate ungraded dimension covering how the models handle ambiguously defined instructions, with briefs for systems such as audit logs, schedulers, feature flags, and notification hubs. For those tasks, we tracked implementation choices, API shape, scope control, and failure semantics.
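For readers who want a concrete picture of the loop, here is a minimal sketch of a read/write/run/stop driver. It is not Thinkbench's code: the action dictionary format, the `run_loop` name, and the scripted stand-in for a model are all assumptions made for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def run_loop(next_action, workspace, max_steps=50):
    """Drive a policy through read/write/run actions until it says 'stop'.

    `next_action` is a hypothetical callable standing in for the model: it
    receives the last observation and returns a dict describing one action.
    """
    observation = "start"
    transcript = []
    for _ in range(max_steps):
        action = next_action(observation)
        kind = action["kind"]
        if kind == "stop":
            transcript.append(("stop", ""))
            break
        if kind == "read":
            observation = (workspace / action["path"]).read_text()
        elif kind == "write":
            target = workspace / action["path"]
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(action["content"])
            observation = f"wrote {action['path']}"
        elif kind == "run":
            # Shell commands run inside the workspace; output is fed back.
            proc = subprocess.run(action["argv"], cwd=workspace,
                                  capture_output=True, text=True, timeout=60)
            observation = proc.stdout + proc.stderr
        else:
            observation = f"unknown action: {kind}"
        transcript.append((kind, observation))
    return transcript


def demo():
    """Replay a fixed script instead of calling a real model."""
    script = iter([
        {"kind": "write", "path": "pkg/notes.txt", "content": "hi"},
        {"kind": "read", "path": "pkg/notes.txt"},
        {"kind": "stop"},
    ])
    with tempfile.TemporaryDirectory() as tmp:
        return run_loop(lambda _observation: next(script), Path(tmp))
```

The point of the shape is that grading happens outside this function: the loop only ends when the policy stops or the step budget runs out.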

72 total tasks

60 hidden-graded

12 observed only

432 final rows

1. Results

GLM was steadier. MiniMax was cheaper and faster.

Across the 60 scored tasks, GLM 5.2 finished with the stronger correctness profile: 92% full-pass and a 0.976 mean score. MiniMax M3 finished at 84% full-pass and a 0.961 mean score. Because the separation was modest, cost and latency will probably influence which model makes sense. MiniMax cost $6.67 for the scored runs against GLM's $18.47, and averaged 45 seconds per run against GLM's 80 seconds.

Overall scored result

60 tasks, 3 trials per model; costs are cache-aware final scored rows.

Reading it: GLM has the better correctness numbers; MiniMax has the lower operating cost and lower latency.

Overall scored rows

Model	Full-pass	Mean score	Avg latency	Avg tokens	Total cost
GLM 5.2	165 / 180 (92%)	0.976	80s	82,443	$18.47
MiniMax M3	152 / 180 (84%)	0.961	45s	135,060	$6.67
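The table supports a couple of derived figures that are not reported directly. The sketch below recomputes the full-pass rates and adds cost per run and cost per full-pass run; the `derive` helper and the field names are mine, and the inputs are copied from the table.

```python
# Figures copied from the "Overall scored rows" table.
ROWS = {
    "GLM 5.2": {"full_pass": 165, "runs": 180, "cost_usd": 18.47},
    "MiniMax M3": {"full_pass": 152, "runs": 180, "cost_usd": 6.67},
}


def derive(row):
    """Return rate and unit-cost figures for one table row."""
    return {
        "full_pass_rate": row["full_pass"] / row["runs"],
        "cost_per_run": row["cost_usd"] / row["runs"],
        "cost_per_full_pass": row["cost_usd"] / row["full_pass"],
    }


def summary():
    return {model: derive(row) for model, row in ROWS.items()}
```

On these numbers, a full-pass run works out to roughly $0.112 for GLM and $0.044 for MiniMax, so MiniMax stays about 2.5x cheaper even after adjusting for its lower pass rate.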

Both models could code. What seemed to matter more was where they started to need review: package shape, edge cases, API design, or judgment calls in a loose brief. On existing-code work, the models were almost indistinguishable: bug fixes, feature additions, and repair-to-green tasks all landed at 0.999 to 1.000 mean score. The hard part was building from an empty repo.

Full-pass rate by task type

The greenfield lane is where the benchmark actually separated them.

Reading it: GLM's edge is concentrated in the greenfield (implement) tasks. Both models effectively saturated the modification tasks.

2. Separation

Differences were concentrated in greenfield builds.

Per task, 54 of 60 scored tasks landed within 0.1 mean score of each other. Greenfield builds accounted for all six larger gaps. In other words, the benchmark did not show one model broadly failing and the other broadly succeeding. It showed a narrow difference in from-scratch packaging, API design, and edge-case discipline.

Largest task gaps

Mean score over three trials; only the largest gaps are shown.

Reading it: GLM's biggest win was ticketflow; MiniMax's clearest wins were patchwise and migrato.

Where GLM looked stronger

ticketflow was the largest gap: GLM went 1.00 while MiniMax averaged 0.33. The issue was not that MiniMax could not reason through the assignment. Two trials used a package layout the grader could not import from the workspace root, so the hidden checks never reached the logic. GLM delivered the package shape consistently.

Where MiniMax pushed back

patchwise went the other way: MiniMax scored 1.00 while GLM averaged 0.62. GLM delivered code, but its failures were real implementation bugs: a name typo in one trial and trailing-newline diff handling in two others. MiniMax handled that fixture cleanly.

microapi was the second clear GLM win. MiniMax built a plausible framework, but missed a short-circuit middleware return in one variant and produced a bad route regex in another. migrato was a MiniMax win: it cleared the migration-contract checks where GLM lost points. The pattern I take from this: GLM was steadier at packaging and complete delivery; MiniMax could still beat it on individual hard builds.

3. Ambiguity

When tasks were vague, MiniMax filled in more system.

The observed tasks were different. They were deliberately vague, and I didn't score them because there was no single correct implementation. The point was to compare what each model decided to build when the prompt left room. In that phase, MiniMax M3 consistently added more production-shaped machinery. GLM 5.2 more often stayed closer to the plain reading of the brief.

auditlog

MiniMax added hash-chain verification, a query builder, an action decorator, and file permission hardening. GLM stayed flatter: hash chain, boolean verification, and direct query filters.
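Both submissions were built around a hash chain with verification. As a rough illustration of that shared core (not either model's actual code), here is a minimal sketch using SHA-256; the entry layout, the `GENESIS` constant, and the function names are assumptions.

```python
import hashlib
import json

# Hypothetical starting value for the first entry's "prev" field.
GENESIS = "0" * 64


def _digest(prev_hash, payload):
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"payload": payload, "prev": prev_hash,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify(log):
    """Boolean verification: True only if every link and hash is intact."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any earlier payload changes its digest, which breaks every later link. The extras described above (query builder, decorator, file permissions) would sit on top of this core.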

notifyhub

MiniMax built a notifier with priority fallback and a hard failure when no channel worked. GLM collected per-channel send results and returned a report.
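The two failure semantics are easier to compare side by side. This is a sketch under my own naming (`send_with_fallback`, `send_to_all`, `AllChannelsFailed`), with channels modelled as plain `(name, send)` pairs; neither function is taken from the submitted code.

```python
class AllChannelsFailed(RuntimeError):
    """Raised when no channel accepts the message (hypothetical name)."""


def send_with_fallback(channels, message):
    """Priority fallback: try channels in order, stop at the first success.

    `channels` is a list of (name, send) pairs, highest priority first.
    Raises AllChannelsFailed if every channel raises.
    """
    errors = {}
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            errors[name] = str(exc)
    raise AllChannelsFailed(errors)


def send_to_all(channels, message):
    """Report style: attempt every channel and return per-channel results."""
    report = {}
    for name, send in channels:
        try:
            send(message)
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"error: {exc}"
    return report
```

The first shape makes total failure impossible to ignore but hides which lower-priority channels were never tried. The second never raises, so the caller has to inspect the report.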

formvalidate

MiniMax built conditional validation with a when(...) combinator. GLM explicitly declined that extra scope and kept the validator simpler.

scheduler

MiniMax chose a fixed one-second background tick. GLM used a heap and slept until the next scheduled event, which was more idle-efficient.
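To make the idle-efficiency point concrete, here is a minimal sketch of the heap-and-sleep approach. It is an illustration, not GLM's code: the `HeapScheduler` class, its method names, and the injectable `clock` and `sleep` parameters are assumptions.

```python
import heapq
import itertools
import time


class HeapScheduler:
    """Run jobs in due order, sleeping exactly until the next one is due."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._heap = []
        # Tie-breaker so two jobs due at the same time never compare callables.
        self._counter = itertools.count()

    def schedule(self, delay, func):
        due = self._clock() + delay
        heapq.heappush(self._heap, (due, next(self._counter), func))

    def run_pending(self):
        while self._heap:
            due, _, func = heapq.heappop(self._heap)
            wait = due - self._clock()
            if wait > 0:
                self._sleep(wait)  # one sleep per job, not one per second
            func()
```

With jobs due at 2 and 5 seconds, this loop sleeps twice (2s, then 3s). A fixed one-second tick would wake five times over the same span, mostly to find nothing due.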

I'm comfortable calling MiniMax the more eager model in this set because that claim is backed by the artifacts, not by vibe. It repeatedly reached for locks, persistence, policy objects, fallback paths, decorators, and a broader, more reusable structure. That can be useful. It can also be too much. GLM's restraint is not automatically better either; sometimes it missed a useful abstraction that MiniMax supplied.

Observed tasks are evidence, not scores

Task	Open decision	MiniMax M3 tendency	GLM 5.2 tendency
jobflow	Dependency model	Object references, DFS cycle detection, locks, CLI	String references, Kahn topo-sort, bare module entry
ratelimit	Algorithm extras	Token bucket plus variable request cost and reset	Token bucket with one-token consumption
featureflags	Targeting architecture	Nested strategy hierarchy	Inline rules inside one evaluator
docsearch	Indexing shape	TF-IDF with disk-backed JSON CLI	TF-IDF with lazy posting-list cache
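The ratelimit row is a good example of how small the open decision can be in code. In the sketch below (my own, not either submission), the default `cost=1.0` corresponds to the one-token reading, while a caller-supplied `cost` and the `reset` method correspond to the extras; the class and parameter names are assumptions.

```python
import time


class TokenBucket:
    """Token bucket that refills continuously up to a fixed capacity."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def _refill(self):
        now = self._clock()
        earned = (now - self._last) * self.refill_per_sec
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def allow(self, cost=1.0):
        """Consume `cost` tokens if available; otherwise reject the request."""
        self._refill()
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False

    def reset(self):
        """Restore the bucket to full capacity."""
        self._tokens = self.capacity
        self._last = self._clock()
```

Variable cost is a few lines here, which is why I read the difference as a scope decision rather than a capability gap.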

4. Method

How the benchmark was run.

Each model got the same task brief, and for non-greenfield tasks the same starter code. It worked through a read/write/run loop, then stopped. Only after that did the hidden grader get copied into the workspace. A scored run used a fixed denominator, so an unimportable package scored zero instead of quietly shrinking the number of checks.
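The fixed-denominator rule can be sketched as follows. The function names and the check names in the usage note are hypothetical; the real grader's checks are hidden and are not reproduced here.

```python
def fixed_denominator_score(expected_checks, results):
    """Score against the full list of expected checks.

    `results` maps check name -> bool for checks that actually ran.
    A check that never ran (for example, because the package could not
    be imported) counts as a failure instead of disappearing.
    """
    passed = sum(1 for name in expected_checks if results.get(name, False))
    return passed / len(expected_checks)


def shrinking_denominator_score(results):
    """The alternative this benchmark avoided: score only what ran."""
    if not results:
        return None  # nothing ran, so there is nothing to divide by
    return sum(1 for ok in results.values() if ok) / len(results)


def mean_score(trial_scores):
    return sum(trial_scores) / len(trial_scores)
```

With four expected checks (say `import`, `create`, `close`, `reopen`) and an unimportable package, `results` is empty: the fixed score is 0.0, while the shrinking version has no score at all. If only two checks ran and both passed, the fixed score is 0.5 where the shrinking one would report 1.0. This also matches the ticketflow arithmetic: two unimportable trials at 0.0 and one trial at 1.0 (the value implied by the reported mean) average to 0.33.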

Same work: Both models saw the same brief and starter files for each task.

Three trials: Every scored task ran three times per model, with thinking disabled.

Hidden grader: The grader was copied in only after the model stopped editing.

Ambiguous spec: Vague briefs were preserved and read, not forced into pass/fail scoring.

Client configuration

Parameter	MiniMax M3	GLM 5.2
Provider	Fireworks AI	Fireworks AI
Serving path	Serverless endpoint	Serverless endpoint
API shape	OpenAI-compatible chat completions	OpenAI-compatible chat completions
Endpoint	`https://api.fireworks.ai/inference/v1`	`https://api.fireworks.ai/inference/v1`
Model ID	`accounts/fireworks/models/minimax-m3`	`accounts/fireworks/models/glm-5p2`
Service tier	priority	priority
Thinking mode	none	none
Trials per task	3	3
Input price	$0.45 / Mtok	$2.10 / Mtok
Cached input price	$0.09 / Mtok	$0.39 / Mtok
Output price	$1.80 / Mtok	$6.60 / Mtok
Cached input share	73.36%	73.06%
Avg scored run latency	45.0s	79.6s

Latency is end-to-end scored-run wall-clock time, not provider network timing.

Those client settings matter for interpreting the cost numbers. Cost was cache-aware because autonomous coding loops resend conversation context; cached input was priced separately from uncached input and output. The total production spend for building the benchmark and results was $48.88 across 605 metered runs. The scored table above is narrower: it reports only the final scored rows.
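As a sketch of what cache-aware pricing means in practice, the snippet below applies the per-Mtok prices and cached-input shares from the configuration table. The token counts in the usage note are made up for illustration, since the write-up reports average total tokens but not the input/output split; `run_cost` and `blended_input_price` are my own helper names.

```python
# Prices in USD per million tokens and cached-input shares, copied from
# the client configuration table.
PRICES = {
    "MiniMax M3": {"input": 0.45, "cached_input": 0.09, "output": 1.80,
                   "cached_share": 0.7336},
    "GLM 5.2": {"input": 2.10, "cached_input": 0.39, "output": 6.60,
                "cached_share": 0.7306},
}


def blended_input_price(p):
    """Effective input price per Mtok once the cached share is applied."""
    share = p["cached_share"]
    return (1 - share) * p["input"] + share * p["cached_input"]


def run_cost(p, input_tokens, output_tokens):
    """Cost in USD of one run, pricing cached and uncached input separately."""
    cached = input_tokens * p["cached_share"]
    uncached = input_tokens - cached
    total = (uncached * p["input"]
             + cached * p["cached_input"]
             + output_tokens * p["output"])
    return total / 1_000_000
```

With the reported cache shares, the blended input price comes out near $0.19 per Mtok for MiniMax and $0.85 per Mtok for GLM, well below the uncached list prices. For a hypothetical run with 100,000 input tokens and 5,000 output tokens, `run_cost` gives about $0.028 and $0.118 respectively.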

5. Assessment

Where I'd use each model.

GLM 5.2 is the safer pick when the task is a hard from-scratch build and the result needs to arrive as a complete, runnable project. Its greenfield edge was the clearest difference in the scored set. It cost more and took longer, but it produced more full-pass runs where the task started from nothing.

MiniMax M3 is the value pick for a lot of worker traffic. It was much cheaper, faster, and effectively tied on existing-code tasks. If the work is a bug fix, feature addition, or repair-to-green loop under review, MiniMax looks strong enough to be the default worker.

I wouldn't make either one the top-level coordinator by default. The best shape is still a frontier coordinator or judge above them: GPT-5.5 or Claude Opus deciding what to delegate, checking the finished work, and rerunning narrow pieces when the answer looks wrong. These models make the worker layer much more serious, not the coordinator layer unnecessary.

Sources: Thinkbench evaluation harness, downloadable result bundle, and downloadable evaluation suite. Runner configuration: MiniMax M3 and GLM 5.2 on Fireworks AI serverless endpoints, priority tier, thinking disabled, three trials per task.