Zhipu AI's GLM-5.2 scores 62.1 on SWE-bench Pro, beating GPT-5.5 at one-sixth the API cost
SWE-bench Pro is the benchmark most practitioners treat as the real test of a coding model: it uses real GitHub issues, not hand-crafted tasks. Zhipu AI's GLM-5.2, released on June 16, 2026, scored 62.1 on that benchmark. GPT-5.5 scored 58.6. The model is open-weight, MIT-licensed, and priced at $1.40 per million input tokens through the Z.ai API.
What
Z.ai (the rebranded name for Zhipu AI) published weights for GLM-5.2 on Hugging Face on June 16, 2026 under the MIT license, with no usage restrictions. The model has 753 billion total parameters in a Mixture-of-Experts architecture, with roughly 40 billion parameters active per token. Its context window is 1 million tokens.
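To put those parameter counts in perspective, here is a rough back-of-the-envelope sketch in Python. The parameter figures come from the paragraph above; the bytes-per-parameter values are assumptions for illustration only, since the precision of the published weights is not stated here. The estimate covers weights alone and ignores KV cache, activations, and runtime overhead.

```python
# Figures from the article: 753B total parameters, roughly 40B active per token.
TOTAL_PARAMS = 753e9
ACTIVE_PARAMS = 40e9

# Assumed storage precisions (hypothetical; not confirmed for the released weights).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "4-bit": 0.5}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for each token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"Active per token: {frac:.1%} of total parameters")
    for name, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
        print(f"{name}: about {gb:,.1f} GB for weights only")
```

Only about 5% of the parameters are active per token, which keeps per-token compute low, but all 753 billion still have to sit in memory. At an assumed 2 bytes per parameter that is roughly 1.5 TB of weights, so self-hosting is a multi-GPU undertaking.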
Benchmark results from VentureBeat's coverage show GLM-5.2 outperforming GPT-5.5 across four coding-focused evaluations:
- SWE-bench Pro: 62.1 vs. GPT-5.5's 58.6
- FrontierSWE Dominance: 74.4% vs. GPT-5.5's 72.6% (Claude Opus 4.8 led at 75.1%)
- MCP-Atlas tool-use score: 77.0 vs. GPT-5.5's 75.3
- PostTrainBench extended engineering: 34.3% vs. GPT-5.5's 25.0%
On Terminal-Bench 2.1, the proprietary models maintain an edge: Claude Opus 4.8 scored 85.0, GPT-5.5 scored 84.0, and GLM-5.2 scored 81.0.
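The gaps are easier to compare as margins. The sketch below takes the scores listed above and computes GLM-5.2's lead over GPT-5.5 in points and as a percentage of GPT-5.5's score. The benchmarks use different scales, so the point differences are not comparable across rows; the function name and data layout are illustrative choices, not from any source.

```python
# (GLM-5.2 score, GPT-5.5 score) as reported in the article.
SCORES = {
    "SWE-bench Pro": (62.1, 58.6),
    "FrontierSWE Dominance": (74.4, 72.6),
    "MCP-Atlas": (77.0, 75.3),
    "PostTrainBench": (34.3, 25.0),
    "Terminal-Bench 2.1": (81.0, 84.0),
}


def margins(scores: dict) -> list:
    """Return (benchmark, point difference, relative difference in %) per row.

    Positive values mean GLM-5.2 is ahead of GPT-5.5.
    """
    rows = []
    for name, (glm, gpt) in scores.items():
        diff = glm - gpt
        rows.append((name, round(diff, 1), round(100 * diff / gpt, 1)))
    return rows


if __name__ == "__main__":
    for name, points, relative in margins(SCORES):
        print(f"{name:<24} {points:+5.1f} points  ({relative:+5.1f}% relative)")
```

The PostTrainBench lead (9.3 points, about 37% relative) is by far the largest. The other three wins are within a few points, and on Terminal-Bench 2.1 GLM-5.2 trails by 3.0.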
API pricing is $1.40 per million input tokens and $4.40 per million output tokens, per the Z.ai API docs linked from VentureBeat's piece. GPT-5.5 API pricing is roughly six times higher, according to VentureBeat's comparison table.
The model ships with selectable thinking modes: "Max" pushes reasoning depth at the cost of roughly 85,000 output tokens per task; "High" trades a few benchmark points for roughly half that token output.
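The thinking modes translate directly into per-task cost. The following is a minimal sketch using the Z.ai prices quoted above. The "High" figure of 42,500 tokens is simply half of 85,000, taken from "roughly half", and the input token count in the usage example is a hypothetical placeholder, since the article gives no input-side figure.

```python
# Z.ai API prices from the article, in US dollars per million tokens.
INPUT_PRICE_PER_M = 1.40
OUTPUT_PRICE_PER_M = 4.40

# Approximate output tokens per task by thinking mode.
# "Max" is from the article; "High" assumes exactly half, per "roughly half".
OUTPUT_TOKENS = {"Max": 85_000, "High": 85_000 // 2}


def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the quoted per-million-token prices."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    hypothetical_input_tokens = 200_000  # illustrative assumption only
    for mode, out_tokens in OUTPUT_TOKENS.items():
        output_only = task_cost(0, out_tokens)
        total = task_cost(hypothetical_input_tokens, out_tokens)
        print(f"{mode}: ${output_only:.3f} output only, ${total:.3f} with input")
```

On output tokens alone, a "Max" task costs about $0.37 and a "High" task about $0.19. Across thousands of agent runs, the mode choice can matter as much as the choice of model.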
Why it matters
For developers choosing a model for agentic coding, GLM-5.2 changes the cost calculus. Until now, matching GPT-5.5's SWE-bench Pro performance required paying GPT-5.5 rates. That is no longer true. The weights are available on Hugging Face under the MIT license, meaning teams can self-host the model on their own compute, bypassing API rate limits and per-token billing entirely.
Latent Space (Swyx) called GLM-5.2 "the first open model to genuinely pass the frontier model vibe check," noting that GLM 5.0 and 5.1 both benchmaxxed without holding up in practice. The June 19 issue of AINews cited multiple independent developers corroborating the benchmark numbers with real-task results, which is the bar earlier GLM versions did not clear.
Semgrep's security research team went further: their June 2026 evaluation found GLM-5.2 outperforming Claude on their internal cybersecurity benchmarks, specifically on vulnerability-detection tasks. For teams running AI on security workflows, that is a direct evaluation signal from practitioners with a specific use case.
The availability of frontier-grade weights under MIT also has regulatory relevance. US export controls on Anthropic's Fable 5 models disrupted teams outside the US in June 2026. A high-performing open model from a Chinese lab, self-hostable anywhere, is a structural alternative for those teams.
What to watch next
Developer adoption over the next 30 days is the real test. SWE-bench Pro leads have not always held up in practice for Chinese open models. If GLM-5.2 displaces GPT-5.5 in community agentic-coding stacks (Cline provider lists, OpenClaw configs), that would close the gap in production, not just on benchmarks. Z.ai also forecast "Open Fable," an open-weight model targeting Mythos-class capability, by December 2026. That is a vendor self-claim, but if delivered, it would be the first open-weight model at that capability tier from any lab.
Sources
- VentureBeat: Z.ai's open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost - primary coverage, June 16, 2026
- Latent Space AINews: GLM > GPT? GLM-5.2 passes vibe check; Z.ai forecasts Open Fable by December - primary coverage, June 19, 2026
- Semgrep: We have Mythos at Home - GLM 5.2 beats Claude in our Cyber Benchmarks - secondary, practitioner evaluation