GLM 5.2: What Z.ai's Open-Weight Flagship Actually Ships (and Costs)

Someone in a Discord I follow asked a simple question last week: "Is GLM 5.2 a real thing or did I dream it?" Fair question. The release moved fast, new version numbers from Z.ai now arrive every couple of months, and half the benchmark numbers floating around belong to a different model entirely. So let me sort out what's actually true, what you can run on your own boxes, and where the marketing gets ahead of the receipts.

Short answer: yes, GLM-5.2 is real. Z.ai (the company formerly known as Zhipu AI) announced it on June 13, 2026, with open weights landing on Hugging Face under an MIT license a few days later, dated June 16 in their release notes. It's the current flagship, sitting above GLM-5.1 (April 7, 2026) and GLM-5 (February 12, 2026). That's a new flagship roughly every two months, which tells you something about the pace over there.

The interesting part for a small operator isn't that it exists. It's that the weights are genuinely open, the license is permissive enough to use commercially, and the API price is low enough that you don't have to self-host unless you want to. Let me walk through the parts that matter.

What GLM-5.2 actually is

It's a Mixture-of-Experts model. The headline parameter count is messy, and I'd rather be honest than pick a clean-looking number. The Hugging Face card lists 753B total parameters. MarkTechPost reported 744B. The vLLM deployment recipe says 743B with 39B active per token. So call it roughly 750B total and about 40B active, with a documented spread of about 10B depending on who you ask. The 744B figure is actually the GLM-5 base, which is part of why it keeps showing up. Don't quote 753B as if it were gospel.
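If you want to see how far apart those sources really are, here is a minimal sketch of the arithmetic. The figures are the ones quoted above; the function names are mine, and pairing the 39B active figure with the other two totals is my assumption, since only the vLLM recipe states an active count.

```python
# Total-parameter figures quoted in this post, in billions, keyed by source.
reported_total_b = {
    "Hugging Face card": 753,
    "MarkTechPost": 744,
    "vLLM recipe": 743,
}
ACTIVE_B = 39  # active parameters per token, per the vLLM recipe only


def spread_b(counts):
    """Gap between the largest and smallest reported totals, in billions."""
    return max(counts.values()) - min(counts.values())


def active_share(active_b, total_b):
    """Fraction of all parameters used for a single token."""
    return active_b / total_b


if __name__ == "__main__":
    print(f"spread across sources: {spread_b(reported_total_b)}B")
    for source, total in reported_total_b.items():
        # Assumption: the 39B active figure applies whichever total is right.
        print(f"{source}: {active_share(ACTIVE_B, total):.1%} active per token")
```

Whichever total you believe, only about one parameter in twenty does work on any given token, which is the whole point of the MoE design.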

The architecture, per the deployment recipes, runs 78 transformer layers with 256 routed experts (8 active per token), plus a single Multi-Token Prediction layer for EAGLE-style speculative decoding. The context window is the real selling point: 1M tokens, and Z.ai describes it as "lossless" rather than the degraded long-context you sometimes get. Max output lands somewhere around 128K to 131K tokens per response. MarkTechPost says 131,072, Z.ai's own doc page says 128K. Those may well be the same limit, since 131,072 is exactly 128 × 1,024, but I'll give you the range instead of pretending I know which rounding they used.
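To make "256 routed experts, 8 active per token" concrete, here is a toy sketch of top-k expert selection. This is the generic MoE idea, not Z.ai's actual router: the scores are random stand-ins, and `route_token` is a name I made up for illustration.

```python
import random

NUM_EXPERTS = 256  # routed experts, per the deployment recipes
TOP_K = 8          # experts active per token


def route_token(router_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical scores; a real router derives them from the token's hidden state.
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    chosen = route_token(scores)
    print("experts used for this token:", chosen)
    print(f"share of routed experts used: {len(chosen) / NUM_EXPERTS:.3%}")
```

Every token picks its own 8, so the full set of experts still has to sit in memory even though only a sliver runs per token. That is why the memory bill below tracks the total parameter count, not the active one.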

There are two thinking-effort modes, High and Max. Max is the one Z.ai recommends for complex coding work. And there's a sparse-attention trick called IndexShare that reuses the same indexer across every four sparse-attention layers, cutting per-token FLOPs by 2.9x at 1M context. That's the thing that makes a million-token window financially survivable instead of a science project.
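Here is my reading of "reuses the same indexer across every four sparse-attention layers" as a rough sketch. Two loud assumptions: I treat all 78 layers as sparse-attention layers, and I assume the sharing groups are contiguous blocks of four. Neither is confirmed by the sources I read, and the function names are hypothetical.

```python
import math

NUM_LAYERS = 78   # transformer layers, per the deployment recipes
SHARE_GROUP = 4   # layers that share one indexer result (assumed contiguous)


def indexer_owner(layer, group=SHARE_GROUP):
    """Layer whose indexer output this layer reuses (first layer of its group)."""
    return (layer // group) * group


def indexer_runs(num_layers=NUM_LAYERS, group=SHARE_GROUP):
    """Indexer passes per token when each group computes the index once."""
    return math.ceil(num_layers / group)


if __name__ == "__main__":
    print("layer 5 reuses the index from layer", indexer_owner(5))
    print("indexer passes per token:", indexer_runs(), "instead of", NUM_LAYERS)
```

Under those assumptions the indexer itself runs roughly a quarter as often. That is not the same thing as the 2.9x per-token FLOPs figure, which presumably covers the whole forward pass and not only the indexer.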

The license is the headline

Here's what I care about as someone who ships free tools for small businesses: the weights are MIT-licensed. The LICENSE file on Hugging Face literally reads "MIT License / Copyright (c) 2026 Zhipu AI." That's about as permissive as it gets. Commercial use, modification, redistribution, all fine.

The model ships in two precisions. There's full BF16 at zai-org/GLM-5.2, and an FP8 build at zai-org/GLM-5.2-FP8 whose card lists F32, BF16, and F8_E4M3 tensor types. FP8 is the recommended deployment for most people because it roughly halves the memory bill versus BF16.

Can you run it yourself? Depends on your hardware budget

This is where reality bites. Serving GLM-5.2 at FP8 needs roughly a terabyte of VRAM. The FP8 weights alone run somewhere around 744 to 756GB, so single-node FP8 means 8x H200 or 8x H20 (141GB each), which gets you 1,128GB total. If you want the full 1M context with KV-cache headroom, the guides point you at 8x B200. That is not a homelab. That is a small mortgage.
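The budget math is worth doing once by hand. A minimal sketch using only the numbers in the paragraph above; what actually fits in the leftover memory depends on your context length and serving settings, which I am not modeling here.

```python
GPU_COUNT = 8
GPU_MEM_GB = 141               # per-GPU memory quoted above
WEIGHTS_GB_RANGE = (744, 756)  # FP8 weight size range quoted above


def node_memory_gb(count=GPU_COUNT, per_gpu=GPU_MEM_GB):
    """Total accelerator memory on a single node."""
    return count * per_gpu


def headroom_gb(weights_gb, count=GPU_COUNT, per_gpu=GPU_MEM_GB):
    """Memory left for KV cache, activations, and runtime overhead."""
    return node_memory_gb(count, per_gpu) - weights_gb


if __name__ == "__main__":
    total = node_memory_gb()
    print(f"node total: {total}GB")
    for weights in WEIGHTS_GB_RANGE:
        left = headroom_gb(weights)
        print(f"weights {weights}GB -> {left}GB left, {left / GPU_COUNT:.1f}GB per GPU")
```

Roughly a third of the node is left over after the weights load, and that remainder is what the KV cache has to live in. It is easy to see why the guides reach for bigger GPUs once you want the full 1M window.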

The vLLM recipe is straightforward once you have the metal:

vllm serve zai-org/GLM-5.2-FP8 \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45

That --tensor-parallel-size 8 shards the weights across all eight GPUs. SGLang has its own cookbook entry that adds expert parallelism (--enable-moe-ep) and RadixAttention, which caches shared prefixes across agent steps. If you're running a coding agent that replays a lot of the same context every turn, RadixAttention is the feature that saves you money.
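To show why prefix caching matters for agents, here is a toy model of the idea. It is not SGLang's radix tree: it only compares each turn with the one before it, and the token lists are made-up placeholders.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_tokens(turns, reuse_prefix):
    """Total tokens that must be prefilled across a sequence of agent turns."""
    total = 0
    previous = []
    for tokens in turns:
        cached = shared_prefix_len(previous, tokens) if reuse_prefix else 0
        total += len(tokens) - cached
        previous = tokens
    return total


if __name__ == "__main__":
    system = ["sys"] * 1000  # stand-in for a long shared system prompt
    turns = [
        system + ["step1"],
        system + ["step1", "out1", "step2"],
        system + ["step1", "out1", "step2", "out2", "step3"],
    ]
    print("without prefix reuse:", prefill_tokens(turns, reuse_prefix=False))
    print("with prefix reuse:   ", prefill_tokens(turns, reuse_prefix=True))
```

In this toy run the long system prompt is paid for once instead of on every turn. A real agent loop with a large repo in context replays far more than a thousand tokens per step, so the gap only widens.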

For the rest of us who don't own an 8x H200 node, the quantized route is the realistic one. Unsloth publishes Dynamic GGUF quants that shrink the thing dramatically:

| Quant | Size | Reduction | Where it runs |
|---|---|---|---|
| Full model | 1.51TB | n/a | data-center territory |
| 8-bit Dynamic | ~810GB | n/a | multi-GPU server |
| 2-bit Dynamic | ~239GB | -84% | 256GB unified-memory Mac, or 24GB GPU + ~256GB RAM |
| 1-bit Dynamic | ~217GB | -86% | tight, for experiments |

That 2-bit row is the one that changes the math. A 239GB quant fits directly on a 256GB unified-memory Mac Studio, or it runs on a single 24GB GPU plus about 256GB of system RAM using MoE offloading through llama.cpp. It's slow compared to a proper server, but it runs. For a one-person shop wanting to keep data local, that's a genuine option instead of a fantasy.
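You can sanity-check the table yourself. A small sketch, assuming 1TB = 1,000GB and using the approximate sizes listed above; `fits` is a deliberately naive check that ignores KV cache and OS overhead, so treat a "yes" as "the weights load", not "it runs comfortably".

```python
FULL_GB = 1510  # 1.51TB full model, per the table above

quants_gb = {
    "8-bit Dynamic": 810,
    "2-bit Dynamic": 239,
    "1-bit Dynamic": 217,
}


def reduction_pct(size_gb, full_gb=FULL_GB):
    """Size reduction versus the full model, as a whole-number percentage."""
    return round((1 - size_gb / full_gb) * 100)


def fits(size_gb, memory_gb):
    """Naive check: do the weights alone fit in the given memory?"""
    return size_gb <= memory_gb


if __name__ == "__main__":
    for name, size in quants_gb.items():
        print(
            f"{name}: -{reduction_pct(size)}%, "
            f"fits in 256GB unified memory: {fits(size, 256)}, "
            f"fits in 24GB GPU + 256GB RAM: {fits(size, 24 + 256)}"
        )
```

The 2-bit quant leaves only about 17GB of slack on a 256GB Mac, so expect to keep the context short there.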

Supported inference engines are SGLang (v0.5.13.post1+), vLLM (v0.23.0+), Transformers (v0.5.12+), KTransformers, and Unsloth (v0.1.47-beta+). If you've deployed any recent open model, this stack will feel familiar.

The benchmarks, with the asterisks

First, an honest caveat that the marketing skips: Z.ai published no benchmark scores for GLM-5.2 at launch. The numbers came later, from Z.ai's own Hugging Face blog table and from third-party evals. So when you see a leaderboard screenshot, check the date.

Second, a correction you'll see violated everywhere. The 77.8 on SWE-bench Verified belongs to GLM-5, the older model, not GLM-5.2. GLM-5 hit 77.8 on SWE-bench Verified, close to Claude Opus 4.5 (80.9) and GPT-5.2 (80.0). Z.ai never published a clean SWE-bench Verified number for 5.2. Its coding numbers live on different benchmarks. So if someone tells you "GLM 5.2 gets 77.8 on SWE-bench Verified," they're quoting last quarter's model.

Here's what GLM-5.2 actually posted, sourced from Z.ai's official Hugging Face blog table, which lists the competitor columns too:

| Benchmark | GLM-5.2 | Claude Opus 4.8 | GPT-5.5 |
|---|---|---|---|
| SWE-bench Pro | 62.1 | 69.2 | 58.6 |
| FrontierSWE | 74.4 | 75.1 | 72.6 |
| Terminal-Bench 2.1 | 81.0 | 85.0 | 84.0 |
| GPQA-Diamond | 91.2 | n/a | n/a |
| DeepSWE | 46.2 | n/a | n/a |
| HLE (with tools) | 54.7 | n/a | n/a |

Read that table honestly. GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6) and edges it on FrontierSWE (74.4 vs 72.6), while landing in a near-tie with Claude Opus 4.8 on FrontierSWE (74.4 vs 75.1). On the harder agentic stuff like SWE-bench Pro and Terminal-Bench, Opus 4.8 still leads, and GPT-5.5 is also ahead on Terminal-Bench 2.1 (84.0 vs 81.0). So it's "competitive with the frontier, a step behind the very top on some tasks," not "new king."
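If you would rather see the gaps than eyeball them, here is a quick sketch over the three rows that have competitor numbers. The scores are copied from the table; the `gap` helper is mine.

```python
scores = {
    "SWE-bench Pro": {"GLM-5.2": 62.1, "Claude Opus 4.8": 69.2, "GPT-5.5": 58.6},
    "FrontierSWE": {"GLM-5.2": 74.4, "Claude Opus 4.8": 75.1, "GPT-5.5": 72.6},
    "Terminal-Bench 2.1": {"GLM-5.2": 81.0, "Claude Opus 4.8": 85.0, "GPT-5.5": 84.0},
}


def gap(benchmark, rival, model="GLM-5.2"):
    """Points by which `model` leads (positive) or trails (negative) `rival`."""
    row = scores[benchmark]
    return round(row[model] - row[rival], 1)


if __name__ == "__main__":
    for benchmark in scores:
        for rival in ("Claude Opus 4.8", "GPT-5.5"):
            print(f"{benchmark} vs {rival}: {gap(benchmark, rival):+.1f}")
```

Positive means GLM-5.2 is ahead. Two of the six comparisons come out positive, both against GPT-5.5.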

The generational jump is the more impressive story. Z.ai's doc page compares 5.2 against 5.1 directly: Terminal-Bench 2.1 went from 62.0 to 81.0, SWE-bench Pro from 58.4 to 62.1, and DeepSWE jumped from 18 to 46.2. That's a real improvement in one point release, not a rounding-error refresh. HLE without tools sits at 40.5, climbing to 54.7 once you give it tools.
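The 5.1 to 5.2 jump reads differently in absolute points than as a ratio, so here is a small sketch that prints both. The before and after values are the ones quoted in the paragraph above; nothing else is assumed.

```python
# (GLM-5.1 score, GLM-5.2 score), as quoted from Z.ai's doc page.
generational = {
    "Terminal-Bench 2.1": (62.0, 81.0),
    "SWE-bench Pro": (58.4, 62.1),
    "DeepSWE": (18.0, 46.2),
}


def delta(before, after):
    """Absolute gain in points."""
    return round(after - before, 1)


def ratio(before, after):
    """How many times the old score the new score is."""
    return round(after / before, 2)


if __name__ == "__main__":
    for name, (before, after) in generational.items():
        print(f"{name}: +{delta(before, after)} points, {ratio(before, after)}x")
```

DeepSWE is the outlier: the gain there is large in both points and ratio, while SWE-bench Pro moved only a few points.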

What it costs, and the cost angle that drives the headlines

API pricing runs about $1.40 per million input tokens and $4.40 per million output tokens. VentureBeat's headline framing is that GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks "for 1/6th the cost." That's the actual pitch: not "best model in the world" but "close enough to the frontier at a fraction of the price, with open weights you can self-host if you ever need to." For agentic coding work that burns tokens by the millions, the price gap compounds fast.
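To put the price in terms of a real bill, here is a rough estimator. The rates are the approximate ones quoted above. The monthly token counts are a hypothetical workload I made up, and the "premium tier" line just multiplies by six to mirror the headline framing. It is not a quoted price for any other model.

```python
INPUT_USD_PER_M = 1.40   # approximate, per million input tokens
OUTPUT_USD_PER_M = 4.40  # approximate, per million output tokens


def api_cost_usd(input_tokens, output_tokens,
                 in_rate=INPUT_USD_PER_M, out_rate=OUTPUT_USD_PER_M):
    """Estimated API spend in USD for a given token volume."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


if __name__ == "__main__":
    # Hypothetical month of agentic coding: 50M tokens in, 5M tokens out.
    glm = api_cost_usd(50_000_000, 5_000_000)
    print(f"GLM-5.2 estimate: ${glm:,.2f}")
    # Assumption: "1/6th the cost" taken literally as a 6x multiplier.
    print(f"implied premium-tier bill at 6x: ${glm * 6:,.2f}")
```

Agent workloads tend to be input-heavy because they re-read context every step, so the input rate usually dominates the bill.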

One claim I'm going to flag rather than repeat: several outlets reported GLM-5.2 was trained entirely on Huawei Ascend chips with no Nvidia hardware. That showed up in Decrypt and elsewhere, but I couldn't find it confirmed by a primary Z.ai source. Treat it as a press claim, not an established fact. If the chip-sovereignty angle matters to your decision, wait for the technical report rather than a headline.

Should you care?

If you're already paying frontier API prices for agentic coding and the bill stings, GLM-5.2 is worth a real evaluation. The open weights mean no vendor can pull the model out from under you, the MIT license means you can build on it commercially, and the price is roughly a sixth of the premium tier for comparable long-horizon coding performance. The 1M context with IndexShare is the part I'd test hardest, because long-context that stays coherent is where most models quietly fall apart.

If you need the absolute top score on every coding benchmark, Opus 4.8 still wins a few of them, and you should size that against what the extra points are worth to you. And if you were hoping for a small "Air" or "Flash" variant to run cheap, I couldn't find one. Prior generations had GLM-4.5-Air and GLM-4.7-Flash, but no lightweight 5.2 variant is confirmed as of this writing. For now it's the big model or a quant of the big model.

This is the bet behind my book The $20 Dollar Agency: you can run a genuinely capable AI stack for the price of a couple of lunches. An open-weight model that lands within a few points of the frontier at roughly a sixth of the API cost is exactly the kind of move that makes a tiny budget go further.

Fact-check notes and sources

  • GLM-5.2 is real, the current Z.ai flagship, announced June 13, 2026, weights dated June 16 in release notes (after GLM-5.1 on April 7 and GLM-5 on February 12): Z.ai release notes and VentureBeat.
  • Parameter count is genuinely disputed across sources: Hugging Face card says 753B, MarkTechPost says 744B, and the vLLM recipe says 743B with 39B active. I called it ~750B total / ~40B active on purpose. The 744B figure is the GLM-5 base. HF card: zai-org/GLM-5.2.
  • 1M-token context and High/Max thinking modes confirmed. Max output is reported as 131,072 by MarkTechPost but 128K on Z.ai's own doc page, so I used the "~128K-131K" range: MarkTechPost.
  • IndexShare reuses the indexer across every four sparse-attention layers, cutting per-token FLOPs 2.9x at 1M context: Hugging Face GLM-5.2 card.
  • MIT license, "Copyright (c) 2026 Zhipu AI": LICENSE file.
  • FP8 build and tensor types (F32, BF16, F8_E4M3): zai-org/GLM-5.2-FP8.
  • ~1TB VRAM for FP8 serving; 8x H200 or 8x H20 (141GB each) single-node, 8x B200 for full 1M context: Spheron self-host guide.
  • Unsloth Dynamic GGUF sizes (2-bit ~239GB, 1-bit ~217GB, 8-bit ~810GB, full 1.51TB) and the 256GB Mac / 24GB GPU + 256GB RAM options: Unsloth docs.
  • vLLM --tensor-parallel-size 8 recipe and SGLang RadixAttention / expert-parallelism details: vLLM recipe and SGLang cookbook.
  • Benchmark table (GLM-5.2 62.1 SWE-bench Pro, 74.4 FrontierSWE, 81.0 Terminal-Bench 2.1, 91.2 GPQA-Diamond, 46.2 DeepSWE, 54.7 HLE w/tools) and the competitor columns (Opus 4.8 = 69.2 / 75.1 / 85.0; GPT-5.5 = 58.6 / 72.6 / 84.0) come from Z.ai's own official blog table: Z.ai HF blog and HF model card. Terminal-Bench is 81.0 on the official table; I did not cite the 82.7 upper bound some secondary sources use.
  • Generational gains vs GLM-5.1 (Terminal-Bench 81.0 vs 62.0, SWE-bench Pro 62.1 vs 58.4): Z.ai GLM-5.2 doc.
  • Z.ai published no benchmarks at launch: MarkTechPost.
  • The 77.8 SWE-bench Verified score belongs to GLM-5, not GLM-5.2 (GLM-5 close to Opus 4.5 at 80.9, GPT-5.2 at 80.0): Weights & Biases report and digitalapplied analysis.
  • API pricing ~$1.40/M input, ~$4.40/M output: LushBinary pricing guide and LLM-Stats. "1/6th the cost" framing: VentureBeat.
  • The "trained entirely on Huawei Ascend, no Nvidia" claim is press-sourced (Decrypt) and not confirmed by a primary Z.ai source. I flagged it as unverified rather than stating it as fact.
  • GLM-5 technical report "GLM-5: From Vibe Coding to Agentic Engineering," code repo: github.com/zai-org/GLM-5.
  • No GLM-5.2-Air or GLM-5.2-Flash variant could be confirmed on Hugging Face or in Z.ai docs as of this writing.

This post is informational, not purchasing or investment advice. Model names, benchmarks, and companies are referenced as nominative fair use. Figures were accurate as of June 2026 and several are vendor-reported rather than independently verified; check the primary sources before relying on them. No affiliation is implied.
