z.ai's GLM-5 Claims Record-Low Hallucination Rate. It's More Complicated Than That.
A 744B open-source model improved its hallucination benchmark score by 35 points and costs a fifth of what Claude Opus does. But one safety researcher is already worried.
Chinese AI startup z.ai (Zhipu) released GLM-5, a 744-billion-parameter open-source model under the MIT License. The headline claim: the lowest hallucination rate of any model tested on the Artificial Analysis AA-Omniscience Index. That sounds big. What does it actually mean?
The model uses a Mixture-of-Experts (MoE) architecture: 744B total parameters, but only 40B active per token, about 5% of the model. You get the knowledge of a 744B model at roughly the compute cost of a 40B one. That's how z.ai can charge $1.00 per million input tokens while Claude Opus charges $5.
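The arithmetic behind that claim is simple to check. The parameter counts below come from z.ai's announcement; the routing function is a generic top-k MoE gate for illustration, not GLM-5's actual router, whose design z.ai hasn't detailed:

```python
# Parameter counts from z.ai's announcement; the router below is a
# generic top-k MoE sketch, not GLM-5's actual routing code.

TOTAL_PARAMS = 744e9   # all experts combined
ACTIVE_PARAMS = 40e9   # parameters actually consulted per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {active_fraction:.1%}")  # ~5.4%

def route_top_k(gate_scores, k):
    """Generic top-k MoE routing: each token goes to the k experts with
    the highest gate scores; the remaining experts stay idle for it."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

# e.g. 8 experts, each token routed to the 2 highest-scoring ones
scores = [0.1, 0.7, 0.05, 0.9, 0.02, 0.3, 0.15, 0.4]
print(route_top_k(scores, k=2))  # [3, 1]
```

Per-token compute scales with the active experts, not the full parameter count, which is why a 744B MoE can be priced like a much smaller dense model.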
What "record-low hallucination" actually means
The score comes from Artificial Analysis, an independent benchmarking platform. Their AA-Omniscience Index measures two things: how often a model abstains from answering when it doesn't know, and how accurate it is on factual QA when it does answer. GLM-5 scored -1 on this index, a 35-point improvement over GLM-4.5.
That is not the same as "this model won't hallucinate in your production app." The benchmark asks 6,000 factual questions and checks two things: does the model answer correctly, and does it say "I don't know" instead of making something up? It doesn't test long-form generation or multi-step reasoning, which is where hallucinations actually compound in practice.
There's a subtlety here. GLM-5's overall Omniscience Index score (-1) is lower than models like Gemini 3 Pro Preview (13), Claude Opus 4.6 (11), or Claude Opus 4.5 (10). Those models score higher overall because they answer more questions correctly. But the "record-low hallucination" claim is about the hallucination rate specifically. When GLM-5 does answer, it makes things up less often than any other model tested. It achieves this by abstaining more aggressively: saying "I don't know" instead of guessing. That's a real tradeoff. You get fewer wrong answers, but also fewer answers.
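The abstention tradeoff is easier to see with numbers. This toy scorer uses an assumed definition (hallucination rate = wrong answers divided by questions attempted; accuracy = correct answers divided by all questions), which may not match Artificial Analysis's exact methodology:

```python
# Toy scorer under an assumed definition: hallucination rate counts wrong
# answers only among attempted questions, so abstaining lowers it even as
# overall accuracy (correct / all questions) falls.

def score(outcomes):
    """outcomes: list of 'correct', 'wrong', or 'abstain' per question."""
    attempted = [o for o in outcomes if o != "abstain"]
    wrong = outcomes.count("wrong")
    hallucination_rate = wrong / len(attempted) if attempted else 0.0
    accuracy = outcomes.count("correct") / len(outcomes)
    return hallucination_rate, accuracy

# One model guesses on everything; the other abstains when unsure.
guesser   = ["correct"] * 6 + ["wrong"] * 4
abstainer = ["correct"] * 5 + ["wrong"] * 1 + ["abstain"] * 4

print(score(guesser))    # (0.4, 0.6)
print(score(abstainer))  # (0.1666..., 0.5)
```

The abstainer hallucinates far less (17% vs 40%) while answering fewer questions correctly overall (50% vs 60%), which is exactly the pattern in GLM-5's scores: record-low hallucination rate, middling overall index.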
Where does GLM-5 actually sit?
On coding, GLM-5 scored 77.8 on SWE-bench Verified. That's close to but behind Claude Opus 4.5 (80.9). On Vending Bench 2, it came in #1 among open-source models. Vending Bench 2 simulates running a business for a year and scores by final bank balance; GLM-5 finished with $4,432.
GLM-5 beats every other open-weight model on most benchmarks. Against GPT-5 and Claude Opus, it wins some and loses some. Best open-weight model right now? Probably. Same tier as the best closed models? Not yet. Closing the gap? Maybe.
One thing we don't have clarity on: language-specific performance. z.ai is a Chinese lab, and MoE models can have uneven performance across languages depending on training data mix. z.ai's blog doesn't break down results by language.
What is Slime?
z.ai built a custom training infrastructure called "Slime" to solve a specific problem: conventional reinforcement learning (RL) pipelines waste most of their time on generation bottlenecks. According to z.ai, these bottlenecks consume over 90% of RL training time. Slime has been z.ai's RL backbone since GLM-4.5 in mid-2025.
Slime is an asynchronous RL framework that breaks the lockstep between trajectory generation and training. Instead of waiting for each batch to finish before starting the next, trajectories run independently. So z.ai can iterate on agent behavior much faster than with a standard RL setup. The code is open-source on GitHub.
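The core idea can be sketched with a producer-consumer queue: generation workers push trajectories as soon as each one finishes, and the trainer consumes whatever is ready instead of waiting for a full synchronized batch. This is a hypothetical illustration of the pattern, not Slime's actual code (which is on GitHub):

```python
# Hypothetical sketch of asynchronous RL decoupling, NOT Slime's real code:
# rollout workers and the trainer share a queue instead of running in lockstep.
import queue
import threading
import time

traj_queue = queue.Queue(maxsize=8)

def generator(worker_id, n_trajectories):
    for i in range(n_trajectories):
        time.sleep(0.01)                 # stand-in for a slow rollout
        traj_queue.put((worker_id, i))   # hand off the moment it's done

def trainer(total_expected):
    consumed = 0
    while consumed < total_expected:
        batch = [traj_queue.get()]       # block only until *something* is ready
        while len(batch) < 4 and not traj_queue.empty():
            batch.append(traj_queue.get())
        consumed += len(batch)           # stand-in for a gradient step

workers = [threading.Thread(target=generator, args=(w, 5)) for w in range(3)]
train = threading.Thread(target=trainer, args=(15,))
for t in workers + [train]:
    t.start()
for t in workers + [train]:
    t.join()
print("trained on all 15 trajectories without batch lockstep")
```

In a synchronous setup, the trainer would sit idle until the slowest rollout in every batch finished; here the GPU-bound training step and the generation workers overlap, which is how a framework like this attacks the 90% generation bottleneck z.ai describes.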
What's verified vs. what's vendor-reported
The "independently verified" claims in this piece come from Artificial Analysis, a benchmarking platform (not a peer-reviewed journal) that ran the hallucination benchmark itself and confirmed the pricing. Everything else is z.ai's own reporting. The model is open, so anyone can reproduce the benchmarks, but as of publication nobody has.
Pricing and access
Through z.ai's own API, GLM-5 costs $1.00 per million input tokens and $3.20 per million output, with cached input at $0.20. That's about 5x cheaper on input and nearly 8x cheaper on output than Claude Opus 4.6 ($5/$25). MIT License, so companies can fine-tune and deploy without licensing restrictions. Weights are on HuggingFace, and there's a hosted chat interface at chat.z.ai.
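The listed prices make the comparison concrete. The workload below (100M input tokens, 20M output tokens per month) is a hypothetical example, not a reported figure:

```python
# Back-of-envelope comparison using the listed per-1M-token prices.
# The 100M/20M monthly workload is a made-up example for illustration.
GLM5 = {"input": 1.00, "output": 3.20}
OPUS = {"input": 5.00, "output": 25.00}

def monthly_cost(prices, input_mtok, output_mtok):
    return prices["input"] * input_mtok + prices["output"] * output_mtok

print(monthly_cost(GLM5, 100, 20))   # 164.0
print(monthly_cost(OPUS, 100, 20))   # 1000.0
print(OPUS["input"] / GLM5["input"])    # 5.0x on input
print(OPUS["output"] / GLM5["output"])  # ~7.8x on output
```

On that hypothetical mix, the same traffic costs $164 on GLM-5 versus $1,000 on Opus, roughly the 5x-8x spread the per-token prices imply.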
It also has a native agent mode that generates .docx, .pdf, and .xlsx files directly from prompts. z.ai is targeting enterprise document automation with this. Zhipu's stock jumped 30% on the announcement, along with other Chinese AI shares.
What to watch
- Independent replications. Can other labs reproduce the benchmark numbers on their own hardware?
- Enterprise pilots. Which companies actually start testing GLM-5 for production use, and what do they find?
- Price pressure on Western labs. Does this force OpenAI, Anthropic, or Google to cut prices or speed up releases?
- The open-closed gap. GLM-5 is the closest an open-weight model has come to matching the best closed ones. Watch whether the next open release narrows it further.
GLM-5 is the strongest open-weight model out there. It doesn't match the best closed models yet, but the gap is smaller than it was six months ago, and the price is a fifth. Another jump like this and $5-per-million-token APIs start looking like a luxury.