Really appreciate the depth of this paper; it's a welcome change from the usual model announcement blog posts. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these models.
The post-training methodology (Sec 3) is what really stands out to me. The idea of creating specialized 'expert models' for reasoning, agents, and chat, and then distilling their capabilities into a final unified model is a fascinating approach. It feels like a more structured way to solve the "jack of all trades, master of none" problem that can plague generalist models. Instead of just mixing all the data, they're essentially having a generalist learn from a committee of specialists.
A couple of the findings from their RL experiments are pure gold for anyone working in this space. The counter-intuitive result that a single-stage RL process at the full 64K context length outperforms a progressive, multi-stage approach (Fig 6) is a fantastic lesson. I've seen teams assume the opposite would be true. Also, the pragmatic choice to use an XML-like template for function calls to avoid JSON escaping hell (Fig 4) may be a small but brilliant engineering decision that makes a huge difference in practice. Wrangling escaped code inside JSON turns out to be a mess.
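To make that last point concrete, here is a rough Python sketch of the escaping problem; the XML-ish tag names are only meant to illustrate the idea, not to reproduce the exact template from Fig 4.

import json

code = 'print("hello")\nif True:\n    print("escaping gets ugly fast")'

# JSON-style tool call: the code has to survive nested string escaping
as_json = json.dumps({
    "name": "write_file",
    "arguments": json.dumps({"path": "demo.py", "content": code}),
})

# XML-like template: the code passes through between tags verbatim
as_xml = (
    "<tool_call>write_file\n"
    "<arg_key>path</arg_key><arg_value>demo.py</arg_value>\n"
    "<arg_key>content</arg_key><arg_value>" + code + "</arg_value>\n"
    "</tool_call>"
)

print(as_json)  # every quote and newline is backslash-escaped, twice over
print(as_xml)   # nothing for the model to mis-escape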
The performance on SWE-bench is impressive, putting it in the same league as much larger or proprietary models. What I’d love to see, and maybe others here have thoughts, is whether this hybrid training recipe holds up outside ARC-style evals. For example, do the agentic improvements transfer to messier, real-world workflows where APIs are undocumented, partial failures are common, and user input is full of ambiguity?
algo_trader 5 hours ago [-]
Are all these "post/mid-training tweaks" important if you have a specific domain with abundant, verified, or synthesized data and labels?
Can a small team working on ASI or domain-specific models stick to scaling a 2024-era best-practices training stack, or will they miss out on massive improvements?
throwdbaaway 2 hours ago [-]
So the GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:
> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.
This matches with the calculation I did for GLM-4.5 (355B A32B):
In [14]: 356732107008 - (775946240 * 2) # token_embd / output are 775946240 each. assume omitted
Out[14]: 355180214528
In [15]: 356732107008 - 339738624000 - (775946240 * 2) # parameters that are always active
Out[15]: 15441590528
In [16]: 339738624000 * 8 / 160 # parameters from activated experts
Out[16]: 16986931200.0
Meanwhile, GPT OSS series includes both the embedding layer and the output layer when counting the total parameters, but only includes the output layer when counting the active parameters:
> We refer to the models as “120b” and “20b” for simplicity, though they technically have 116.8B and 20.9B parameters, respectively. Unembedding parameters are counted towards active, but not embeddings.
And Qwen3 series includes both the embedding layer and the output layer when counting both the total parameters and the active parameters.
Why is there no standard for counting? Which approach is more accurate?
atq2119 16 minutes ago [-]
I'd say it depends. For the total parameter count, you should just count all parameters, since that's what matters for memory requirements.
For activated parameters: All unembedding parameters are used in every inference step during token generation, but only one column of the embeddings is used (if done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
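For what it's worth, here's a small Python sketch of the three conventions side by side, reusing the GLM-4.5 numbers from the parent comment; the labels and the split into "always active" vs "routed expert" parameters follow the calculation above, so treat it as an illustration rather than an official accounting.

TOTAL          = 356_732_107_008            # every parameter in the checkpoint
EMBED          = 775_946_240                # token embedding matrix
UNEMBED        = 775_946_240                # output (unembedding) matrix
ALL_EXPERTS    = 339_738_624_000            # all 160 routed experts
ACTIVE_EXPERTS = ALL_EXPERTS * 8 // 160     # 8 experts fire per token

dense_active = TOTAL - ALL_EXPERTS - EMBED - UNEMBED  # always-on, non-embedding params

conventions = {
    "GLM-4.5 (exclude embed + unembed)": (TOTAL - EMBED - UNEMBED, dense_active + ACTIVE_EXPERTS),
    "GPT-OSS (unembed active only)":     (TOTAL, dense_active + ACTIVE_EXPERTS + UNEMBED),
    "Qwen3 (count both everywhere)":     (TOTAL, dense_active + ACTIVE_EXPERTS + EMBED + UNEMBED),
}

for name, (total, active) in conventions.items():
    print(f"{name}: total {total/1e9:.1f}B, active {active/1e9:.1f}B")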
reissbaker 10 hours ago [-]
I've been playing around with GLM-4.5 as a coding model for a while now and it's really, really good. In the coding agent I've been working on, Octofriend [1], I've sometimes had it on and confused it for Claude 4. Subjectively, my experience has been:
1. Claude is somewhat better at whole-codebase tasks, where you need to reason over a bunch of context and consider system interactions.
2. GLM-4.5 is somewhat better at being "honest" — i.e. I rarely see it doing the things Claude does like making broken tests pass by changing the test instead of fixing the bug.
Both are quite good though, and GLM-4.5 has found bugs that both Claude 4 Sonnet and 4.1 Opus have failed to catch. In general I think Claude wins a little more frequently on debugging tasks than GLM-4.5, but it's close.
Compared to GPT-5, both Claude and GLM feel like they're more consistent, although GPT-5 sometimes has long brilliant runs where it nails everything with subjectively higher code quality than either of the latter. However, once GPT-5 goes off the rails, it's hard to get it back on track, so it can be a bit frustrating to work with in comparison.
1: https://github.com/synthetic-lab/octofriend
I just read your comment and decided to give GLM-4.5 a try in Kilocode. I'd been using Gemini CLI all day to try to resolve a tricky bug in some compiler code (a compiler for a subset of C that generates microcode for... a weird architecture, I'll leave it at that). GLM-4.5 zeroed in on the problem right away, a problem that had eluded Gemini CLI all day. Gemini had been leading me on a wild goose chase, implicating a function that turned out not to be the problem and proposing all kinds of lame changes to it that never fixed anything, because the problem was never in that function.
p4coder 6 hours ago [-]
Sometimes getting a second pair of eyes to look at the problem helps, and that's usually not a judgement on the smartness of the first pair of eyes. Seems like it also applies to coding agents.
diggan 4 hours ago [-]
Indeed, I've also found that various models are good at various tasks, but I haven't yet been able to categorize "Model X is good at Y-class of bugs", so I end up using N models for a first pass of "Find the root cause of this issue", then once it's found, pass it along to the same N models to attempt to solve it.
So far, which model can find/solve what is really scattered all over the place.
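A minimal sketch of that two-pass fan-out, assuming a hypothetical call_model(model, prompt) wrapper over whichever provider SDKs you actually use:

from concurrent.futures import ThreadPoolExecutor

MODELS = ["glm-4.5", "claude-sonnet-4", "gpt-5"]  # whatever N models you're rotating through

def call_model(model: str, prompt: str) -> str:
    # Hypothetical: wire this up to your provider SDK(s) of choice
    raise NotImplementedError

def fan_out(prompt: str) -> dict:
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        return dict(zip(MODELS, pool.map(lambda m: call_model(m, prompt), MODELS)))

# Pass 1: every model hunts for the root cause
diagnoses = fan_out("Find the root cause of this issue: <bug report + logs>")

# Pick the diagnosis you find most convincing (by hand, or with a judge model),
# then pass 2: every model attempts a fix from that diagnosis
chosen = max(diagnoses.values(), key=len)  # placeholder heuristic
fixes = fan_out(f"Given this root-cause analysis, propose a fix:\n{chosen}")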
irthomasthomas 1 hours ago [-]
You are experiencing the jagged skills frontier. All models have these weird skill gaps and prompt phrasing sensitivity. This is the main problem solved by an llm-consortium. It's expensive running multiple models in parallel for the same prompt, but the time saved is worth it for gnarly problems. It fills in the gaps between models to tame the jagged frontier.
My very first use of the llm-consortium saw me feeding it its own source code to look for bugs. It surfaced a serious bug which only one out of the three models had spotted. Lots of problems are NP-ish, so parallel sampling works really well. Google's IMO gold and OpenAI's IOI gold both used parallel reasoning of some sort.
energy123 3 hours ago [-]
This is so true. Another thing: a model might be better at something in general, but worse if the context is too long. Given how GLM-4.5 was trained, on lots of short-context data, that may be the case for it.
GPT-5: Exceptional at abstract reasoning, planning and following the intention behind instructions. Concise and intentional. Not great at manipulating text or generating Python code.
Gemini 2.5 Pro: Exceptional at manipulating text and Python, not great at abstract reasoning. Verbose. Doesn't follow instructions well.
Another thing I've learned is that models work better when they work on code that they themselves generated. It's "in distribution" and more comprehensible to them.
viraptor 2 hours ago [-]
The good old regression to the mean. Testing models as the second pair of eyes only when the first fails is going to give weird results... https://www.smbc-comics.com/comic/protocol
faangguyindia 6 hours ago [-]
Gemini CLI uses the whole-file edit format and burns through tokens very fast. For this reason I use aider with the diff-fenced edit format; it uses far fewer tokens.
3abiton 4 hours ago [-]
I'm curious about your setup. Is it just Gemini CLI, or are you combining it with other frameworks?
nmfisher 4 hours ago [-]
I've had similarly good experiences with GLM-4.5 for smaller projects/requests. Unfortunately that did degrade with larger contexts, so I'm still treating it as a good fallback for Sonnet 4, rather than a full-blown replacement.
faangguyindia 6 hours ago [-]
I've been using architect mode in aider
DeepSeek R1 (does the high-level planning) combined with Qwen3 480B (does the low-level coding), or whatever is available from the Qwen Code APIs.
It's working great.
It solves 99.99% of problems on its own.
The separation isn't very good in aider, so I plan to build my own tool later to get a better workflow.
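For anyone who wants to reproduce the split, aider exposes it through --architect, --model (the planner) and --editor-model (the coder); the model identifiers below are placeholders for whatever aliases your providers expose:

aider --architect \
      --model deepseek/deepseek-reasoner \
      --editor-model openrouter/qwen/qwen3-coder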
mrklol 4 hours ago [-]
About your first point: I also feel like Claude is better when there's more in the context, whereas 4.5 gets "worse".
faangguyindia 2 hours ago [-]
Claude used to be better; not anymore, or at least the difference isn't that big.
DeepSeek R1 + Qwen3 is close enough, along with Gemini 2.5 Pro,
so I don't see any point in Claude anymore.
nico 9 hours ago [-]
How are you using GLM-4.5? Are you consuming the API or running something like GLM-4.5-Air locally?
reissbaker 9 hours ago [-]
I run a privacy-focused inference company, Synthetic [1], and I use our API of course :P I actually like GLM-4.5 enough that it's currently our default recommended model for new users. But yes, otherwise I'd use the official zai API most likely, or Fireworks. GLM-4.5-Air is quite good for a local model but GLM-4.5 is better; up to you if the tradeoff is worth it — there's definitely value in the data not ever leaving your machine, but it's not going to be as strong of a model.
1: https://synthetic.new
What makes your service especially privacy friendly?
I think if you are striving for full privacy, you should implement the secure enclave idea presented by Ollama; it makes the entire pipeline fully encrypted. I'm waiting for an actual provider to finally implement this.
https://ollama.com/blog/secureminions
We don't store prompts or completions for the API (our privacy policy says "for longer than 14 days," as mentioned elsewhere in this thread — we don't actually store them at all, but the 14 day legal guarantee is to make sure that if someone accidentally commits a log statement, we have a little bit of time to catch it and revert without being in breach of policy). And, we don't train on your data, even for messages in the UI: we only store UI messages in order to let you view your message history, not for training.
Compared to using — for example — DeepSeek from deepseek.com, I think we're much more private. Even compared to using OpenAI and opting-out of your data being used for training, we're still more private, since OpenAI makes no guarantees for individuals that they don't store the data — notably, any data ever sent to them is apparently now being shared with New York courts (and the New York Times!) due to their ongoing legal battle with the New York Times [1]. And compared to using OpenRouter with "data_collection: deny", we uh, actually work :P Surprisingly sad how many broken model implementations there are if you're just round-robin-ing between inference companies... Especially reasoning models, and especially with tool-calling.
(And if something's broken, you can email us and we'll generally fix it; OpenRouter doesn't actually host any models themselves, so there's not much they can do if one isn't working well other than just de-list.)
1: https://arstechnica.com/tech-policy/2025/07/nyt-to-start-sea...
> we will not sell [personal information], except as follows:
> - We work with business partners who support us.
Uhhm, that doesn't inspire a lot of confidence TBH!
traceroute66 1 hours ago [-]
Do you let people explicitly choose EU servers?
Also, your Privacy Policy is not currently EU GDPR compliant. ;-)
throwdbaaway 3 hours ago [-]
You support logprobs, that's wonderful! Fireworks, Synthetic, (ik_)llama.cpp, now I have a quorum.
azinman2 7 hours ago [-]
I'm curious: for a service centered around privacy, why is the data stored for 14 days at all? My understanding with Fireworks is that it's 0 logging — nothing to store. To me that's private.
reissbaker 6 hours ago [-]
Great question! We actually don't store prompts or completions for the API at all — but legally we say 14 days so that if someone accidentally commits a log statement, we're not in breach as long as we catch it quickly and revert.
mrtesthah 5 hours ago [-]
Amazing! So I’m assuming that because it’s privacy focused, you accept payment in cryptocurrencies like Monero and Zcash?
reissbaker 5 hours ago [-]
We accept USDC and USDP crypto payments via Stripe. We don't currently support Monero or Zcash — right now all our payments are via Stripe since it simplifies security + compliance for us. It would be a pretty neat feature to build though.
sagarpatil 8 hours ago [-]
Not OP. Chutes.ai charges $0.20 per 1M tokens. I don’t think it uses caching though because I ended up burning $30 in an hour or two. I had to move back to Claude Code.
esafak 8 hours ago [-]
Caching makes price comparisons hard. Does anyone have tips?
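One approach: fold the cache hit rate into an effective per-request cost. A rough sketch with made-up prices (per 1M tokens), just to show how much the hit rate dominates on long agentic prompts:

def effective_cost(n_in, n_out, hit_rate, p_in, p_cached, p_out):
    """Dollar cost of one request; the p_* prices are per 1M tokens."""
    cached = n_in * hit_rate * p_cached / 1e6
    fresh = n_in * (1 - hit_rate) * p_in / 1e6
    return cached + fresh + n_out * p_out / 1e6

# 100K-token context, 2K tokens out, illustrative prices
print(effective_cost(100_000, 2_000, hit_rate=0.9, p_in=3.00, p_cached=0.30, p_out=15.00))  # ~$0.09
print(effective_cost(100_000, 2_000, hit_rate=0.0, p_in=3.00, p_cached=0.30, p_out=15.00))  # ~$0.33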
jononor 3 hours ago [-]
Seems like we may get local, open, workstation-grade models that are useful for coding in a few years. By workstation-grade I mean a computer around 2000 USD, and by useful for coding I mean around Sonnet 4 level.
Current cloud-based models are fun and useful, but for a tool that is (or will be) so core to the developer experience, I want to be able to run it locally.
orangebread 27 minutes ago [-]
I would say by end of year, not 2 years.
cft 3 hours ago [-]
This will be essential for open source; otherwise open-source development will become unsustainable. I'm actually a little bit more optimistic: I think we will get something better than Sonnet 4 level within two years that can be run on a $2,000 machine.
rsanek 47 minutes ago [-]
What's up with the charts? In their first figure they show Sonnet 4 as having a SWE-bench score of 53; in the next figure it's close to 70. In reality it's closer to the latter. [0]
[0] https://www.swebench.com/
This feels like the first open model that doesn't require significant caveats when comparing to frontier proprietary models. The parameter efficiency alone suggests some genuine innovations in training methodology. I am keen to see some independent verification of the results, and to see how it does on Aider's LLM Leaderboard.
The sheer number of things "they observed" in this paper that could be whole papers in themselves is astounding! Lots of great stuff in here around training processes and data collection+synthesis.
Does anyone have any background information on the authors? Have they published similarly impressive works in the past?
Reubend 10 hours ago [-]
Fantastic release, and it's under the Apache license too. I'm so happy that we've got open source models pushing the envelope.
sagarpatil 8 hours ago [-]
I’ve been using it and I think it’s on par with sonnet.
lumost 11 hours ago [-]
Why was qwen3 omitted from the coding benchmark but not other benchmarks?
coder543 10 hours ago [-]
Section 4.3.2 includes Qwen3-Coder.
revskill 3 hours ago [-]
Qwen is still naive for large codebase understanding.
chvid 7 hours ago [-]
This is a great model for software development - probably the best of the freely available ones.
natrys 4 hours ago [-]
Yep, I think it's the best, period. Qwen3-Coder perhaps took the limelight, but the GLM models perform and behave better in agentic loops. I can't believe they went from a 32B, frontend-focused GLM-4 to these beasts that can challenge Claude in a matter of months.
darknoon 10 hours ago [-]
It's OK: somewhere between Qwen2.5-VL and the frontier models (o3 / Opus 4) on visual reasoning.
khalic 3 hours ago [-]
Impressive work! Can’t wait to spin it up on HF
acqbu 2 hours ago [-]
Kimi K2 has been the best so far in my experience
buyucu 7 hours ago [-]
Huge respect to the open-source culture in China. The Chinese are really leading the world in democratizing AI.
quantumHazer 2 hours ago [-]
It's open-weight, not open source though
buyucu 2 minutes ago [-]
Close enough for now. I would love if these companies also published their datasets, but publishing datasets is just asking for legal trouble these days.
OpenAI, Anthropic and others are not even publishing their weights.
sriku 2 hours ago [-]
Ok so I asked "how many bs in bluberry" and it "thought" the following -
-----
First, the question is: "How many bs in bluberry?" That seems like a typo. I think it's meant to be "blueberry," but it's spelled "bluberry" with one 'e' missing. Also, "bs" probably refers to the letter 'b'.
So, rephrasing the question: How many times does the letter 'b' appear in the word "blueberry"?
But the word given is "bluberry," which might be a misspelling of "blueberry." Let me confirm the correct spelling. The standard spelling is "blueberry," which has two 'e's: b-l-u-e-b-e-r-r-y.
The question says "bluberry," which is missing the second 'e', so it's spelled b-l-u-b-e-r-r-y. That might be intentional or a mistake. I should answer based on the word as given.
The question is: "How many bs in bluberry?" So, I need to count the 'b's in "bluberry."
Let me write it out: B-L-U-B-E-R-R-Y.
Now, I'll go through each letter:
Position 1: B (that's one b)
Position 2: L (not b)
Position 3: U (not b)
Position 4: B (that's the second b)
Position 5: E (not b)
Position 6: R (not b)
Position 7: R (not b)
Position 8: Y (not b)
So, there are two 'b's in "bluberry."
But is "bluberry" the correct word? It's probably a misspelling of "blueberry." In "blueberry," it's B-L-U-E-B-E-R-R-Y.
Position 1: B (one)
Position 2: L
Position 3: U
Position 4: E
Position 5: B (two)
Position 6: E
Position 7: R
Position 8: R
Position 9: Y
"Blueberry" has nine letters, and there are two 'b's: one at the start and one after the 'e'.
In "bluberry," as given, it's eight letters: B-L-U-B-E-R-R-Y, with two 'b's.
Perhaps the question is tricking me with the misspelling. But I should answer based on what's written.
Another thought: "bs" might be plural for 'b', meaning how many 'b's, or it could be something else. But in context, it's likely about the letter.
Maybe it's a riddle. "Bluberry" sounds like "blueberry," but spelled differently.
Perhaps "bs" refers to something else, like the fruit or something, but that seems unlikely. The question says "bs," which is probably "b's" meaning the letter b.
Let me read the question carefully: "How many bs in bluberry?"
It might be a play on words. "Bluberry" could be intended as "blueberry," and "bs" might be misinterpreted.
Another idea: in some contexts, "b" could stand for something, but here it's probably straightforward.
Perhaps it's about the sound. But the question is written, so it's about spelling.
I think I should just count the 'b's in the given word.
So, in "bluberry": letters are B, L, U, B, E, R, R, Y. So, B appears twice.
Therefore, there are two 'b's.
But to be thorough, let's consider if "bluberry" is a word. It's not standard; it's a misspelling. However, for the purpose of this question, I'll use it as is.
Perhaps the question has a typo, and it's meant to be "blueberry." But since it's written as "bluberry," I'll go with that.
Final answer should be 2.
The question says "bs," which might be lowercase, but in counting, case usually doesn't matter, and here it's all lowercase or mixed? The word is "bluberry," which is lowercase, and "bs" is lowercase, so it's fine.
In English, when counting letters, we don't distinguish case unless specified.
So, I think two is correct.
But let me double-check the spelling. The user wrote "bluberry," which has one 'u' and no 'e' after 'u', but in "blueberry," it's "blue" with 'e'.
In "bluberry," it's "blu" without 'e', then "berry."
But in terms of letter count, it's clear.
Perhaps it's "blueberry" misspelled, and I should correct it, but I think for accuracy, I'll use the given spelling.
Another thought: the question might be "how many b's in blueberry?" but written with a typo.
But since it's "bluberry," I'll answer based on that.
So, response should be: There are two 'b's in "bluberry."
To make it clear, I can say: The word "bluberry" contains the letter 'b' twice.
But the question is "how many bs," so I can just say 2.
Perhaps it's a trick with the word "berry" having one 'b', but "blu" has one, so total two.
"Blu" has one 'b', "berry" has one 'b', but in "bluberry," it's combined, so still two 'b's.
Yes.
I think I'm overthinking it.
Final answer: 2
-----
almaight 4 hours ago [-]
so sarcastic when it first came out, but now you are so polite.