I ran GLM-5.2 locally on a Mac Studio

Here is the whole rig. GLM-5.2 runs entirely on the Mac Studio and is served over Tailscale to my laptop, where it drives a coding agent (Pi) inside Cursor's terminal. Nothing leaves the building, and there is no per-token bill. Getting it running was not simple. It needed a custom runtime, then a change to how much memory macOS hands the GPU, and it kept dying on a five-minute timeout. I will walk through each of those, then what the model was actually like to use.

The setup

One machine, one very large model, quantized to 4-bit. Plain version of what that means: a 4-bit quant is the same model with its weights stored at lower precision. It is not a smaller model with fewer parameters, just the same one packed down to fit on disk.

I also pulled pipenetwork/GLM-5.2-MLX-mixed-3_6bit earlier (~360 GB, experts at 3-bit, non-expert at 6-bit). Same architecture, so it ran into the same problem.

Why it won't load in LM Studio

The obvious first move is to load it in LM Studio. That does not work. LM Studio's MLX runtime (1.8.5 stable, 1.9.0 beta) cannot load this model. The problem is the architecture. The download is fine.

Same failure on both quants. This is an architecture problem, so the quant level does not change it.
llama.cpp added the arch but ignores the indexer entirely, falling back to dense attention.

The runtime that actually works

The fix is an unmerged pull request. I ran mlx-lm from PR ml-explore/mlx-lm #1410 (the glm-moe-dsa indexer-sharing branch) in a Python 3.12 venv. From there it serves cleanly, with a proxy in front to make it agent-friendly.

Concurrent streams share one read of the weights, so throughput adds up instead of queueing. That batching is what lets a multi-agent loop actually use the machine.

Performance, measured

These are measured numbers from the working runtime. Two terms first, in plain words: prefill is the model reading your prompt, decode is the model writing its answer. Decode here is slow but steady. Prefill is the part that starts to hurt once the prompt gets big.

That KV cache line is the one to watch. The KV cache is the memory the model uses to hold the conversation so far. It is not reserved up front. It grows as the session fills, and at 1M tokens it wants 88 GB you may not have free:

The crash: a context ceiling

This was the biggest thing I ran into. During a Pi coding session the context grew to about 74,500 tokens and the server fell over:

The cause is a default setting. macOS caps how much memory the GPU can use at about 75% of RAM (iogpu.wired_limit_mb = 0, which works out to ~384 GB here). The model takes 368 GB of that, leaving only about 16 GB for the GPU to work in. Prefilling 74.5K tokens needs more than 16 GB, so it runs out of memory and crashes. Here is the squeeze, drawn to scale against the full 512 GB:

Each crash also carries a tax. It wipes the KV cache, so the model has to read the whole conversation again from scratch, and re-prefilling 74.5K tokens cold takes about nine minutes at 140 tok/s. A crash loop costs more than downtime. Every restart spends nine minutes re-reading the same conversation before the model can answer, until you raise the limit.

The other failure: "terminated"

With the memory limit raised, a second error shows up. Turns die after about five minutes with error: terminated and zero output. The model is fine here. The cause is a timeout on the client side.

A 300-second HTTP timeout (Node/undici fetch defaults, headersTimeout / bodyTimeout = 300,000 ms) inherited by JS agents.
Measured: failed turns ran 300–317 sec, then died with 0 output. It fires whenever a single generation streams longer than 5 min at 12 tok/s.
OpenCode Desktop can't override it (bug #26602). Pi auto-retries, so it quietly self-heals.

Tools tested

Most of the front-ends I tried either can't load the model or can't survive the timeout. Pi in Cursor's terminal is where I landed.

Quant and model options

For the curious: the 4-bit MXFP4 run is the best-quality MLX option but the tightest on memory. There are faster, lighter paths if you'll trade quality, and a much smaller coder model that may simply be the better tool for code.

So is it any good?

I gave it the same test I gave Fable 5: build an Age-of-Empires-style game from a single prompt. Fable 5 nailed it from one sentence. GLM-5.2 local 4-bit did not, even after I re-prompted it a few times. The graphics were off and so was the gameplay.

Speed was not the dealbreaker. At ~12 tok/s it was still fast enough to get something working within about an hour. For a real-time chat agent that lag is painful; for batch and coding work it is usable. The strange part is the gap between thinking and doing.

Two honest open questions before anyone takes this as a verdict on GLM-5.2 itself:

The cloud / full-precision version may be a lot better. I have not run the same prompt against it yet. Planning to.
The harness matters. Pi may not be the right agent loop for this model; a stronger or more controlled harness might pull more out of it.
It is heavy on the machine. Loading ~368 GB really puts the Mac to work and eats most of the RAM, which is the whole reason the crash happened in the first place.