GGUF vs MLX on Apple Silicon

Here is the short version. MLX is much faster at prefill, which is the model reading your prompt. It is a bit slower at decode, which is the part where it writes the answer back.

Whether that trade matters comes down to the shape of a real request. And real requests are lopsided. You put a lot in and get a little out. To edit a codebase, the model reads the whole thing and hands back a few changed lines. The customer support tools I build are even more one-sided. The model reads the full ticket, pulls in the order and tracking info, then writes back maybe five sentences. The input is doing most of the work, usually around 70%, sometimes closer to 90%.

That is what the funnel above is getting at. Most of the cost is on the way in. So the engine that reads faster is going to win most of the time, even when it writes a little slower.
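To make that concrete, here is a minimal sketch of the arithmetic, assuming a request's wall time is simply prefill time plus decode time. The function name and all four rates are hypothetical, picked only to show the shape. They are not measurements from these tests.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Wall time for one request: time to read the prompt plus time to write the answer."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s + decode_s


# Hypothetical engines (tokens per second). Illustrative only, NOT benchmark results.
fast_reader = dict(prefill_tps=1000.0, decode_tps=40.0)  # reads fast, writes slower
fast_writer = dict(prefill_tps=500.0, decode_tps=50.0)   # reads slower, writes fast

# Tiny prompt: decode dominates, so the faster writer finishes first.
small_reader = request_seconds(432, 512, **fast_reader)
small_writer = request_seconds(432, 512, **fast_writer)

# Huge prompt: prefill dominates, so the faster reader finishes first.
big_reader = request_seconds(100_000, 512, **fast_reader)
big_writer = request_seconds(100_000, 512, **fast_writer)

print(f"small prompt: reader {small_reader:.1f}s, writer {small_writer:.1f}s")
print(f"big prompt:   reader {big_reader:.1f}s, writer {big_writer:.1f}s")
```

With these made-up rates the faster reader loses on the tiny prompt and wins by a wide margin on the big one, which is the lopsided pattern described above.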

There is one more thing a single-request snapshot leaves out. In real use you rarely send one request and wait for the answer. You chain them together. The model reads, decides, calls a tool, then reads again. Each step pays that reading cost over again. So a fast prefill keeps paying off. Every loop gets the benefit.
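The loop effect can be sketched the same way. This is a rough model, assuming every step re-reads the full context with no prompt cache reused between steps, which may not match how a given runtime behaves. The function name, parameters, and numbers are all hypothetical.

```python
def loop_seconds(base_prompt, added_per_step, output_per_step, steps,
                 prefill_tps, decode_tps):
    """Total wall time for a chain of requests where the context grows each step.

    Assumption: each step pays prefill on the whole context again (no cache reuse).
    """
    total = 0.0
    context = base_prompt
    for _ in range(steps):
        total += context / prefill_tps + output_per_step / decode_tps
        # The model's own output plus the tool result join the next step's context.
        context += output_per_step + added_per_step
    return total


# Hypothetical two-step loop: 1000-token prompt, 500 tokens of tool output per step,
# 100 generated tokens per step, prefill at 1000 tok/s, decode at 100 tok/s.
two_steps = loop_seconds(1000, 500, 100, 2, prefill_tps=1000.0, decode_tps=100.0)
print(f"{two_steps:.1f}s")
```

Under this assumption the prefill term is paid once per step and grows with the context, so a prefill advantage compounds across the loop while the decode term stays the same size.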

So that is the shape of it. Below is the setup, then the four tests one by one, then what I actually reach for day to day.

The setup

Same machine, same model, two runtimes. I ran everything through LM Studio so the only thing changing is the engine underneath. The model is small enough to sit fully in the M3 Ultra's memory, so there is no offloading to disk, and I loaded one format at a time. Both get a fair shot.

Same weights, just packaged two ways. That is the whole comparison: same model, swap the engine.

Four tests, each one a different way of asking "is it fast":

Speed vs prompt size. Prompts of 432, 100k, and 200k tokens, each generating 512 tokens.
Growing-conversation taper. A 20-turn chat, with about 1k tokens of context added per turn.
Sustained load. The same prompt 30 times back to back, hunting for throttling.
Needle in a haystack. A hidden code in documents of 30k, 120k, and 200k tokens, graded on exact-match retrieval.

I throw away one warm-up run per engine, since the first one compiles MLX's Metal kernels. Every number below is from timed runs after that.
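A warm-up discard like this can be sketched in a few lines. This is not the actual harness used for these tests. `run_once` is a hypothetical callable standing in for one full request to whichever engine is loaded, and the repeat count is an arbitrary choice.

```python
import time


def timed_runs(run_once, repeats=3, warmups=1):
    """Call run_once a few times untimed, then return wall times for the timed runs."""
    for _ in range(warmups):
        run_once()  # result and timing thrown away (kernel compilation, cold caches)
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


# Usage with a stand-in workload; swap in a real request function.
calls = []
durations = timed_runs(lambda: calls.append(1), repeats=3, warmups=1)
print(len(calls), "calls made,", len(durations), "timed")
```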

Test 1: Speed vs prompt size

This is the main test, so it gets three charts: time to first token, decode speed, and total time to a finished answer. The first two point in opposite directions. The third combines them, and it is the number you actually feel.
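All three quantities can be read off a single timed token stream. Here is a minimal sketch, assuming the runtime streams tokens one at a time. `measure_stream` and its `clock` parameter are hypothetical names, and this is not necessarily how the charts were produced.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Return time to first token, decode speed, and total time for one streamed reply."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now  # prefill is over once the first token arrives
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # Decode speed counts only the tokens after the first one.
    decode_tps = (count - 1) / decode_s if decode_s > 0 else float("nan")
    return {
        "ttft_s": first - start,
        "decode_tps": decode_tps,
        "total_s": last - start,
        "tokens": count,
    }


# Usage with a fake clock so the numbers are deterministic.
ticks = iter([0.0, 2.0, 2.5, 3.0])
result = measure_stream(iter("abc"), clock=lambda: next(ticks))
print(result)
```

Here `total_s` is time to first token plus decode time, which is why the third quantity can favour one engine even when the second favours the other.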

Test 2: The growing-conversation taper

Test 1 used fixed prompts. Real chats are not like that. The context grows every turn, and decode speed slowly drops as it does. So I ran one 20-turn conversation, adding about 1k tokens a turn, and watched what happened.

Test 3: Sustained load (thermals)

Same short prompt, 30 times back to back, fresh context each time. This one is about heat. Can the machine hold its speed when you hammer it for a while?

Short answer: the M3 Ultra basically does not throttle. MLX stays dead flat: run 1 looks like run 30. GGUF drifts down about 4%. It is a real difference, but a small one. This is the one spot where MLX is just quietly better. It holds steady.
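One way to put a number on drift like this is to compare the average speed of the first few runs with the last few. That comparison and the window size are assumptions for illustration, not necessarily how the 4% above was computed, and the sample speeds are made up.

```python
from statistics import mean


def drift_percent(speeds, window=5):
    """Percent change from the mean of the first `window` runs to the last `window` runs."""
    if len(speeds) < 2 * window:
        raise ValueError("need at least 2 * window runs")
    head = mean(speeds[:window])
    tail = mean(speeds[-window:])
    return (tail - head) * 100.0 / head


# Made-up tokens-per-second series, for illustration only.
flat = drift_percent([50.0] * 30)
sagging = drift_percent([100.0] * 5 + [98.0] * 20 + [96.0] * 5)
print(f"flat: {flat:.1f}%, sagging: {sagging:.1f}%")
```

A negative value means the engine slowed down over the run, and zero means run 1 looks like run 30.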

Test 4: Needle in a haystack

Fast is useless if the long answers come out wrong. So I hid a unique code in the middle of a document at 30k, 120k, and 200k tokens, and asked each engine to find it. Graded on an exact match.

Both got all of them, which is the honest and slightly boring result. For retrieval at these lengths, the engine does not change anything. One thing I want to be upfront about: this was an easy haystack. Repetitive filler around one obvious fact. It shows neither engine falls apart at 200k. It does not show they reason well over long context. That is a harder test, with distractors and real documents, and it is its own post later. Read the green check as a good sign, not a guarantee.
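An easy haystack of this kind, repetitive filler around one fact, can be sketched as follows. The filler sentence, the code `ZX-4471`, and both function names are hypothetical. The sketch counts words as a rough stand-in for tokens and treats "exact match" as strict string equality after trimming whitespace, which is one reading of the grading rule.

```python
def build_haystack(filler_sentence, needle_sentence, target_words, depth=0.5):
    """Repeat the filler up to roughly target_words words and insert the needle at `depth`."""
    words_per_filler = len(filler_sentence.split())
    n = max(1, target_words // words_per_filler)
    sentences = [filler_sentence] * n
    sentences.insert(int(n * depth), needle_sentence)  # depth=0.5 puts it in the middle
    return " ".join(sentences)


def exact_match(answer, code):
    """Pass only if the reply is the code itself, ignoring surrounding whitespace."""
    return answer.strip() == code


code = "ZX-4471"  # hypothetical needle
doc = build_haystack("The sky was grey that day.", f"The secret code is {code}.", 600)
print(len(doc.split()), "words, needle appears", doc.count(code), "time(s)")
```

A harder version would swap the repeated filler for real documents and add near-miss distractor codes.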

One bonus from this test: it timed prefill again at these lengths and matched Test 1. The prefill lead held up across every run I threw at it.

So which one do I run?

I am running MLX. The headline that GGUF decodes faster at long context is true, but GGUF only wins on that one number, and it is the number that changes the total the least. A real request is prefill plus decode, and once the prompt gets big, the prefill is most of the wait. MLX reads a 200k prompt about 44 seconds sooner, and only gives back around 8 of those on the slower decode. So MLX hands you the finished answer first at every size I tested.
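The 200k case works out as simple arithmetic on the two rounded figures above. The variable names are just labels for this sketch.

```python
prefill_saved_s = 44  # MLX finishes reading the 200k prompt about this much sooner
decode_lost_s = 8     # and gives back about this much on the slower decode

net_lead_s = prefill_saved_s - decode_lost_s  # MLX's lead on the finished answer
share_kept = net_lead_s / prefill_saved_s     # fraction of the prefill lead that survives

print(f"net lead: about {net_lead_s} s ({share_kept:.0%} of the prefill lead kept)")
```

So at 200k, roughly four fifths of the prefill advantage survives the slower decode.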

There is a second reason, and it matters more to me. Most of what I build is agentic, which looks exactly like that funnel up top, run in a loop. I am feeding in a lot: whole reports, repos, long threads. A lot of the time I am fanning out to sub-agents that each chew through their own pile of context and hand back something small. It is mostly prefill, and the output is the easy part. That is where MLX is strongest. When I plug local models into tools like my Claude coding setup, MLX has felt clearly faster, and I do not fully know why yet. I need to test that properly. It might be more about my own workload than a universal result.

Next one I might do: the same MLX versus GGUF question, but inside real agentic loops. I have not worked out the test yet. The more I lean on these models, the more I keep running into it, so when I find a measurement worth sharing I will write it up. My hunch is that once you account for the funnel and the loop, the "GGUF is faster" line does not hold up against a real workload. We will see.