Ornith 9B on a 16 GB Mac mini: where a small model breaks

Ornith 1.0 just shipped. It is a family of open-source LLMs built specifically for agentic coding, the kind of work where the model calls tools, reads files, and writes them back. It comes in four sizes: 9b-dense, 31b-dense, 35b-moe, and 397b-moe. For this one I am running the smallest, the 9 billion parameter version, on a 16 GB Mac mini. A 16 GB device is one of the most accessible, affordable ways into local AI coding, and it is the machine I keep around for testing small models, because I think it is one of the best representations of what people actually have at home.

Why the 9B, and why 16 GB

The 9B is the only size that fits a 16 GB machine with real room left for context. That makes it the natural one to test here: not because it is the best Ornith, but because it is the one this hardware can actually run, and the hardware is the point.

The 9B is small on purpose. It is fast, it is light, and it runs happily on a Mac mini. I downloaded both the GGUF and the MLX builds and ran them with no trouble. The question was never whether it would load. The question was whether it could do real work once it did.

The setup

My harness is Pi Agent, running inside Cursor. Pi Agent lets you point it at any model and backend, so I wired it to LM Studio, which is where the Ornith models actually run. I loaded two sizes, the 9B on the Mac mini and the 35B on a Mac Studio, and I can switch between them with one click. Both are configured identically.

One setting matters for everything below: the request timeout. By default Pi Agent cuts a request after five minutes. I turned that off so the agent would wait as long as the model needed. None of the failures here were the agent giving up early. The agent was patient. The model was the problem.

Here is how it actually loads in LM Studio. The weights are 5.6 GB. On a 16 GB machine I budget around 12 GB for the model plus its context, and that lands the context length at roughly 186,000 tokens. That is the base, no-tricks setup:

The model actually supports the full 262,144 tokens (256K). To reach that on a 16 GB machine you turn on flash attention and quantize the KV cache to Q8, which compresses the conversation memory the model holds:

Now you get the full window, and because the Q8 cache is more compact than the default F16, total memory actually drops to about 10.29 GB. More context, less memory:

Q8 KV cache is experimental and can upset some models, so test before you rely on it. On a small model and a tight machine, though, it is the cleanest way to go from 186K to the full 256K without buying more RAM.
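
The numbers above can be sanity-checked with a little arithmetic. This is a rough sketch, not how LM Studio computes its estimate. It assumes that everything in the 12 GB budget that is not weights goes to the KV cache, and that the Q8 cache uses a Q8_0-style layout (32 one-byte values plus a two-byte scale per block). Both are assumptions, labeled in the code.

```python
# Back-of-envelope check of the memory figures quoted in this section.
# Assumption: whatever the 12 GB budget does not spend on weights is KV cache.
# Assumption: the Q8 cache uses a Q8_0-style block (32 int8 values + 2-byte scale).

WEIGHTS_GB = 5.6          # 4-bit weights, from the article
BUDGET_GB = 12.0          # model + context budget, from the article
BASE_CONTEXT = 186_000    # tokens that fit with the default F16 cache
FULL_CONTEXT = 262_144    # the model's full 256K window

F16_BYTES_PER_VALUE = 2.0
Q8_0_BYTES_PER_VALUE = 34 / 32


def kv_cache_gb(tokens: int, gb_per_token_f16: float, bytes_per_value: float) -> float:
    """Scale a measured F16 per-token cost to another cache precision."""
    return tokens * gb_per_token_f16 * (bytes_per_value / F16_BYTES_PER_VALUE)


# Per-token cache cost implied by the base setup (186K tokens in the leftover budget).
gb_per_token_f16 = (BUDGET_GB - WEIGHTS_GB) / BASE_CONTEXT

full_f16_total = WEIGHTS_GB + kv_cache_gb(FULL_CONTEXT, gb_per_token_f16, F16_BYTES_PER_VALUE)
full_q8_total = WEIGHTS_GB + kv_cache_gb(FULL_CONTEXT, gb_per_token_f16, Q8_0_BYTES_PER_VALUE)

print(f"256K with F16 cache: {full_f16_total:.1f} GB")
print(f"256K with Q8 cache:  {full_q8_total:.1f} GB")
```

Under these assumptions the full window with an F16 cache comes out around 14.6 GB, well over the budget, while the Q8 cache lands near 10.4 GB, close to the 10.29 GB LM Studio reported.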

The error

One prompt, nothing fancy:

The reasoning was good. It planned a genuinely feature-rich game (multiple tower types, enemy types, an economy, waves, upgrades), worked out a grid and a tower list, and even ran mkdir cleanly to create the folder. Everything up to the actual file was fine. Then it reached the one step that mattered, the write, and the very first call came back empty.

From there it never recovered. It kept calling the write tool, and the tool kept rejecting the call with the same error:

Then the model would notice the error, say something like “I keep getting validation errors, let me try again,” and call the tool once more. Same empty result. It did this 39 times in one session before I stopped it. The key line is “Received arguments: {}”. The model called the write tool but handed over nothing: no path, no content, an empty box.

What was actually happening

When an agent writes a file, it does not just type text. It has to package the file into a structured request. Think of it like a form with two boxes: one for the file name, one for the file content. The whole game, every line of code, has to go into that content box as one clean block of valid JSON.
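
To make the form concrete, here is a minimal sketch of what a write call and its validation might look like. The field names, the example path, and the validator are hypothetical; the real schema in Pi Agent may differ.

```python
import json

# Hypothetical schema for a file-write tool; the real harness may differ.
REQUIRED_FIELDS = ("path", "content")


def validate_write_call(arguments: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the call is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = arguments.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty field: {field}")
    return problems


# A well-formed call: the whole file travels as one JSON string, so every
# quote and newline inside it has to be escaped correctly.
source = '<script>\nconsole.log("tower defense");\n</script>\n'
good_call = json.dumps({"path": "game/index.html", "content": source})

# Decoding the call recovers the original file text exactly.
decoded = json.loads(good_call)
good_problems = validate_write_call(decoded)

# An empty arguments object, as in "Received arguments: {}".
empty_problems = validate_write_call(json.loads("{}"))

print(good_problems)
print(empty_problems)
```

The first call passes with no problems. The empty object fails on both fields, which is the shape of rejection the 9B kept running into.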

The 9B kept handing over the form with both boxes empty. So I ran three tests to figure out why.

Test 1 proves the tool and the format are fine. Test 2 isolates the failure to a big, structured output. Test 3 shows the same model family handles it the moment you give it more capacity.

The budget was never the problem

Both models had the same 32K output limit. If the limit were the bottleneck, the 35B would have struggled too. It did not. It finished using about 13K of those tokens, with 19K to spare. The 9B had the exact same room and still could not do it. The gap is capability, not space.

Would a smaller output limit help?

It sounds reasonable: force a smaller limit and the model writes smaller files. It does not work that way. An output limit is a ceiling, not a goal. The model does not read the limit and decide to write something tighter. It writes the same big thing it was always going to write, and the limit just chops it off when it runs out of room. A word-count cap on an essay works the same way: it does not make you write a leaner essay, it just cuts you off mid-sentence.

I tested it. I dropped the 9B to an 8K limit and asked for the big game again. It still failed. Then I kept the 8K limit but changed the request: I asked for a tiny file instead. It wrote it instantly, clean and valid, in 80 tokens.
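
A toy simulation shows why a cap cannot rescue a big write. Characters stand in for tokens, and the sizes and file names are illustrative, not measurements from these runs.

```python
import json


def generate_with_limit(full_output: str, max_chars: int) -> str:
    """Stand-in for a generation cap: the text is cut off, never rewritten shorter."""
    return full_output[:max_chars]


def is_valid_json(text: str) -> bool:
    """True if the text parses as complete JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Characters stand in for tokens here; the sizes are illustrative only.
big_call = json.dumps({"path": "game.html", "content": "x" * 5000})
tiny_call = json.dumps({"path": "note.txt", "content": "hello"})

LIMIT = 1000
big_ok = is_valid_json(generate_with_limit(big_call, LIMIT))
tiny_ok = is_valid_json(generate_with_limit(tiny_call, LIMIT))

print(big_ok, tiny_ok)  # False True
```

The big call is cut in the middle of its content string, so it no longer parses. The tiny call never reaches the ceiling and goes through untouched.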

A note on the quant

I ran the 4-bit GGUF build. On a 16 GB machine that is the sweet spot: small enough to leave real room for context. A higher quant is more precise, and for a small model that precision is exactly what you want for clean tool calls. But precision costs memory, and on 16 GB you pay for it twice, once in the weights and again in the context you have to give up. Here is the trade-off:
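
As a rough estimate of that cost, and not a measurement, the sketch below sizes the Q8 weights from the parameter count and compares the room left for context. It assumes exactly 9 billion parameters and a Q8_0-style 8.5 bits per weight; real GGUF files mix tensor precisions, so actual sizes will differ.

```python
PARAMS = 9e9          # "9 billion parameter version", from the article
BUDGET_GB = 12.0      # model + context budget, from the article
Q4_WEIGHTS_GB = 5.6   # reported size of the 4-bit build, from the article

# Assumption: Q8_0 packs 32 weights into 34 bytes, i.e. 8.5 bits per weight.
Q8_BITS_PER_WEIGHT = 8.5


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Estimated weight size in GB for a given average bits per weight."""
    return params * bits_per_weight / 8 / 1e9


q8_weights_gb = weights_gb(PARAMS, Q8_BITS_PER_WEIGHT)
q4_context_room = BUDGET_GB - Q4_WEIGHTS_GB
q8_context_room = BUDGET_GB - q8_weights_gb

print(f"Q8 weights (estimated): {q8_weights_gb:.1f} GB")
print(f"Room for context at 4-bit: {q4_context_room:.1f} GB")
print(f"Room for context at Q8:    {q8_context_room:.1f} GB")
```

On this estimate the Q8 weights take roughly 9.6 GB, which leaves about 2.4 GB of the budget for context instead of 6.4 GB.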

My theory going in was that a higher quant might make the tool calls crisper, since precision is exactly what a small model lacks. So before writing this off, I went and tested it.

Does a higher quant save it? No.

I downloaded the same 9B as a Q8 GGUF and ran the exact same prompt. This time it did better, and that is worth saying plainly: it actually wrote the file. No empty tool calls, no loop on the write. It delivered an HTML file and told me it was ready to play.

Then I opened it. Blank screen. The HUD was there (lives, money, score) and the three tower buttons sat at the bottom, but the play area was empty. No game.

I told it the screen was empty. Instead of fixing it, it went into a different kind of loop. It poured out wall after wall of reasoning, theorizing about what might be wrong (was it a ReferenceError, the timer, a wave-spawn condition, maybe I was opening it wrong), proposing the same kinds of fixes, and never landing one. It had no way to actually run the file and see for itself, so it just kept thinking.

And this was running on the Mac Studio, faster and with no memory constraints at all. It still could not get there. So the precision theory does not really hold. Q8 moved the failure rather than fixing it: the model could now emit a complete file, but not a working one, and it could not debug its own output once it was wrong. Even at Q8, a 9B is not the model for a build from scratch.

What about writing it in chunks?

The obvious workaround is to stop asking for one big file. In a separate test I had the 9B write the game in six smaller pieces and then stitch them together with a bash script. Each piece on its own was fine. The stitch was not.

When you concatenate six separately generated chunks, the seams are where it breaks: a comma in the wrong place, a brace that does not close, a duplicated line. The model never had the whole file in view at once, so it could not keep the structure consistent across the joins. Chunking trades one hard problem (hold it all together) for a different one (make the pieces line up), and on a small model neither comes for free.
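
One cheap guard is to check the stitched file before trusting it. The sketch below is a naive delimiter check, not the script used in that test. It ignores strings and comments, so treat it as a smoke test, and note that both chunks are made-up examples.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def unbalanced_delimiters(code: str) -> list[str]:
    """Naive bracket check: ignores strings and comments, so it is only a smoke test."""
    stack, problems = [], []
    for line_no, line in enumerate(code.splitlines(), start=1):
        for ch in line:
            if ch in PAIRS.values():
                stack.append((ch, line_no))
            elif ch in PAIRS:
                if stack and stack[-1][0] == PAIRS[ch]:
                    stack.pop()
                else:
                    problems.append(f"line {line_no}: unexpected {ch!r}")
    problems.extend(f"line {n}: unclosed {ch!r}" for ch, n in stack)
    return problems


# Two hypothetical chunks. The first looks plausible in isolation but leaves
# the function body open, expecting a later chunk to close it.
chunk_a = "function spawnWave(n) {\n  for (let i = 0; i < n; i++) {\n    addEnemy(i);\n  }\n"
chunk_b = "function update() {\n  moveEnemies();\n}\n"

stitched = chunk_a + chunk_b
problems = unbalanced_delimiters(stitched)
print(problems)  # ["line 1: unclosed '{'"]
```

A check like this will not catch a duplicated line or a misplaced comma, but it does flag the unclosed brace at the join before anyone opens the file.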

9B vs 35B: what I take from this

Writing a big file from scratch is one of the hardest things you can ask an agent to do. The model has to plan the whole thing, then produce thousands of lines without breaking the structure once. A small model loses the thread. A bigger one holds it. That is the line between these two sizes, drawn cleanly.

And if the 9B is all your hardware will run, optimize how you ask. Tell it to work in smaller pieces, request smaller files, edit instead of generate. You are not fixing the model, you are keeping it inside the size of job it can actually finish.