More

greyskull · 2026-05-29T18:37:25 1780079845

> task focused small models

This is tangential: and forgive my ignorance here, but is there an inherent reason why there aren't smaller, focused models from the frontier model providers?

I'm thinking something like a software-specific subset of Opus that is the default for use in Claude Code. Smaller, cheaper to deploy and consume, maybe faster.

pavpanchekha · 2026-05-29T18:54:44 1780080884

OpenAI used to make Codex-specific models, but they stopped. What I've gathered from interviews and similar is that training two models isn't worth the (small) lift from having a coding-specific model. You're pre-training on everything anyway, and coding RL is reasonably useful for general-purpose models too.

greyskull · 2026-05-29T19:13:15 1780081995

Interesting. I'd have guessed there would be meaningful opex benefits to serving smaller models.

mediaman · 2026-05-29T23:05:08 1780095908

What I've heard is that much of the model "intelligence" is a commingled bucket: although you can specialize specific knowledge somewhat, it's hard to specialize advanced reasoning to specific domains because so much of reasoning is a generalized capability that is not unique to, say, coding.

It turns out coding has to do with a lot of the same reasoning needed in math or in legal analysis, even if the grammatical expression is different.

This is less true of lower intelligence tasks. Classification requires a lot less reasoning capacity and so can be much smaller and more specialized.

greyskull · 2026-04-21T00:00:52 1776729652

I've been using Claude Code regularly at work for several months, and I successfully used it for a small personal project (a website) not long ago. Last weekend, I explored self-hosting for the first time.

Does anyone have a similar experience of having thoroughly used CC/Codex/whatever and also have an analogous self-hosted setup that they're somewhat happy with? I'm struggling a bit.

I have 32GB of DDR5 (seems inadequate nowadays), an AMD 7800X3D, and an RTX 4090. I'm using Windows but I have WSL enabled.

I tried a few combinations of ollama, docker desktop model runner, pi-coding-agent and opencode; and for models, I think I tried a few variants each of Gemma 4, Qwen, GLM-5.1. My "baseline" RAM usage was so high from the handful of regular applications that IIRC it wasn't enough to use the best models; e.g., I couldn't run Gemma4-31B.

Things work okay in a Windows-only setup, though the agent struggled to get file paths correct. I did have some success running pi/opencode in WSL and running ollama and the model via docker desktop.

In terms of actual performance, it was painfully slow compared to the throughput I'm used to from CC, and the tooling didn't feel as good as the CC harness. Admittedly I didn't spend long enough actually using it after fiddling with setup for so long, it was at least a fun experiment.

ihowlatthemoon · 2026-04-21T04:34:52 1776746092

I run a setup similar to yours and I've had the best results with Qwen3.5 27B. Specifically the Q4_K_M variant. https://unsloth.ai/docs/models/qwen3.5

I use llama-server that comes with llama.cpp instead of using ollama. Here are the exact settings I use.

llama-server -ngl 99 -c 192072 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-27B-Q4_K_M.gguf

greyskull · 2026-04-21T05:13:43 1776748423

Thanks, I'll have to continue experimenting. I just ran this model Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL and it works, but if gemini is to be believed this is saturating too much VRAM to use for chat context.

How did you land on that model? Hard to tell if I should be a) going to 3.5, b) going to fewer parameters, c) going to a different quantization/variant.

I didn't consider those other flags either, cool.

Are you having good luck with any particular harnesses or other tooling?

ihowlatthemoon · 2026-04-21T10:36:21 1776767781

35B-A3B means it's a MoE model with 35B total parameters but with only 3B active at once. The one I use is the 27B dense model. Usually, dense models give better responses, but are slower than the MoE. With your 4090, you should be able to get about 50 tok/s with the dense model, which is more than enough for practical use.

If you want to keep using the same model, these settings worked for me.

llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

For the harness, I use pi (https://pi.dev/). And sometimes, I use the Roo Code plugin for VS Code. (https://roocode.com/)

I prefer simplicity in my tooling, so I can understand them easier. But you might have better luck with other harnesses.

martinald · 2026-04-21T00:32:57 1776731577

Try using a MoE model (like Gemma 4 26b-a4b or qwen3.6 35b-a3b) and offload the inference to CPU. If you have enough system RAM (32GB is a bit tight tbh depending on other apps) then this works really well. You may be able to offload some layers to GPU as well though I've had issues with this in MoE models and llama.cpp.

You can keep the KV cache on GPU which means it's pretty damn fast and you should be able to hold a reasonable context window size (on your GPU).

I've had really impressive results locally with this.

I'd strongly recommend cloning llama.cpp locally btw (in wsl2) and asking a frontier model in eg Claude code to set it up for you and tweak it. In my experience the apps that sit on top of llama.cpp don't expose all the options and flags and one wrong flag can mean terrible performance (eg context windows not being cached). If you compile it from source with a coding agent it can look up the actual code when things go wrong.

You should be able to get at least 20-40tok/s on that machine on Gemma 4 which is very usable, probabaly faster on qwen3.6 since it's only 3b active params.

greyskull · 2026-04-21T00:48:10 1776732490

Thanks! These things you're mentioning like "You may be able to offload some layers to GPU...", "You can keep the KV cache on GPU..." configured as part of the llama.cpp? I wouldn't know what to prompt with or how to evaluate "correctness" (outside of literally feeding your comment into claude and seeing what happens).

Aside: what is your tooling setup? Which harness you're using (if any), what's running the inference and where, what runs in WSL vs Windows, etc.

I struggle to even ask the right questions about the workflow and environment.

martinald · 2026-04-22T00:34:16 1776818056

Yes fair enough, but try feeding my comment in :). It should be enough for it to go on. Then ask it to explain the concepts I mentioned and ask it to suggest follow-up questions for you to learn more about llama.cpp/local inference!

I've had best results with opencode. Running locally w/ 64GB RAM and Radeon 9070XT (16GB). NVidia should be easier (CUDA), I'm on Linux full time now but used to use WSL2 all the time and had all this working in it.

Ey7NFZ3P0nzAe · 2026-04-21T05:42:07 1776750127

In my case, I was also running an ASR model and a TTS model so it was a bit much for my RTX 3090. I opted to offset like 5 layers to the cpu while adding a GPU-only speculative decoding with their 0.8B model.

Working well so far.

madtowneast · 2026-04-21T00:13:00 1776730380

You are experiencing the fact that you might not have enough VRAM to load the entire model at a time. You might want to try https://github.com/AlexsJones/llmfit

greyskull · 2026-04-21T00:48:32 1776732512

It's certainly part of the problem. Thanks, I'll give that a shot.

daemonologist · 2026-04-21T01:27:21 1776734841

First of all nothing you can run locally, on that machine anyways, is going to compare with Opus. (Or even recent Sonnet tbh - some small models benchmark better but fall off a bit in the real world.) This will get you close to like ~Sonnet 4 though:

Grab a recent win-vulkan-x64 build of llama.cpp here: https://github.com/ggml-org/llama.cpp/releases - llama.cpp is the engine used by Ollama and common wisdom is to just use it directly. You can try CUDA as well for a speedup but in my experience Vulkan is most likely to "just work" and is not too far behind in speed.

For best quality, download the biggest version of Qwen 3.5 27B you can fit on your 4090 while still leaving room for context and overhead: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF - I would try the UD-Q5_K_XL but you might have to drop down to Q5_K_S. For best speed, you could use Qwen 3.6 35B-A3B (bigger model but fewer parameters are active per token): https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF - probably the UD-Q4_K_S for this one.

Now you need to make sure the whole model is fitting in VRAM on the 4090 - if anything gets offloaded to system memory it's going to slow way down. You'll want to read the docs here: https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... (and probably random github issues and posts on r/localllama as well), but to get started:

  llama-server -m /path/to/above/model/here.gguf --no-mmap --fit on --fit-ctx 20000 --parallel 1

This will spit out a whole bunch of info; for now we want to look just above the dotted line for "load_tensors: offloading n/n layers to GPU" - if fewer than 100% of the layers are on GPU, inference is going to be slower and you probably want to drop down to a smaller version of the model. The "dense" 27B will be slowed more by this than the "mixture-of-experts" 35B-A3B, which has to move fewer weights per token from memory to the GPU.

Go to the printed link (localhost:8080 by default) and check that the model seems to be working normally in the default chat interface. Then, you're going to want more context space than 20k tokens, so look at your available VRAM (I think the regular Windows task manager resource monitor will show this) and incrementally increase the fit-ctx target until it's almost full. 100k context is enough for basic coding, but more like 200k would be better. Qwen's max native context length is 262,144. If you want to push this to the limit you can use `--fit-target <amount of memory in MB>` to reduce the free VRAM target to less than the default 1024 - this may slow down the rest of your system though.

Finally, start hooking up coding harnesses (llama-server is providing an OpenAI-compatible API at localhost:8080/v1/ with no password/token). Opencode seems to work pretty reliably, although there's been some controversy about telemetry and such. Zed has a nice GUI but Qwen sometimes has trouble with its tools. Frankly I haven't found an open harness I'm really happy with.

greyskull · 2026-04-21T02:49:59 1776739799

Thank you for all this, I'll give it a shot. Out of curiosity, are there any resources that sort of spell this out already? i.e., not requiring a comment like this to navigate.

> nothing you can run locally, on that machine anyways, is going to compare with Opus

Definitely not expecting that. Just wanted to find a setup that individuals were content with using a coding harness and a model that is usable locally.

What does your setup look like? Model, harness, etc.

daemonologist · 2026-04-21T22:27:59 1776810479

Not that I'm aware of. It's kind of like building a PC or a bicycle - you're putting mostly-standardized parts together rather than starting from first principles, but there are so many permutations that you can either use a single known-good configuration or immerse yourself in forums and tinker until you can fit things together yourself. Plus both the inference engines and models are of course moving really fast.

I use Opus 4.7 in Claude Code lol, plus Zed (as a text editor, not a harness). Open-weights models that I can run are for me not useful for multi-turn ("agentic") tasks. I do use Qwen 3.6 for one-off tasks like "write a function to pretty-print this weird data structure" or "explain this config file," and Gemma 4 26B for non-coding tasks like "create a timestamped table of contents from this podcast transcript."

edg5000 · 2026-04-21T13:44:54 1776779094

I asked Opus through claude code to set up the best local model fitting my hardware and that worked well for me. I could run Qwen 74B or something at .7 tok/s on my 64GB DDR5 on CPU. Pretty cool. Useful for overnight stuff. (this actually worked, it's actually usable for asking questions).

unethical_ban · 2026-04-21T05:05:35 1776747935

This is exactly what I have been looking for: Something straight to the point. Thanks a lot!

greyskull · 2026-02-17T00:12:20 1771287140

Missing "OpenAI sidesteps" from the beginning of the title article title

sorenbs · 2026-02-17T00:43:24 1771289004

Yeah. Completely changes the meaning of the article. I thought Nvidia was now competing with Cerebras. That's not the case...

jeron · 2026-02-17T00:43:58 1771289038

very excited for cerebras, hopefully nvidia/amd will have less AI sales and bring back more consumer options when they realize they have abandoned/neglected the market that made them who they are

krackers · 2026-02-17T00:51:28 1771289488

Nvidia bought groq, so they might be working on their own answer to low-latency serving. (I found this good explanation of groq compared to TPU [1])

[1] https://reddit.com/r/LocalLLaMA/comments/1pw8nfk/nvidia_acqu...

DerekL · 2026-02-18T07:22:17 1771399337

Also, the original title is only 77 characters long, so there was no reason to change it. And the title should never be changed to something that misrepresents the article.

greyskull · on Nov 21, 2024

It offers packaging for deploying to a serverless environment (e.g. Lambda) analogous to how Vercel does it.

The last question is salient, and it's possible for OpenNext to break and have to catch up to changes in Next.js, though I believe there's some more direct collaboration. I'd say that's the biggest downside - it's not guaranteed compatibility.

I did a migration recently (comments elsewhere in this post), and I don't recall the specific issue, but I _do_ recall running into at least one scenario where OpenNext had made a decision that impacted - in a way that was visible to me and undesirable - how Next.js functioned. That's not a criticism, there's tradeoffs.

CharlieDigital · on Nov 21, 2024

How would it compare to running as serverless containers (rather than ECS) like Google Cloud Run or Azure Container Apps (true scale to 0)?

It seems like using serverless containers would meet most of the same objectives so I'm not clear where the delineation is here.

greyskull · on Nov 21, 2024

OpenNext does model [0] incremental static regeneration, but beyond that I actually don't know, or at least don't recall. OpenNext doesn't do per-route lambdas like vercel does, so it's not like you get any behavior differences there.

I _think_ you can get scale to zero on Lambda by deploying a docker container, too.

[0] https://opennext.js.org/aws/v2/advanced/architecture

greyskull · on Nov 20, 2024

OpenNext is just for packaging the Next.js build artifacts. The infrastructure is defined by projects that deploy those artifacts, examples here: https://opennext.js.org/aws/get_started

Some of them are, for example, Terraform projects that list the specific infra. I have experience with the SST deployment, whose website unfortunately doesn't do a great job of listing the infra architecture.

greyskull · on Nov 20, 2024

The biggest cost for us on Vercel (several hundred dollars a month) was Image Optimization, and that was because the app was being majorly inefficient with images, in part due to some default behavior in Next.js that we found unfriendly [0], and in part due to negligence. That being said, it wasn't "cheap" by any means outside of that, still hundreds a month for something that I would not consider a high traffic application (I wish I could remember more specific numbers).

Migrating to OpenNext using SST, I think we got the bills for compute and asset serving down to like $15/day or something (granted, we spent expensive engineer time on the migration).

[0] https://nextjs.org/docs/app/api-reference/components/image#s...

dbbk · on Nov 21, 2024

There's pretty much no reason to use Vercel's image optimization, just spend 30 minutes setting up Cloudflare Images and call it a day

greyskull · on Nov 22, 2024

I agree.

syndicatedjelly · on Nov 21, 2024

That’s insanely expensive for a low traffic web app. Why should anyone use Next.js, given a choice? Are the handful of milliseconds shaved off for the end user worth the cost?

jonplackett · on Nov 21, 2024

People (me at least) us NextJS for the developer experience. It really is quite good. If you mean why use Vercel, again - great developer experience. Just expensive.

greyskull · on Nov 21, 2024

1) I don't think it's related to Next, per se, but there may be behavior I didn't build the expertise to comment on. I also know that there were major inefficiencies in the application, so, for example, our P90 latency was (imo) terrible.

2) We'd have to define what constitutes low traffic vs any other arbitrary measure, so it's moot to discuss like this; all I said it wasn't high traffic. You could run it for cheaper, but there wasn't much expertise for self-hosting, for example.

3) For all I remember it may have been half that in daily cost. In any case, miniscule compared to engineer time. What was worse was the prior decision to use serverless aurora rds, that dwarfed everything else in AWS cost - I know this is tangentially related, just saying optimizing that a bit more was not the highest priority, we could do it for cheaper.

greyskull · on Nov 20, 2024

In the company I just left, I actually went through the process two or so months ago of migrating their Vercel deployment to AWS. I evaluated several options that are listed on the website and on GitHub, and we landed on using OpenNext via SST, it was a low-pain effort, especially given the CTO's desire to also migrate off of Next.js.

As other commenters have touched on - my understanding is the purpose of OpenNext is to package the output artifacts of a Next build in a way that can be deployed to a serverless environment, analogous to how Vercel does it. The supporting projects like SST and the other links in the repo are to take those OpenNext artifacts and deploy them to infrastructure generally in an opinionated way - additionally supporting some of the "extra" features described in the repository.

The last project I was working on was to then migrate from SST to Fargate, as a persistent process (serverful?) deployment was preferable for various reasons. In that scenario, we would just be running the built in server using the Next.js standalone deployment mode (effectively a `node index.js`). We didn't need the extra functionality covered by OpenNext.

kcrwfrd_ · on Nov 21, 2024

What’s the CTO’s motivation for migrating off of Next.js? And to what?

mdhb · on Nov 21, 2024

Next is actively a bad stack run by an incredibly shady company would be a good start

arez · on Nov 21, 2024

bad stack in what way? Why is vercel shady? I can understand that the business model is questionable to lock-in people with developing a framework that runs best on their own cloud, but shady would mean fot me, that they do something illegal

kcrwfrd_ · on Nov 21, 2024

Could you substantiate that?

greyskull · on Nov 21, 2024

Didn't get far enough along to understand the motivations and considered alternatives.

pier25 · on Nov 21, 2024

> especially given the CTO's desire to also migrate off of Next.js

To Remix?

0xblinq · on Nov 24, 2024

I think remix is a lot better. But the problem is that the team behind it (same as react router) have spent the last 10 years changing their mind every Tuesday and rewriting everything and breaking every API, it’s a cat and mouse game of chasing their latest idea all the time.

Btw, remix itself is deprecated. There you go.

What would I use? For sure I’d go with a MUCH saner and stable approach: inertia.js as the glue between React and a real full stack batteries included framework such as Adonis, Laravel or Rails.

greyskull · on Nov 21, 2024

Didn't get far enough along to understand the motivations and considered alternatives.

greyskull · on Sept 23, 2024

Congratulations!

I see that the book is incomplete. I didn't know that early access for books was a thing, very neat. It might be pertinent to note in your post that it's still being written, with an estimated release window of Spring 2025.

I'm very much a "consume it when it's ready" person, so I'll keep this on my watch list.

speerer · on Sept 23, 2024

I wonder whether it's the editing which is still in progress, or also the writing? The publication date seems very close if it's still being written.

(edit-clarity)

goostavos · on Sept 23, 2024

Writing is still in progress :)

No firm date for the final publication yet.

greyskull · on Sept 18, 2024

Might be pertinent to suffix this with (2023), though I see there are still recent replies

jkaplowitz · on Sept 18, 2024

It's a still-unresolved issue as far as I know; the linked ticket was only closed last year because Gitlab has no control over it as long as they want to continue using Cloudflare. The companies which do have control over it have not fixed it so far.

greyskull · on Jan 23, 2024

It's opening in Q2. Source: I'm one of the subjects of the article :^)

blamecanada · on Jan 23, 2024

Sorry man! I'm sure you'll land on your feet!