I run the larger version of it on a Threadripper with 512GB RAM and a 32GB GPU for the non-expert layers and context, using llama.cpp. Performs great; however, god forbid you try to get that much memory these days.
I sometimes still code with a local LLM but can't imagine doing it on a laptop. I have a server that has GPUs and runs llama.cpp behind llama-swap (letting me switch between models quickly). The best local coding setup I've been able to do so far is using Aider with gpt-oss-120b.
I guess you could get a Ryzen AI Max+ with 128GB RAM to try to do that locally, but non-Nvidia hardware is painfully slow for coding use, since the prompts get very large and prompt processing takes far longer. Then again, gpt-oss is a sparse model, so maybe it won't be that bad.
Also, just to point it out: if you use OpenRouter with things like Aider or Roo Code or whatever, you can flag your account to only use providers with a zero-data-retention policy, if you are truly concerned about anyone training on your source code. GPT-5 and Claude are infinitely better, faster and cheaper than anything I can do locally, and I have a monster setup.
gpt-oss-120b is amazing. I created a RAG agent to hold most of the GCP documentation (separate download, parsing, chunking, etc.). ChatGPT finished a 50-question quiz in 6 minutes with a score of 46/50. gpt-oss-120b took over an hour but got 47/50. All the other local LLMs I tried were small and performed way worse, like less than 50% correct.
I ran this on an i7 with 64GB of RAM and an old Nvidia card with 8GB of VRAM.
EDIT: Forgot to say what the RAG system was doing, which was answering a 50-question multiple-choice test about GCP and cloud engineering.
Yup, I agree, easily the best local model you can run today on local hardware, especially when reasoning_effort is set to "high", though "medium" does very well too.
I think people missed out on how great it was because a bunch of the runners botched their implementations at launch, and it wasn't until 2-3 weeks later that you could properly evaluate it. Once I could run the evaluations myself on my own tasks, it became evident how much better it is.
If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.
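On reasoning_effort: how you set it depends on your runner. If you're going through an OpenAI-compatible endpoint, a minimal sketch looks something like this; the base_url, model name, and whether the server forwards extra_body to the chat template are all assumptions, so check your runner's docs.

```python
# Minimal sketch (not the official way for every runner): request "high" reasoning
# effort from an OpenAI-compatible server such as llama.cpp's llama-server.
# base_url, model name, and the extra_body passthrough are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gpt-oss-120b",  # whatever name your runner exposes
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
    extra_body={"reasoning_effort": "high"},  # assumption: forwarded to the chat template
)
print(resp.choices[0].message.content)
```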
Not the parent, but a frequent user of GPT-OSS; I've tried all the different ways of running it. The choice goes something like this:
- Need batching + highest total throughput? vLLM. Complicated to deploy and install though, and you need special versions for top performance with GPT-OSS
- Easiest to manage + fast enough: llama.cpp. Easier to deploy as well (just a binary) and super fast; I'm getting ~260 tok/s on an RTX Pro 6000 for the 20B version (rough benchmark sketch below)
- Easiest for people who aren't used to running shell commands, need a GUI, or don't care much about performance: Ollama
Then if you really wanna go fast, try to get TensorRT running on your setup, and I think that's pretty much the fastest GPT-OSS can go currently.
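If you want to reproduce a rough tok/s number like the one above, here's a minimal sketch against llama-server's OpenAI-compatible endpoint; the port, model name and prompt are assumptions, and it relies on the server reporting a usage block in the response.

```python
# Rough tokens-per-second check against an OpenAI-compatible llama-server.
# Port, model name and prompt are placeholders; assumes the response includes "usage".
import time
import requests

url = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Write a short essay about GPUs."}],
    "max_tokens": 512,
}

start = time.time()
r = requests.post(url, json=payload, timeout=600)
r.raise_for_status()
elapsed = time.time() - start

completion_tokens = r.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")
```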
> I created a RAG agent to hold most of GCP documentation (separate download, parsing, chunking, etc)
If you could share the scripts you used to gather the GCP documentation, that'd be great. I've had an idea to do something like this, and the part I don't want to deal with is getting the data.
Mentions the 120B is runnable on 8GB of VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
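For illustration only: the split the quote is talking about is choosing how many layers live on the GPU versus the CPU. Here's a minimal sketch via the llama-cpp-python bindings; the article itself uses llama.cpp directly, and the model path and layer count below are placeholders.

```python
# Sketch of the GPU/CPU layer split idea, via llama-cpp-python. Placeholder path and
# numbers; with only ~8GB of VRAM you keep most layers on the CPU and offload a few.
from llama_cpp import Llama

llm = Llama(
    model_path="./gpt-oss-120b-mxfp4.gguf",  # placeholder path
    n_gpu_layers=8,   # only a handful of layers on the GPU; the rest stay on CPU
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello."}]
)
print(out["choices"][0]["message"]["content"])
```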
In many cases I have had better results with the 20B model than with the 120B model.
Mostly because it is faster and I can iterate on prompts quicker to coerce it to follow instructions.
> had better results with the 20B model than with the 120B model
The quality and accuracy of the responses from the two are vastly different though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but for even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.
But the 120B model has just as bad, if not worse, formatting issues compared to the 20B one. For simple refactorings, or chatting about possible solutions, I actually feel the 20B hallucinates less than the 120B, even if it is less competent. Might also be because the 120B doesn't like being in q8, or isn't being properly deployed.
> But the 120B model has just as bad, if not worse, formatting issues compared to the 20B one
What runtime/tools are you using? That hasn't been my experience at all, but I've also mostly used it via llama.cpp and my own "coding agent". It was slightly tricky to get the Harmony parsing in place and working correctly, but once that's in place, I haven't seen any formatting issues at all.
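To make "Harmony parsing" concrete: gpt-oss emits its output in channels (analysis, commentary, final), and the runner has to pull out the final-channel text. A toy illustration of that idea follows; the sample string is made up, and real code should use the openai-harmony library or your runner's built-in support rather than a regex.

```python
# Toy illustration of extracting the final-channel text from gpt-oss's Harmony-style
# output. The raw string is a made-up example; use proper Harmony parsing in practice.
import re

raw = (
    "<|channel|>analysis<|message|>Thinking through the request...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Here is the actual answer.<|return|>"
)

m = re.search(r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|$)", raw, re.S)
print(m.group(1) if m else raw)
```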
The 20B is definitely worse than 120B for me in every case and scenario, but it is a lot faster. Are you running the "native" MXFP4 weights or something else? That would have a drastic impact on the quality of responses you get.
Edit:
> Might also be because the 120B doesn't like being in q8
Yeah, that's definitely the issue, I wouldn't use either without letting them be MXFP4.
Hmmm...now that you say that, it might have been the 20b model.
And like a dumbass I accidentally deleted the directory, and didn't have a backup or have it under version control.
Either way, I do know for a fact that the gpt-oss-XXb model beat ChatGPT by one answer: 46/50 at 6 minutes versus 47/50 at 1+ hour. I remember because I was blown away that I could get that type of result running locally, and I had texted a friend about it.
I was really impressed, but disappointed at the huge disparity in time between the two.
I used pgvector, chunking on paragraphs. The answers I saved in a flat text file and then parsed into what I needed.
For parsing and vectorizing of the GCP docs I used a Python script. For reading each quiz question, getting a text embedding and submitting to an LLM, I used Spring AI.
It was all roll your own.
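Since the original script is gone, here's a rough reconstruction of what the parse/chunk/vectorize step tends to look like with pgvector; the embedding model, table schema, file name and connection string are all assumptions, not what I actually ran.

```python
# Rough reconstruction of "chunk on paragraphs, embed, store in pgvector".
# Embedding model, schema, file name and connection string are assumptions.
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model (384-dim)

def paragraph_chunks(text, min_len=200):
    """Split on blank lines and drop tiny fragments."""
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) >= min_len]

conn = psycopg2.connect("dbname=gcp_docs user=postgres")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id bigserial PRIMARY KEY,
        source text,
        chunk text,
        embedding vector(384)  -- must match the embedding model's dimension
    )
""")

with open("gcp_doc_page.txt") as f:  # placeholder: one downloaded GCP doc page
    for chunk in paragraph_chunks(f.read()):
        emb = model.encode(chunk)
        vec = "[" + ",".join(str(x) for x in emb) + "]"  # pgvector literal format
        cur.execute(
            "INSERT INTO doc_chunks (source, chunk, embedding) VALUES (%s, %s, %s::vector)",
            ("gcp_doc_page.txt", chunk, vec),
        )

conn.commit()
cur.close()
conn.close()
```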
But like I stated in my original post, I deleted it without a backup or VCS. It was the wrong directory that I deleted. Rookie mistake; I know better.
In a sense you can think of it that way: as Canadians, we counter-tariff the US, and that can be considered punishing ourselves. However, the US is only one country, and it encouraged more free trade with every other one of our trading partners. So in a game-theory sense it's affecting Canadian trade negatively with one country, and affecting US trade negatively with, you know, every country.
Exactly right. Trade deals are forming between countries in unprecedented ways to avoid dealing with the constantly changing tariffs, while one country says it'll take its ball and play alone.
But the US is the bigger country right next door, and also the most practical to trade with. Trading with countries further apart means less efficient transport. Is it not still self-inflicted harm?
The Econ 101 view would say yes; note that most countries haven't imposed 1:1 retaliatory tariffs.
But economic considerations are not the only ones. Opposition to the American Revolution is a fundamental theme in Canadian history. People shouldn't be surprised when Canada acts accordingly.
What options do Canadians have? Deal with the wildly capricious economic policies of the US president, or go seeking other, more stable opportunities elsewhere? Almost all countertariffs we have in place are targeted as opposed to the sweeping tariffs Trump is implementing.
They could seek other opportunities elsewhere without adding tariffs themselves: continue to import from the US and other countries like before. They may indeed export less to the US due to reduced US demand, but reciprocating the tariffs won't help with that.
It's not practical. When Trump sees a TV ad that enrages him and then cancels all negotiations, how are Canadian leaders supposed to proceed? There's no good faith whatsoever from him.
On iOS you can deny an app cellular data access which accomplishes this, as long as you don't launch it on Wifi. But yes I too wish I could deny apps internet access completely.
The commands aren't the special sauce; it's the analytical capabilities of the LLM to view the outputs of all those commands and correlate data or whatever. You could accomplish the same by prefilling a gigantic context window with all the logs, but when the commands are presented ahead of time, the LLM can "decide" which one to run based on what it needs to do.
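A minimal sketch of that "present the commands ahead of time" pattern, using OpenAI-style tool calling; the command whitelist, model name and single-shot flow are assumptions just for illustration.

```python
# Sketch: expose a whitelist of commands as tools and run only the one the model picks.
# The command list and model name are placeholders; a real agent would loop, feeding
# each command's output back to the model as a tool message.
import subprocess
from openai import OpenAI

ALLOWED = {
    "kubectl_get_pods": ["kubectl", "get", "pods", "-A"],
    "recent_syslog": ["tail", "-n", "200", "/var/log/syslog"],
}

tools = [
    {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Run `{' '.join(argv)}` and return its output.",
            "parameters": {"type": "object", "properties": {}},
        },
    }
    for name, argv in ALLOWED.items()
]

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Why might the pods be crash-looping?"}],
    tools=tools,
)

for call in resp.choices[0].message.tool_calls or []:
    argv = ALLOWED[call.function.name]  # only whitelisted commands are ever executed
    out = subprocess.run(argv, capture_output=True, text=True).stdout
    print(f"$ {' '.join(argv)}\n{out}")
```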
Electronic warfare is pretty effective against drones that use radio for their communication. Earlier in the war you could see a lot of drone footage become washed out with static as the drones got closer to tanks, so it's much more reliable to use spools of fiber.
Using Aider with o3 in architect mode, with Gemini or with Sonnet (in that order) is light years ahead of any of the IDE AI integrations. I highly recommend anyone who's interested in AI coding to use Aider with paid models. It is a night and day difference.
With aider and Gemini Pro 2.5, at least, I constantly have to fight against it to keep it focused on a small task. It keeps editing other parts of the file, making small "improvements" and "optimizations" and commenting here and there. To the point where I'm considering switching to a graphical IDE where the interface would make it easier to accept or dismiss parts of changes (per line/block, as opposed to aider's per-file and per-commit approach).
Would you mind sharing more about your workflow with aider? Have you tried the `--watch-files` option? [0] What makes the architect mode [1] way better in your experience?
I use o3 with architect mode for larger changes and refactors in a project. It seems very suited to the two-pass system where the (more expensive) "reasoning" LLM tells the secondary LLM all the changes.
For most of the day I use Gemini Pro 2.5 in non-architect mode (or Sonnet when Gemini is too slow) and never really run into the issue of it making the wrong changes.
I suspect the biggest trick I know is being completely on top of the context for the LLM. I am frequently using /reset after a change and re-adding only relevant files, or allowing it to suggest relevant files using the repo-map. After each successful change if I'm working on a different area of the app I then /reset. This also purges the current chat history so the LLM doesn't have all kinds of unrelated context.
I use Gemini in VS Code via Cline and also in Zed. I like Aider, but I'm not sure how it's "light years ahead of any of the IDE AI integrations" unless you only mean stuff like Cursor or Windsurf.
Aider has a configuration for each supported LLM that defines the best edit format for it: some models are best at diff format, Gemini is best at a fenced-diff format, Qwen3 is best at whole-file editing, etc. Aider itself examines the diff and re-runs the request when the response doesn't adhere to the corresponding diff format.
Edit: Also, the Aider leaderboards show the success rate for diff adherence separately; it's quite useful [1]
The new AMD chips in the Framework laptops would be a good candidate, and I think you can get 96GB RAM in them. Also, if the LLM software (like llama.cpp or Ollama) is idle, there is negligible extra power consumption.
This seems super amateur and your privacy policy sucks compared to OpenRouter. Also it's weird you have time to respond to trivial questions but not questions like "why would I use this over the entrenched leader in this space".
No docs or info without signing up, confusing Grok and Groq, and claiming you have access to o4 models which haven't been released makes this look like an incredibly unserious offering.
You are correct, of course, but do you think your comment would be better received if you presented it as constructive feedback? e.g.
Consider enhancing your privacy policy to match industry standards, similar to OpenRouter. Focus on addressing significant questions, like how your product stands out from established competitors. Ensure there’s no confusion between Grok and Groq. Also, verify the availability of features like access to o4 models to avoid any misunderstandings.
Who cares if it’s better received? They are lying about model access and incorrectly identifying the producers of AI models. This business seems to be made by unserious people, and sounding the alarm is the correct thing to do.
I considered being polite, but the fact that OP had already chosen not to answer these actual questions makes me question why anyone would trust this site with their payment data, or even personal data.