I've come to accept that producing code I'm truly proud of is now my hobby, not my career. The time it takes to write Good Code is unjustifiable in a business context and I can't make the case for it outside of personal projects.
Yeah I don't understand why everyone seems to have forgotten about the Gemini options. Antigravity, Jules, and Gemini CLI are as good as the alternatives but are way more cost effective. I want for nothing with my $20/mo Google AI plan.
Yeah I'm on the $20/mo Google plan and have been rate limited maybe twice in 2 months. Tried the equivalent Claude plan for a similar workload and lasted maybe 40 minutes before it asked me to upgrade to Max to continue.
> Yeah I'm on the $20/mo Google plan and have been rate limited maybe twice in 2 months. Tried the equivalent Claude plan for a similar workload and lasted maybe 40 minutes before it asked me to upgrade to Max to continue.
The TLDR: the $20-for-40-minutes cost is more reflective of what inference actually costs, including the amortised capex together with the opex.
The Long Read:
I think the reason is that Anthropic is attempting to run inference at a profit and Google isn't.
Another reason could be that they don't own their cost centers (GPUs are from Nvidia, cloud instances and data centers from AWS, etc.); they own only the model and rent everything else needed for inference, so they pay a margin on all those rented cost centers.
Google owns their entire vertical (GPUs are Google-made, cloud instances and data centers are Google-owned, etc.) and can apply vertical cost optimisations, so their final cost of inference is going to be much cheaper anyway, even if they weren't subsidising inference with profits from unrelated business units.
It's crazy that we're having such different experiences. I purchased the Google AI plan as an alternative to my ChatGPT (Codex) daily driver. I use Gemini a fair amount at work, so I thought it would be a good choice to use personally. I used it a few times but ran into limits on the first few projects I worked on. As a result I switched to Claude and, so far, I haven't hit any limits.
It's more (exactly?) like pulling a .sh file hosted on someone else's website and running it as root, except the contents of the file are generated by an LLM, no one reads them, and the owner of the website can change them without your knowledge.
For better or worse it's simply no longer possible to operate a healthcare provider organization using paper records while maintaining compliance with federal interoperability and reporting mandates. That time has passed.
This is currently negative expected value over the lifetime of any hardware you can buy today at a reasonable price, which is basically a monster Mac - or several - until Apple folds and raises prices due to RAM shortages.
$2000 will get you 30~50 tokens/s at perfectly usable quantization levels (Q4-Q5) from any of the top 5 best open-weights MoE models. That's not half bad and will only get better!
That's true if you are running lightweight models like DeepSeek 32B, but anything larger and it'll drop. Also, RAM and AI-adjacent hardware costs have risen a lot in the last month. It's definitely not $2k for a rig that does 50 tokens a second.
Could you explain how? I can't seem to figure it out.
DeepSeek-V3.2-Exp has 37B active parameters, GLM-4.7 and Kimi K2 have 32B active parameters.
Let's say we are dealing with Q4_K_S quantization for roughly half the size; we still need to move 16 GB 30 times per second, which requires a memory bandwidth of 480 GB/s, or maybe half that if speculative decoding works really well.
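The arithmetic above can be sketched quickly (the function name is mine, and this treats Q4 as exactly 4 bits/param, the same "roughly half the size" approximation used above):

```python
# Back-of-envelope: generating one token requires streaming every active
# parameter through memory once, so
#   required bandwidth ≈ active_params * bytes_per_param * tokens_per_sec

def required_bandwidth_gbps(active_params_billions, bits_per_param, tokens_per_sec):
    """Required memory bandwidth in GB/s (params given in billions)."""
    bytes_per_param = bits_per_param / 8
    # billions of params * bytes each * tokens/s -> GB/s
    return active_params_billions * bytes_per_param * tokens_per_sec

# 32B active params at ~4 bits (Q4) is ~16 GB read per token;
# at 30 tokens/s that's 480 GB/s.
print(required_bandwidth_gbps(32, 4, 30))  # → 480.0
```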
Anything GPU-based won't work at that speed, because PCIe 5 provides only 64 GB/s and $2000 cannot buy enough VRAM (~256 GB) to hold a full model.
That leaves CPU-based systems with high memory bandwidth. DDR5 would work (somewhere around 300 GB/s with 8x 4800MHz modules), but that would cost about twice as much for just the RAM alone, disregarding the rest of the system.
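For reference, the ~300 GB/s DDR5 figure follows from channels × transfer rate × bus width (assuming a standard 64-bit bus per channel):

```python
# Peak DDR5 bandwidth for an 8-channel, 4800 MT/s configuration.
channels = 8
transfers_per_sec = 4800e6   # 4800 MT/s
bytes_per_transfer = 8       # 64-bit channel = 8 bytes per transfer

bandwidth_gbps = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(bandwidth_gbps)  # → 307.2, i.e. the "somewhere around 300 GB/s" above
```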
Can you get enough memory bandwidth out of DDR4 somehow?
Look up AMD's Strix Halo mini-PC such as GMKtec's EVO-X2. I got the one with 128GB of unified RAM (~100GB VRAM) last year for 1900€ excl. VAT; it runs like a beast especially for SOTA/near-SOTA MoE models.
Look, it's either this or a dozen articles a day about Claude Code.