I also want to add on that I really appreciate the benchmarks.
When I was working on RAG with llama.cpp through RN early last year, I had pretty acceptable tok/sec results up through 7-8B quantized models (on phones like the S24+ and iPhone 15 Pro). MLC was definitely higher tok/sec, but it is really tough to beat the community support and availability in the GGUF ecosystem.
I get what you mean about wanting a visual app to experience yourself and be able to point others to. I recently followed this MLX tutorial for making a small model act well for home speaker automation/tool-use, which I think could potentially be used to make a good all-in-one demo: https://www.strathweb.com/2025/01/fine-tuning-phi-models-wit... (it was fast and easy to do on a MacBook Pro)
Nice to see a clear example of doing this entirely locally on a MBP. It ran >2x faster on my M2 MBP compared to the numbers they showed for an M1. Only 23/25 of the test cases passed for me on the fine-tuned model following the README 1:1, but the speedup from fine-tuned versus off-the-shelf was clear. Thanks for sharing.
Depends on what benchmarks/reports you trust, I guess (and how much hardware you have for local models, either in-person or in-cloud). https://aider.chat/docs/leaderboards/ has DeepSeek V3 scoring higher than most closed LLMs on coding (but it is a huge local model). And https://livebench.ai has QwQ scoring quite high in the reasoning category (that one is relatively easy to run locally, but it doesn't score super high in other categories).
My gut feeling is that there may be optimizations you can do for faster performance (but I could be wrong since I don't know your setup or requirements). In general, on a 4090 running between Q6-Q8 quants, my tokens/sec have been similar to what I see on cloud providers (for open/local models). The fastest local configuration I've tested is ExLlama/TabbyAPI with speculative decoding (and a quantized cache to be able to fit more context).
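For anyone unfamiliar with why speculative decoding helps, here is a toy sketch of the greedy variant. It is not how ExLlama or TabbyAPI implement it; the `target` and `draft` "models" are made-up functions that map a token list to the next token, and all names are hypothetical. The point is only that a cheap draft model proposes several tokens, the big model checks them, and the output is the same as plain greedy decoding with fewer big-model passes.

```python
from typing import Callable, List, Tuple

# A toy "model" is any function from the sequence so far to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding (simplified: no bonus token on full accept).

    Returns the decoded sequence and the number of verification passes.
    In a real engine each pass is one batched forward pass of the target;
    here it is simulated by calling `target` position by position.
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target verifies the proposal left to right.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
        seq.extend(accepted)
    return seq, passes


# Hypothetical toy models: the target counts upward mod 10,
# and the draft agrees except right after a 6.
def target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


spec_out, spec_passes = speculative_decode(target, draft, [0], 20, k=4)
plain_out = greedy_decode(target, [0], 20)
```

In this toy setup `spec_out` matches `plain_out` while needing fewer target passes than the 20 calls plain decoding makes. How much you gain in practice depends on how often the draft model agrees with the target.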
I love the idea of OpenRouter. I hadn't realized until recently, though, that you don't necessarily know what quantization a certain provider is running. And of course context size can vary widely from provider to provider for the same model. This blog post had great food for thought: https://aider.chat/2024/11/21/quantization.html
I experimented with both Exo and llama.cpp in RPC-server mode this week. Using an M3 Max and an M1 Ultra in Exo specifically, I was able to get around 13 tok/s on DeepSeek 2.5 236B (using MLX and a 4-bit quant with a very small test prompt, so maybe 140 gigs total of model+cache). It definitely took some trial and error, but the Exo community folks were super helpful/responsive with debugging/advice.
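As a rough sanity check on that ~140 gig figure, here is a back-of-envelope estimate of the weight memory alone. The 4.5 bits/weight number is an assumption on my part (4-bit formats usually store per-group scales on top of the 4 bits); the function name is made up, and cache and runtime overhead are not modeled.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9


# 236B parameters at exactly 4 bits per weight.
weights_4bit_gb = quantized_weight_gb(236e9, 4.0)

# Same model assuming ~4.5 effective bits per weight (scales included).
weights_4_5bit_gb = quantized_weight_gb(236e9, 4.5)

print(f"4.0 bits/weight: {weights_4bit_gb:.2f} GB")
print(f"4.5 bits/weight: {weights_4_5bit_gb:.2f} GB")
```

That gives roughly 118 GB to 133 GB for weights, so something around 140 gigs once cache and overhead are added seems plausible.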
The Android APK for MLC is updated frequently with recent models built in. And a Samsung S24+ can comfortably run 7-8B models at reasonable speeds (10ish tokens/sec).
As an additional data point - I've read Anathem multiple times and have loved it every time but I read The Book of the New Sun series and really disliked it (though to be fair, I only read TBotNS series once). But I know that TBotNS gets great reviews. So as always with books, your mileage may vary.
Gene Wolfe (the author of TBotNS) loves unreliable narrators. His books are almost like a game of trying to figure out how exactly he is lying to you, with only the flaws in the narration as your guide. They don't come to a satisfying conclusion where all is revealed to you; you may never even realize you've been tricked.
It's another example of friction: you have to realize the game you have entered, and be willing to figure out a novel (or in this case trilogy) sized puzzle, where the puzzle pieces are buried in perfectly ordinary prose. If it sounds like fun, and you're up for the challenge, his books can be a unique pleasure. If not, you will miss out on a lot of the more interesting things that are happening between the lines.
If you're interested in the concept I recommend Fifth Head of Cerberus as a better entry point into Wolfe's writing. It's a collection of 3 novellas set on the same planet. They are inter-related but each is basically its own story-puzzle, in a much more digestible size than New Sun.
The CapOne 2% business card is called Spark Cash and doesn't have the redemption restrictions that you're describing (I think you may be thinking of their Venture card or another personal travel card). The only two downsides to the Spark card (in comparison to the Fidelity cashback card or Citi Double Cash) are that applying for it likely dings you on all 3 major credit reports (as opposed to just one like most cards) and that it carries a $59 annual fee after the 1st year.
My startup has a Spark card, and being able to get Amazon gift cards as the 2% bonus was great, but unfortunately for some reason Amazon gift cards are no longer available as one of the gift card options. Had to start getting Home Depot and Target cards instead :/.