
OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...

The one I'm running is the 8.54GB file. I'm using Ollama like this:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:

    uvx --with llm-ollama \
      llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
      'a joke about a pelican and a walrus who run a tea room together'
Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

    llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' 
Now I can run "llm -m r1l" (for R1 Llama) instead.
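
For example:

    llm -m r1l 'a joke about a pelican and a walrus who run a tea room together'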

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/



I got a quantized Llama 70B model working, using most of my 64GB of RAM but it's usable:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q3_K_M
That's a 34GB download. I'm accessing it via https://github.com/open-webui/open-webui which I ran like this:

    uvx --python 3.11 open-webui serve
I have TailScale on my laptop and phone so I can run experiments directly from my phone while leaving my laptop plugged in at home.


Thanks for all your writing on these topics Simon! It has turned me from a bit of a naysayer into an optimist about this tooling, especially being able to run stuff locally with access to tools. I have an M4 Pro Mac mini arriving this week to build a similar self-hosted setup over Tailscale.


Maybe related, but I've also got a Tailscale instance running for the same use case (on an older box, but still). I've also installed open-webui attached to ollama, so the interface I deal with when on my phone is just a simple, nice-to-use webpage. You may want to look into this? Thus far it's worked very slick.


Nice, thanks for the suggestion. I got it set up just before leaving town for a few days, so I've been doing a little tinkering with it. I was hoping for a setup with LM Studio where my laptop could use the API server from the mini over the TS network. Unfortunately that doesn't seem to work, so I'll set up a configuration like you mentioned to have a global client from any device on the network.

It's very cool to have access to such a high-horsepower machine from anywhere, though. Next step is figuring out the networking to be able to access the host GPU/ollama API from pods running in a Colima VM/k3s cluster setup.


Time to put this 128GB M4 Mac to work...

    bash> ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q5_K_M
(49GB!)

(It wouldn't accept the Q_8 because it was sharded.)

Will report back lol

What app do you use on your phone to access your home LLM servers over tailscale?


The tailscale app on iOS can turn on a VPN on the phone to connect into your tailnet. Once this is enabled, you can then access the tailnet IPs from anything, e.g., in your browser to hit a web ui. You can then take it further with local DNS and a reverse proxy to access your home server services at a URL.
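
For the reverse proxy piece, something like Caddy works (a rough sketch; the hostname is hypothetical and assumes Open WebUI is listening on port 8080):

    # hypothetical hostname - point it at whatever your local DNS resolves to this machine
    caddy reverse-proxy --from webui.home.lan --to localhost:8080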


I'm running the Open WebUI Python web server app and accessing it on my phone via Mobile Safari.


I use the GPTMobile app on Android to access ollama over Tailscale.

https://github.com/chocydar/GPTMobile


Not to speak for him, but I’d suspect Simon would be using a terminal emulator. He often mentions preferring to use his own command line tool, named LLM, as it allows one to log everything to an SQLite database for later exploration.
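
From what I remember of the docs, that log is just a SQLite file you can poke at directly:

    llm logs path     # print where the logs.db SQLite file lives
    llm logs -n 3     # show the three most recent prompts and responses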


Hah, I probably have a similar tool just called "ask" which logs to a JSON log instead of a sqlite db


    bash> ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q5_K_M
    pulling manifest
    ...
    success
    >>> Tell me a "what's the difference between" joke about techbros and another funny comparison of your choosing
    <think>
    Okay, so I need to come up with a "What's the difference between" joke comparing techbros and another group. Hmm. Let me think.

    First, what defines a techbro? They're usually associated with startups, tech gadgets, maybe working in Silicon Valley, into things like crypto, NFTs, and they often have a certain lingo or way of
    dressing—like hoodies, jeans, maybe beanies. They might be seen as overly optimistic about their ideas changing the world but sometimes disconnected from practical realities.

    Now, I need another group to compare them with. Maybe something else in tech but different enough to create humor. Let's think: developers? No, too similar. Maybe hipsters? Or perhaps venture capitalists?
    Wait, what about "developers who actually ship code"? That could work because it plays on the idea that while techbros talk a lot about their ideas, developers are more focused on delivering actual
    products.

    So the joke structure is usually two lines: one for each group, highlighting a key difference. It needs to be punchy and capture the essence of both groups in a humorous way.

    Let me draft it:

    "What's the difference between a techbro and a developer who actually ships code?"

    "Techbros are still explaining their MVP to you three years later, while developers just sent you a pull request titled 'it works now'."

    Wait, does that make sense? Let me break it down. A Minimum Viable Product (MVP) is something techbros often talk about but might take forever to actually build or ship. On the other hand, developers who
    are focused on shipping code would just send a pull request indicating they've fixed something and it's done now.

    I think that captures the difference between all talk and actual action. It's relatable for people in the tech industry who know the dynamics between idea people and implementers.

    Is there another angle? Maybe comparing them to something else, like entrepreneurs or investors, but I think the developer angle is solid because it directly contrasts talking about ideas versus executing
    them.

    So the final answer would be that joke. It highlights the techbro's tendency to over-explain without delivery, while developers focus on getting things done.
    </think>

    **What's the difference between a techbro and a developer who actually ships code?**

    *Techbros are still explaining their MVP to you three years later, while developers just sent you a pull request titled "it works now."*

    This joke humorously contrasts the tendency of techbros to focus on ideas and discussions with the action-oriented approach of developers who deliver tangible results.


IMO passable joke, not unlike modern content; if I saw it in the wild I’d laugh, feel called out and scroll down without thinking that AI may have written this.


Funny - too much musing about the joke, but then it actually ships it. Must be from an AI; sort of an unfunny valley.


The part between the "think" tags is (I guess) what ChatGPT-o1 hides under vague ux updates like "thinking about the problem". Perhaps the UI over ollama should hide them too.

Personally, I like to see the thought processes.


> Let's think: developers? No, too similar

Ouch


On Android use Termux.

It's a terminal emulator


> Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating

Very funny. How does the 70B model do in comparison?


I've previously had a 70B Q2 wedged into a 4090. Not much space for context but it runs...


I set up a cloudflare tunnel recently so I could also use my local LLMs remotely! It’s nice.


Yes, I did the same but with Docker serving open-webui. It's definitely not fast. Took 2 minutes to answer "what is the highest peak in California," but it eventually got there. OpenAI o1 took 6 seconds.

Pasting the logic for anyone interested:

Thought for 102 seconds:

> Okay, so I need to figure out what the tallest peak in California is. Hmm, I remember that the highest mountain in the contiguous United States is Mount Whitney, but wait, isn't that in California? Let me think. Yes, I'm pretty sure it's in California. But maybe there are other tall peaks too.

> Wait, no, Mount Whitney is definitely the tallest. It's over 14,000 feet. But sometimes people might confuse it with other mountains like Shasta or something else. Oh, wait, isn't Mount Shasta also a volcano? I think that's in California too but it's not as tall as Whitney.

> So, putting it together, Mount Whitney is the tallest peak in California at 14,505 feet. It's located in the Sierra Nevada range. People hike there a lot because it's such a high point. But I should double-check to make sure I'm not mixing up anything else. Yeah, that seems right.

> I wonder if there are any taller peaks elsewhere. Oh, Denali in Alaska is higher, but that's not in California. So within the state, Whitney is the highest. Got it.

> So the answer is Mount Whitney at 14,505 feet above sea level. It's a popular spot for hikers and climbers. Makes sense because it's such an iconic peak in the Sierra Nevada mountains.

> I think I've got all that right. No other peaks in California are taller than Whitney. So yeah, Mount Whitney is the tallest.

> Just to make sure, sometimes there are different names or maybe alternate names for peaks, but Whitney is pretty straightforward. Yeah, it's definitely the one.

> Okay, so I'm confident now. The answer is Mount Whitney.

> *Final Answer*

> The tallest peak in California is Mount Whitney, with an elevation of 14,505 feet above sea level.

> \boxed{Mount Whitney}

To determine the tallest peak in California, we consider notable mountains such as Mount Whitney and Mount Shasta. Mount Whitney is recognized as the highest peak in the contiguous United States, standing at 14,505 feet. It is located in the Sierra Nevada mountain range and is a popular destination for hikers. Mount Shasta, although also a significant peak, does not surpass Whitney's elevation. Therefore, the tallest peak in California is confirmed to be Mount Whitney.

*Final Answer* The tallest peak in California is \boxed{Mount Whitney}.


> Wait, maybe the punchline is something like: "We don’t have any fish in the tea, but we do have a lot of krill."

Shucks, it was so close to coming up with a good punchline it could work back from.

I'm thinking set it in a single-cell comic. A downtrodden young man or woman sitting alone at a table, a pelican in the background clearly making drinks in its voluminous beak, and the walrus waiter places a cup in front of the person, consolingly saying "there's plenty of fish in the tea".


I chuckled, that's a very clean punchline. Nicely done.

These days I sometimes feel as if my skills as an iOS dev are losing market value. Maybe it's time I get back to writing standup :)



Haha I was secretly hoping someone would, thanks! Is this your artwork or generated? Either way I'm impressed!


Answering my own question - I think generated, because (took me WAY too long to notice) the dude has three arms.


lmao, yes it's AI generated - a custom FLUX model trained on a specific meme style, but I can actually use it for any drawings

there was also a slight bit of editing-by-curating, as I chose one of the 4 outputs I got

it being generated is more obvious on the others (i think i missed that third arm too lmao)

https://replicate.delivery/xezq/NkTMmG5ZLrIpLhr38zNR6RMeN4C2...

https://replicate.delivery/xezq/lzZSIkCGnXKFI90Rw3ANFwOvrZBn...

noticed a typo in my first prompt, so I had to rerun without it

https://replicate.delivery/xezq/7MU7JhaNyiZlN18gXZrfPkn9ZMgG...

https://replicate.delivery/xezq/d5ELWfhSNk2REqtFe6fn1ouUIf5n...

https://replicate.delivery/xezq/YQwQ6tMLQKZ2LlwkXdBu1OUR9Yue...

https://replicate.delivery/xezq/VkfUjywopD0SEim4uERM22oetff3...


"Meanwhile, the pelican tries to help by filtering it with its webbed feet."

I think it doesn't quite get what webbed feet mean.


It is a weird term. Why not paddle feet? "Web" for a membrane makes little sense.


"but we do have a lot of time to krill" would have worked better. I saw it instantly from your post - not sure if it's funny or not, though.


While not world class, both of those are instantly better than the bag ‘o tokens served by the LLM.

Honestly it’s a bit like the porn definition. We know intelligence when we see it, even if it’s hard to define. Adding more tokens doesn’t magically phase-change into real smarts. We’ll get there, but not today.


And yet none of it would have ever been experienced without the guidance toward this thinking by an LLM... ;)

Maybe future LLMs will be their most pro-social by playing stupid, and letting us feel like we came up with all the good ideas they gracelessly-but-gracefully point our attention toward (not saying that's happening here, but maybe it's where we should make it stop)


I think the problem is that humor isn't about reasoning and logic, but almost the reverse - it's about punchlines that surprise us (i.e. not what one would logically anticipate) and perhaps shock us by breaking taboos.

Even masters of humor like Seinfeld, with great intuition for what might work, still need to test new material in front of a live audience to see whether it actually does get a laugh or not.


Good point, but this is an 8GB model. That's the problem. This is a spectacular answer given that constraint. Also it's hilarious how bad the final answer is.


Yes - impressive how good the small models are getting, and this "reasoning distillation" seems to have given them a significant boost.

Even though humor is largely about the unanticipated punchline, I'd have guessed (maybe wrongly) that there'd be enough analytical discussion of humor in the training set for a reasoning model to come up with a much more plausible attempt at a formulaic type of joke.

From the example given it seems there's too much "thought" put into "what do I have to work with here", and not enough into conceiving/selecting a template for the joke. Maybe part of the problem is that the LLM doesn't realize that, being an LLM, its best chance at being funny to a human is to closely stick to a formula that humans find funny, and not try to be too smart in deconstructing it.


small models can be funny (12b model):

tell me funny story about techbro and real developer who actually ships code

Sure, here's a light-hearted story for you:

---

Once upon a time in Silicon Valley, there was a shiny new startup called "TechBro Inc." It was filled with eager young techbros, all wearing their "10x Developer" t-shirts, and sipping on their $8 cold-pressed juices. Among them was Chad, the self-proclaimed "Tech Bro King."

Chad was always posting his impressive code snippets on his social media profiles, bragging about how he could "crush" any coding challenge in no time. He was the epitome of a techbro, all talk and no walk.

Meanwhile, in the corner, there was Alex, the quiet, unassuming developer who actually shipped code. Alex didn't have flashy t-shirts or a massive online following, but the company's products worked, and that was thanks to Alex.

One day, TechBro Inc. decided to host a "Hackathon" to impress their investors. Chad saw this as his opportunity to shine. He spent the entire day tweeting about his progress, posting screenshots of his code, and even live-streaming his "epic coding session."

Alex, on the other hand, quietly sat at his desk, coding away. He didn't take any breaks for photoshoots or live streams. He was there to work, not to perform.

As the hackathon came to a close, Chad confidently declared, "I've just finished the most amazing feature ever! It's going to revolutionize our app!" The room erupted in applause, and Chad basked in the glory.

The next day, the team gathered to demo their new features. Chad confidently presented his "revolutionary" feature. He clicked the button to launch it, but nothing happened. He tried again, and again, but the app remained unchanged. Chad's face turned red, and sweat dripped down his forehead.

It was Alex's turn. He quietly stepped up, clicked a button, and suddenly, the app was filled with new, useful features. The team gasped in awe. The investors were impressed. The app actually worked, and that was thanks to Alex.

Chad, trying to save face, turned to Alex and said, "Well, Alex, it seems you've just... shipped some code. Not bad for a... real developer."

The room burst into laughter. From that day forward, Chad learned the difference between talking about coding and actually doing it. And Alex? He just kept shipping code, one line at a time.

---


That is beyond terrible. Are you joking? I feel sad that I spent a minute reading that.


Yeah, getting LLMs to invent jokes is an incredibly unfair test of them - but I derive enormous entertainment from seeing them try.


LLMs output can often be like a prism in front of a mirror - a fusion of humanity with the specific human, reflected back at the human.

Simon, perhaps you're just not funny.


How Nietzschean


I just tried Claude Sonnet with the Pelican & Walrus setup, but asked it for something in style of Norm Macdonald's humor, which would seem a potentially good fit for this type of idea. It got the idea of a rambling story (cf Macdonald's moth joke) that mostly ignored them being a Pelican and Walrus, which seemed promising, but still failed to deliver. I'm guessing with some more guidance and iteration it could have come up with something.


I've been trying out "write and perform an MST3K episode reviewing your previous answer" as a prompt. Have yet to get a funny answer from any model.

Older base models produce some great stuff though, at least if you're me and think RoboRosewater and dril_gpt2 are funny.


I've found that the weaker the model, the more of an unhinged, LSD-trip response you get. I made a podcast years ago (https://deepdreams.stavros.io) and GPT-2 was definitely the more hilarious of the models.


> Have yet to get a funny answer from any model.

Just like the real thing! Promising results!


I'd love to see a standup competition of trying to make AI jokes/sets funny.


All this reminds me of another genius comedy duo, David Lynch and Louis CK: https://m.youtube.com/watch?v=HlEJbs02wAM

Come on, LLM, be funny! 3 2 1 go!


> the joke itself is rubbish but the "thinking" section is fascinating:

This is gold. If I was a writer, I'd wring value from that entire thinking-out-loud section and toss the actual punchline.

This is weirdly reminiscent of co-programming with CodyAI. It gives me a lot of good 'raw material' and I'm left integrating the last mile stuff.


I like to imagine the entire transcript as a Norm Macdonald bit, a la his moth joke


Looking at how it thinks can help bust you out when you write yourself into a corner.


Thanks! Playing around with this vs the https://ollama.com/tripplyons/r1-distill-qwen-7b variant and find the 7B to be somewhat of a sweet spot: it gets to the point with minimal (or at least less) waffle.

Certainly interesting reading their thought processes; the value in that might be greater than the answer itself, depending on use case.


This joke is so terrible, I think this might end up being how AI kills us all when it decides it needs us out of the way to make more paperclips.


Wait until you see what I got when prompting it for a joke about three priests:

> Why did the rabbi become a boxer?

> Because he couldn't handle the holy punches!

> But wait, that's not quite right. Let me try again.

> Here’s one: What do you call a Jewish, Orthodox, and Catholic priest working together?

> A trio of men who probably shouldn’t be telling jokes after their day job!

> Sorry for the confusion earlier—I hope this one makes more sense!

It did not.


That second one isn’t funny, but it’s still interesting somehow.

Actually, it would work as a joke, but it would have to be told by an actual priest hanging out after work with two other actual priests. Ultra-niche humor.


> What do you call a Jewish, Orthodox, and Catholic priest working together?

A miracle.


Yeesh, that shows a pretty comprehensive dearth of humour in the model. It did a decent examination of characteristics that might form the components of a joke, but completely failed to actually construct one.

I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.


Yeah it's very interesting... It appears to lead itself astray: the way it looks at several situational characteristics, gives each a "throw-away" example, only to then mush all those examples together into a joke, seems to be its downfall in this particular case.

Also I can't help but think that if it had written out a few example jokes about animals rather than simply "thinking" about jokes, it might have come up with something better.


Why shouldn’t I assume that the “thinking” is just the usual LLM regurgitation of “how would a human coming up with a joke explain their reasoning?” or something like that, and zero “thinking”?


Given how it is trained specifically (they didn't encourage it to think, they allowed it to) there was a lot of emergent behavior as it trained.

Sort of like chess engines rediscovering classic (named) chess openings. See section 2.2.3 for the training template (it's a single paragraph I can't reproduce here because I'm on my phone)

Example emergent behavior (section 2.2.4 page 8): the model learns to solve more complex problems by spending more time reasoning. It also naturally develops reflection (what have I tried?) and exploration strategies.

Fundamentally, you should think of this as a nn that learned to solve real problems by reasoning about them in written language.

(My favorite part: it defaulted to reasoning in multiple languages. They constrained it to only reason in a single language and this negatively impacted performance! But the hypothesis is that it improves interpretability)


Ever read philosophy? An acquaintance can and will readily mix and match 3 languages to obtain more precision.


You shouldn't assume that because you shouldn't assume the converse either.

The LLM will use the chain of thought, as it's being built out like a garden path, to explore different completion possibilities. What it completes afterwards is not necessarily logically related, but it's definitely influenced.

The search may not be valuable as an artifact in itself. It's likely to be logically unsound especially in parts. But the end result may be useful.


What’s the difference?


At least to me, the difference is that one is ostensibly an explanation of how the AI arrived at the joke, the other is a post-hoc explanation of the joke.


You can be pretty sure the AI isn't doing a post-hoc explanation because the only writable memory it has access to is the tokens it has already output - i.e. the explanation of the joke. Everything else is reset between every token.

As long as it comes up with different jokes each time you ask it (assuming randomness in sampling), how could it?


The problem is it can’t remember what it hasn’t written, but the end result still makes sense, so there has to be some goal after parsing the initial context that the tokens are emitted towards. This means there’s nothing stopping it from producing an explanation; it might be in there from the very start.


Not one goal though, but a set of goals. Otherwise the random sampling of tokens wouldn't result in it outputting meaningfully different jokes.

I also think it's safe to assume that the set of goals isn't fully resolved, but it's harder to "prove" that.


There's no goal. The tokens are computed one at a time from all the previous tokens.


One is orthogonal to the other.


Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth and using the one that's on ollama, i.e. `ollama run deepseek-r1:8b`?


The weights are quantized down to fewer bits in order to save on memory. The quantization loss is going to result in worse generations.
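
Rough back-of-envelope numbers for an 8B-parameter model (weights only, ignoring context/KV cache; exact GGUF sizes vary a little by quant type):

    FP16   : 8B params x 2 bytes     ~= 16 GB
    Q8_0   : 8B params x ~1.06 bytes ~= 8.5 GB   (the 8.54GB file mentioned upthread)
    Q4_K_M : 8B params x ~0.6 bytes  ~= 4.9 GB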


Ollama serves multiple versions, you can get Q8_0 from it too:

    ollama run deepseek-r1:8b-llama-distill-q8_0

The real value from the unsloth ones is that they were uploaded before R1 appeared on Ollama's model list.


Unsloth also works very diligently to find and fix tokenizer issues and many other problems as soon as they can. I have comparatively little trust in ollama following up and updating everything in a timely manner. Last I checked, there is little information on when the GGUFs etc. on ollama were updated or what llama.cpp version / git commit they used for them. As such, I believe quality can vary and be significantly lower with the ollama versions of new models.


They are probably the same model. Unsloth does model quants and provides them to the community; AFAIK ollama doesn't - they just index publicly available models, whether full or quantized, for convenient use in their frontend.


FWIW, you can also try all of the distills out in BF16 on https://glhf.chat (either in the UI or via the API), including the 70b. Personally I've been most impressed with the Qwen 32b distill.

(Disclosure: I'm the cofounder)


(And here's the output of the "pelican and the walrus in a tearoom" joke from DeepSeek's distilled 70b... Mildly better?)

<think> Alright, I'm supposed to come up with a joke about a pelican and a walrus in a tea room. Hmm, sounds like a classic setup for a funny situation. Let me think about the characteristics of these animals that could be used humorously. Okay, so pelicans are known for their large beaks, right? They can carry a lot of water in them. Walruses have those big tusks and blubbery bodies. Maybe I can play on the idea of them trying to have a tea party but their physical traits causing some trouble.

Wait, being in a tea room, they might be trying to sip tea, but their size or features get in the way. The pelican's beak could be problematic with the cup. And the walrus's size might knock things over or make a mess.

Let me consider the flow. Maybe the pelican struggles with the teacup because of his beak, and the walrus has a comment about it. Then the walrus tries to sip but ends up splashing tea everywhere. That could lead to a funny exchange.

Putting it together: Pelican says something self-deprecating about his beak not being suited for tea. Then the walrus, who perhaps agrees, tries to sip and splashes tea, making them both laugh. That feels natural and funny.

I should make sure the punchline is clear and the dialogue flows smoothly. Maybe start with the pelican struggling, then the walrus makes a comment, and their interaction leads to the humorous conclusion. Yeah, that should work. </think>

Why did the pelican and the walrus get kicked out of the tea room?

Because the pelican kept dipping his beak in the wrong cup, and the walrus kept breaking the delicate china with his tusks. They just couldn’t “sea” themselves fitting in!


Apparently I'm a sucker for bad jokez...

It wasn't that good but the punchline actually got me a giggle...


Might want to let people know that you have to pay to try things ahead of time. It's not obvious when on the launch page.


The launch page has a pretty prominent pricing link — what would make it more obvious?


Something about pricing above the fold: https://ibb.co/rFR41rK


Good point! Thank you for the screenshot :)


What's your sense of how useful local LLMs are for things other than ... writing blog posts about experimenting with local LLMs? :-)

(This is a serious question, not poking fun; I am actually curious about this.)


Six months ago I had almost given up on local LLMs - they were fun to try but they were so much less useful than Sonnet 3.5 / GPT-4o that it was hard to justify using them.

That's changed in the past two months. Llama 3 70B, Qwen 32B and now these R1 models are really impressive, to the point that I'm considering trying to get real work done with them.

The catch is RAM: I have 64GB, but loading up a current GPT-4 class model uses up around 40GB of that - which doesn't leave much for me to run Firefox and VS Code.

So I'm still not likely to use them on a daily basis - but it does make me wonder if I should keep this laptop around as a dedicated server next time I upgrade.


Thanks!

One reason why I'm asking is that I'm in the market for a new laptop and am wondering whether it's worth spending more for the possible benefits of being able to run ~30-40GB local LLMs.

Unfortunately it doesn't look as if the answer is either "ha ha, obviously not" or "yes, obviously". (If the question were only about models available right now I think the answer would be no, but it seems like they're close enough to being useful that I'm reluctant to bet on them not being clearly genuinely useful a year from now.)


Yeah, it's not an obvious answer at all. Spending ~$3,000+ on a laptop to run local models is only economically sensible if you are VERY paranoid about using APIs (there are plenty of API providers that I personally trust not to train on my data) - otherwise that $3,000 will buy you many years worth of access to the best available models via API.


Well, I unfortunately have expensive tastes in laptops anyway, so the delta is substantially less than $3k, and it's possible that from time to time I'll run across other things that benefit from the fancier machine, and if I don't get a 64GB Mac one of the other possibilities is a 48GB Mac which would still be able to run some local LLMs. But, all that said, it's still potentially a sizable chunk of money for a dubious benefit.

I've been assuming that privacy isn't the only benefit of local; it seems like a local model would offer more flexibility for fine-tuning, RAG, etc., though I am completely ignorant of e.g. what size of model it's actually feasible to do any useful fine-tuning to on given hardware.


If you are worried about security or IP at all, it's preferable to run locally, or spin up your own box that you can query running one of these models.


I understand the reasons for not wanting to use a remote LLM. My question was about how useful local LLMs are. It might turn out that for some people remote LLMs are unacceptable for privacy reasons and local LLMs are unacceptable because they aren't good enough to be useful.


Did you try the universal LLM cheat code as a followup prompt?

"Make it better"


These "thinking" LLMs are essentially productizing that trick. All the rambling and second guessing they go through at the beginning gives them source material to compute the "better" answer from at the end.


Sure, but you can still re-prompt them again telling them to just do better.

In case people missed it, I'm referencing an observation recently made by 'minimaxir, described here:

https://minimaxir.com/2025/01/write-better-code/

As it turns out, you can improve the quality of code generated by some LLMs by repeatedly responding to it with just three words: "write better code".

While Max only tested this on Claude 3.5 Sonnet, I see no reason why this wouldn't work with the "thinking" models either. Even if it doesn't the results might still be interesting. With that in mind, here's the article's experiment applied to o1-preview:

https://cloud.typingmind.com/share/69e62483-45a4-4378-9915-6...

Eyeballing the output, it seems to align with the article's observation.

(o1-preview is the only "thinking" model I currently have API access to; official ChatGPT app doesn't let me reply to o1 family, forcing further interactions to be with "non-thinking" models instead.)
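
For the local models discussed upthread, the same loop is easy to run with the llm CLI (a sketch; -c continues the most recent conversation, and r1l is the alias set earlier):

    llm -m r1l 'write a python function that deduplicates a list'
    llm -c 'write better code'
    llm -c 'write better code'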


> The walrus might say something like, "We have the biggest catch in town," while the pelican adds a line about not catching any fish recently.

It should've stopped there :D


Tried exactly the same model, and unfortunately the reasoning is just useless - it is still not able to tell how many r's are in strawberry.


That's a tokenizer issue though?


Not 100% so for chain-of-thought models - they should recognize that they need to spell the word out letter by letter in some separated form and then count the letters in that form. The Qwen distill seems to do exactly this really well:

> Step-by-step explanation:

> 1. Break down each word: "not", "really", "a", "tokenizer", "issue".

> 2. Count 'e's in each word:

> - "not": 0

> - "really": 1

> - "a": 0

> - "tokenizer": 2

> - "issue": 1

> 3. Sum the counts: 0 + 1 + 0 + 2 + 1 = 4.

>

> Answer: There are 4 E's in the phrase.

In the thought portion it broke the words up every which way you could think to check, then validated the total by listing the letters in a numbered list by index and comparing that count against the sums from counting each word.
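
For reference, the character-level count it's being asked to reproduce is easy to check outside the tokenizer:

    echo -n "not really a tokenizer issue" | grep -o "e" | wc -l
    # 4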


But the only way to do this is if it is trained on how to map word tokens to character tokens, i.e.

    Hello -> h e l l o
    66547 -> 12 66 88 88 3

Or maybe it memorized that hello has a single e.

Either way, this seems to be an edge case that may or may not exist in the training data, but it seems orthogonal to 'reasoning'.

A better test would be how it performs if you give it the spelling mappings for each word in the context.


"Be trained how to map" implies someone is feeding in a list of every token and what the letters for that token are as training data and then training that. More realistically, this just happens automatically during training as the model figures out what splits work with which tokens because that answer was right when it came across a spelling example or question. The "reasoning" portion comes into play by its ability to judge whether what it's doing is working rather than go with the first guess. E.g. feeding "zygomaticomaxillary" and asking for the count of 'a's gives a CoT

> <comes to an initial guess>

> Wait, is that correct? Let me double-check because sometimes I might miscount or miss letters.

> Maybe I should just go through each letter one by one. Let's write the word out in order:

> <writes one letter per line with a conclusion for each>

> *Answer:* There are 3 "a"s in "zygomaticomaxillary."

It's not the only way to judge a model, but there are more ways to accurately answer this problem than "hardcode the tokenizer data into the training", and heavily trained CoT models should be expected to hit on at least several of these other ways - otherwise it's suspect that they miss similar types of things elsewhere.


The important thing is, when we're all replaced by robots, deep down we will know we are superior because we can count letters in strawberry.


Super interesting. It seems to get hung up on a few core concepts, like the size of the walrus vs. the limited utility of a pelican beak.


This is probably pretty similar to my inner monologue as I would try and inevitably fail to come up with a good joke.


Have you had a chance to compare performance and results between the Qwen-7B and Llama-8B versions?


“I never really had a childhood”, said Walrus, blowing on his tea with a feigned sigh. “Why’s that?” asked Pelican, refilling a sugar shaker. Walrus: “I was born long in the tooth!” Pelican: [big stupid pelican laughing noise]


Why ask it for a joke? That's such a bad way to try out a reasoning model


Would that be funny? Failing a job interview because, when asked to make a joke about a pelican and a walrus who run a tea room together, you couldn't come up with a quip, thereby showing your inability to reason.

Let's not even talk about the "r" you forgot when asked to write "cranberry"...


Creating a joke is a great demonstration of a multistage reasoning/deduction process


It's a great test. Creating a joke that is both funny and novel requires many steps.


Watching them attempt to reason their way to a joke is really amusing.


I asked deepseek-14b for a joke in the style of Norm Macdonald, and it delivered an overexplained, deadpan, absurd joke that I thought was pretty similar to Norm's style. Then I tried again with different styles and realized that all the jokes it tries to make are in that same style, as if it's following a manual on how to construct a joke without actually being funny. Which is pretty funny


Frankly, ollama + DeepSeek is all you need to win with open source AI. I will do some experiments today and add them to my initial blog post. https://medium.com/thoughts-on-machine-learning/deepseek-is-...


>a joke about a pelican and

Tell me you're simonw without telling me you're simonw...


Can you recommend hardware needed to run these?


I'm using an M2 64GB MacBook Pro. For the Llama 8B one I would expect 16GB to be enough.

I don't have any experience running models on Windows or Linux, where your GPU VRAM becomes the most important factor.


On Windows or Linux you can run from RAM or split layers between RAM and VRAM; running fully on GPU is faster than either of those, but the limit on what you can run at all isn’t VRAM.
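
Ollama does that split automatically based on what fits in VRAM; if you drop down to llama.cpp directly it's an explicit knob (a sketch - the filename is hypothetical, and -ngl is the number of layers to offload to the GPU):

    # offload 40 layers to VRAM, keep the rest in system RAM; tune -ngl to whatever fits
    llama-cli -m DeepSeek-R1-Distill-Llama-70B-Q3_K_M.gguf -ngl 40 -p "hello"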


So is it possible to load the ollama deepseek-r1 70b (43GB) model on my 24GB VRAM + 32GB RAM machine? Does this depend on how I load the model, i.e., with ollama instead of other alternatives? Afaik, ollama is basically a llama.cpp wrapper.

I have tried to deploy one myself with openwebui + ollama, but only for small LLMs. Not sure about the bigger ones - I'm worried that might crash my machine somehow. Are there any docs? I am curious about this and how it works, if any exist.


Why isn't GPU VRAM a factor on a Silicon mac?


Because there's no separate VRAM. The “regular” RAM on Apple Silicon devices is shared with the GPU virtually in its entirety.


ollama runs deepseek-r1:7b on AMD 8945HS, CPU-only, at ~12 tokens/s. You can get started pretty easily in the ~7B model range, for learning purposes.
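
If you want to measure tokens/s on your own hardware, ollama prints eval timings when you pass --verbose:

    ollama run deepseek-r1:7b --verbose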


I think "reasoning" models will solve the joke issue (amongst other issues), but not because they're "reasoning". Rather because they help solve the exploration issue and the scaling issue.

Having worked with LLMs a lot for my JoyCaption project, I've got all these hypotheses floating around in my head. I guess the short version, specifically for jokes, is that we lack "joke reasoning" data. The solution, like mathematical problems, is to get the LLM to generate the data and then RL it into more optimal solutions.

Longer explanation:

Imagine we want an LLM to correctly answer "How many r's are in the word strawberry?". And imagine that language has been tokenized, and thus we can form a "token space". The question is a point in that space, point Q. There is a set of valid points, set A, that encompasses _any_ answer to this question which is correct. There are thus paths through token space from point Q to the points contained by set A.

A Generator LLM's job is, given a point, predict valid paths through token space. In fact, we can imagine the Generator starting at point Q and walking its way to (hopefully) some point in set A, along a myriad of inbetween points. Functionally, we have the model predict next token (and hence point in token space) probabilities, and we can use those probabilities to walk the path.

An Ideal Generator would output _all_ valid paths from point Q to set A. A Generator LLM is a lossy compression of that ideal model, so in reality the set of paths the Generator LLM will output might encompass some of those valid paths, but it might also encompass invalid paths.

One more important thing about these paths. Imagine that there is some critical junction. A specific point where, if the Generator goes "left", it goes into a beautiful flat, grassy plain where the sun is shining. That area is really easy to navigate, and the Generator LLM's predictions are all correct. Yay! But if it goes "right" it ends up in the Fire Swamp with many dangers that it is not equipped to handle. i.e. it isn't "smart" enough in that terrain and will frequently predict invalid paths.

Pretraining already taught the Generator LLM to avoid invalid paths to the best of its abilities, but again its abilities are limited.

To fix this, we use RL. A Judge LLM takes a completed path and determines if it landed in the set A or not. With an RL algorithm and that reward signal, we can train the Generator LLM to avoid the Fire Swamp, since it often gets low rewards there, and instead goes to the Plain since it often gets rewards there.

This results in a Generator LLM that is more _reliable_ and thus more useful. The RL encourages it to walk paths it's good at and capable of, avoid paths it struggles with, and of course encourages valid answers whenever possible.

But what if the Generator LLM needs to solve a really hard problem? It gets set down at point Q, and explores the space based on its pretraining. But that pretraining _always_ takes it through a mountain and it never succeeds. During RL the model never really learns a good path, so these tend to manifest as hallucinations or vapid responses that "look" correct.

Yet there are very easy, long paths _around_ the mountain that gets to set A. Those don't get reinforced because they never get explored. They never get explored because those paths weren't in the pretraining data, or are so rare that it would take an impractical amount of exploration for the PT model to output them.

Reasoning is one of those long, easy paths. Digestible small steps that a limited Generator LLM can handle and use to walk around the mountain. Those "reasoning" paths were always there, and were predicted by the Ideal Generator, but were not explored by our current models.

So "reasoning" research is fundamentally about expanding the exploration of the pretrained LLM. The judge gets tweaked slightly to encourage the LLM to explore those kinds of pathways, and/or the LLM gets SFT'd with reasoning data (which is very uncommon in its PT dataset).

I think this breakdown and stepping back is important so that we can see what we're really trying to do here: get a limited Generator LLM to find its way around areas it can't climb. It is likely true that there is _always_ some path from a given point Q and set A that a limited Generator LLM can safely traverse, even if that means those paths are very long.

It's not easy for researchers to know what paths the LLM can safely travel. So we can't just look at Q and A and build a nice dataset for it. It needs to generate the paths itself. And thus we arrive at Reasoning.

Reasoning allows us to take a limited, pretrained LLM, and turn it into a little path finding robot. Early during RL it will find really convoluted paths to the solution, but it _will_ find a solution, and once it does it gets a reward and, hopefully, as training progresses, it learns to find better and shorter paths that it can still navigate safely.

But the "reasoning" component is somewhat tangential. It's one approach, probably a very good approach. There are probably other approaches. We just want the best ways to increase exploration efficiently. And we're at the point where existing written data doesn't cover it, so we need to come up with various hacks to get the LLM to do it itself.

The same applies to jokes. Comedians don't really write down every single thought in their head as they come up with jokes. If we had that, we could SFT existing LLMs to get to a working solution TODAY, and then RL into something optimal. But as it stands PT LLMs aren't capable of _exploring_ the joke space, which means they never come out of the RL process with humor.

Addendum:

Final food for thought. There's kind of a debate going on about "inference scaling", with some believing that CoT, ToT, Reasoning, etc. are all essentially just inference scaling. More output gives the model more compute so it can make better predictions. It's likely true that that's the case. In fact, if it _isn't_ the case we need to take a serious look at our training pipelines. But I think it's _also_ about exploring during RL. The extra tokens might give it a boost, sure, but the ability for the model to find more valid paths during RL enables it to express more of its capabilities and solve more problems. If the model is faced with a sheer cliff face it doesn't really matter how much inference compute you throw at it. Only the ability for it to walk around the cliff will help.

And, yeah, this all sounds very much like ... gradient descent :P and yes there have been papers on that connection. It very much seems like we're building a second layer of the same stuff here and it's going to be AdamW all the way down.


I’m on my phone so I can’t give this a proper response but I want to say that your mental intuition about the latent space algorithms is excellent and has improved my thinking. I haven’t seen much writing applying pathfinding (what we used to call AI, in the Half Life days) terminology to this. Your ideal generator sounds like letting A* run on all nodes in a grid and not exiting when the first path is found.

Mountains and cliffs are a good way to describe the topology of the weights in hyper-dimensional space, even though they are terms for a 2D surface.



