reedlaw's comments | Hacker News

Great feedback. On the Inverness to Gibraltar race the leaderboard has impossible times, including some negative numbers. According to Google Maps the best time is 1 day 8 hours, but this requires leaving Inverness at 6:44. The race starts at 9:00. Then the earliest arrival time in Paris is 22:19. According to Google Maps the best time from Paris to Gibraltar at this time of night is 1 day 2 hours. Despite all this, there are several 1 day records. The best I can do is 2 days 5 hours. Of course I may be missing some better route, but I suspect cheating, especially from names like AdolfHitler.

This is a fine and useful project, but my experience with newly printed classics is the quality is inferior for a number of reasons. Besides paper and binding, typesetting is something that older editions rarely messed up, but some new editions create a facsimile by scanning all the pages and then re-printing. That means that instead of getting the crisply defined letters of an old printing press, you get fuzzy letters and scan artifacts. This (https://printableclassics.com/harvard_classics) shows what I mean. Not only is the typesetting quality worse, but the price is much higher for the new edition. I don't have a problem with the price on Printable Classics ($885 for a new 50 volume set is reasonable), but you could often find the same thing cheaper used. A used set is $300-$600 on ebay. The value of these PDFs is that you could make a higher quality edition as long as the text is OCR'ed and properly typeset (which is true of the Moby Dick version on the site). For the scanned copies, it would be a big undertaking to re-typeset, but I'm sure LLMs could help.

I wonder how good a job the ClearScan feature in Adobe Acrobat would do. IIRC it creates one or more fonts based on the existing characters in the PDF. So each lowercase 'a' would look the same, and be a sort of average of all the 'a' letters in the book.

Interesting – I have some old British academic press books with "advanced" typographical features; I'll see how it does with them. (The kind you have to break out advanced typography features in Adobe apps to set: old-style and modern numerals, interpuncts, two types of ampersands – one for headings, one for body text – and swash alternate forms in titles.)

That's neat. I'll look into that.

Sorry, apparently they renamed it to 'Editable text and images'.

I agree. I wish I had the time to retypeset those. I would be concerned, though, with mildew/mold on old, used versions.

That's true. Older books require skill to restore, so choosing old or new is a balance.

This article (https://michaelmangialardi.substack.com/p/the-celestial-mirr...) comes to similar conclusions as the parent article, and includes some tests (e.g. https://colab.research.google.com/drive/1kTqyoYpTcbvaz8tiYgj...) showing that LLMs, while good at understanding, fail at intellectual reasoning. The fact that they often produce correct outputs has more to do with training and pattern recognition than with an ability to grasp necessity and abstract universals.

They neither understand nor reason. They don’t know what they’re going to say, they only know what has just been said.

Language models don’t output a response, they output a single token. We’ll use token==word shorthand:

When you ask “What is the capital of France?” it actually only outputs: “The”

That’s it. Truly, that IS the final output. It is literally a one-way algorithm that outputs a single word. It has no knowledge, no memory, and it doesn’t know what’s next. As far as the algorithm is concerned it’s done! It outputs ONE token for any given input.

Now, if you start over and put in “What is the capital of France? The” it’ll output “ “. That’s it. Between your two inputs were a million others, none of them have a plan for the conversation, it’s just one token out for whatever input.

But if you start over yet again and put in “What is the capital of France? The “ it’ll output “capital”. That’s it. You see where this is going?

Then someone uttered the words that have built and destroyed empires: “what if I automate this?” And so it was that the output was piped directly back into the input, probably using AutoHotKey. But oh no, it just kept adding one word at a time until it ran out of memory. The technology got stuck there for a while, until someone thought “how about we train it so that <DONE> is an increasingly likely output the longer the loop goes on? Then, when it eventually says <DONE>, we’ll stop pumping it back into the input and send it to the user.” Booya, a trillion dollars for everyone but them.
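
To make that loop concrete, here's a minimal sketch in Python (next_token() is just a stand-in for the model, not any real API):

    def generate(prompt, next_token, max_tokens=200):
        # the model only ever produces ONE token per call; the "response"
        # is just this loop feeding its own output back in as input
        output = []
        text = prompt
        for _ in range(max_tokens):
            token = next_token(text)   # stateless: no memory between calls
            if token == "<DONE>":      # trained stop token ends the loop
                break
            output.append(token)
            text += token              # start over with the longer input
        return "".join(output)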

It’s truly so remarkable that it gets me stuck in an infinite philosophical loop in my own head, but seeing how it works the idea of ‘think’, ‘reason’, ‘understand’ or any of those words becomes silly. It’s amazing for entirely different reasons.


Yes, LLMs mimic a form of understanding partly because language itself embeds concepts, and that conceptual structure is preserved when text is embedded geometrically in vector space.

Your continued use of the word “understanding” hints at a lingering misunderstanding. They’re stateless one-shot algorithms that output a single word for any given input. Not even a single word, it’s a single token. It isn’t continuing a sentence or thought it had; you literally have to feed the output back in as input and it’ll guess at the next partial word.

By default that would be the same word every time you give the same input. The only reason it isn’t is because the fuzzy randomized selector is cranked up to max by most providers (temp + seed for randomized selection), but you can turn that back down through the API and get deterministic outputs. That’s not a party trick, that’s the default of the system. If you say the same thing it will output the same single word (token) every time.
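
For the curious, the sampling step being described looks roughly like this (a toy sketch over raw logits, not any provider's actual code):

    import numpy as np

    def pick_token(logits, temperature=1.0, seed=None):
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))      # greedy: same input -> same token, every time
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
        probs /= probs.sum()
        rng = np.random.default_rng(seed)      # fixed seed -> reproducible "randomness"
        return int(rng.choice(len(probs), p=probs))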

You see the aggregate of running it through the stateless algorithm 200+ times before the collection of one-by-one guessed words is sent back to you as a response. I get it: you put something into the glowing orb and it shot back a long coherent response with personality, so it must be doing something. But the system truly only outputs one token with zero memory. It’s stateless, meaning nothing internally changed, so there is no memory to remember it wants to complete that thought or sentence. After it outputs “the” the entire thing resets to zero and you start over.


I'm using the Aristotelian definition of my linked article. To understand a concept you have to be able to categorize it correctly. LLMs show strong evidence of this, but it is mostly due to the fact that language itself preserves categorical structure, so when embedded in geometrical space by statistical analysis, it happens to preserve Aristotelian categories.
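
A toy illustration of what "preserving categorical structure geometrically" means (the vectors below are made up for the example, not real embeddings):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # hypothetical 3-d "embeddings"; real models use hundreds or thousands of dimensions
    dog    = np.array([0.9, 0.8, 0.1])
    cat    = np.array([0.8, 0.9, 0.2])
    treaty = np.array([0.1, 0.2, 0.9])

    print(cosine(dog, cat))     # high: same category (animal)
    print(cosine(dog, treaty))  # low: different category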

isn't intellectual reasoning just pattern recognition + a forward causal token generation mechanism?

You can replicate an LLM:

You and a buddy are going to play “next word”, but it’s probably already known by a better name than I made up.

You start with one word, ANY word at all, and say it out loud, then your buddy says the next word in the yet unknown sentence, then it’s back to you for one word. Loop until you hit an end.

Let’s say you start with “You”. Then your buddy says the next word out loud, also whatever they want. Let’s go with “are”. Then back to you for the next word, “smarter” -> “than” -> “you” -> “think.”

Neither of you knew what you were going to say, you only knew what was just said so you picked a reasonable next word. There was no ‘thought’, only next token prediction, and yet magically the final output was coherent. If you want to really get into the LLM simulation game then have a third person provide the first full sentence, then one of you picks up the first word in the next sentence and you two continue from there. As soon as you hit a breaking point the third person injects another full sentence and you two continue the game.

With no idea what either of you is going to say and no clue about what the end result will be, no thought or reasoning at all, it won’t be long before you’re sounding super coherent while explaining thermodynamics. But in one of the rounds someone’s going to mess it up, like “gluons” -> “weigh” -> “…more?…” -> “…than…(damnit Gary)…”, but you must continue the game and finish the sentence, then sit back and think about how you just hallucinated an answer without thinking, reasoning, understanding, or even knowing what you were saying until it finished.


that's not how llms work. study the transformer architecture. every token is conditioned not just on the previous token, but each layer's activation generates a query over the kv cache of the previous activations, which means that each token's generation has access to any higher order analytical conclusions and observations generated in the past. information is not lost between the tokens like your thought exercise implies.
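
a rough numpy sketch of one attention step over a KV cache, just to illustrate the point (toy code, not a real model):

    import numpy as np

    def attend(query, cached_keys, cached_values):
        # query: (d,); cached_keys/cached_values: (t, d) for the t positions so far
        scores = cached_keys @ query / np.sqrt(query.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()        # softmax over ALL previous positions
        return weights @ cached_values  # the new token mixes information from the whole past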

“The cow goes ‘mooooo’”

“that’s not how cow work. study bovine theory. contraction of expiratory musculature elevates abdominal pressure and reduces thoracic volume, generating positive subglottal pressure…”


Obviously not. In actual thinking, we can generate an idea, evaluate it for internal consistency and consistency with our (generally much more than linguistic, i.e. may include visual imagery and other sensory representations) world models, decide this idea is bad / good, and then explore similar / different ideas. I.e. we can backtrack and form a branching tree of ideas. LLMs cannot backtrack, do not have a world model (or, to the extent they do, this world model is solely based on token patterns), and cannot evaluate consistency beyond (linguistic) semantic similarity.

There's no such thing as a "world model". That is metaphor-driven development from GOFAI, where they'd just make up a concept and assume it existed because they made it up. LLMs are capable of approximating such a thing because they are capable of approximating anything if you train them to do it.

> or, to the extent they do, this world model is solely based on token patterns

Obviously not true because of RL environments.


> There's no such thing as a "world model"

There obviously is in humans. When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world. This is presumably done by having associations in our brain between all the different qualia sequences and other kinds of representations in our mind. I.e. we know we do some visuospatial reasoning tasks using sequences of (imagined) images. Imagery is one aspect of our world model(s).

We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens. A VLM or other multimodal might be able to do so, but an LLM can't, and so an LLM can't have a visual world model. They might in special cases be able to construct a linguistic model that lets them do some computer vision tasks, but the model will itself still only be using tokenized words.

There are all sorts of other sensory modalities and things that humans use when thinking (i.e. actual logic and reasoning, which goes beyond mere semantics and might include things like logical or other forms of consistency, e.g. consistency with a relevant mental image), and the "world model" concept is supposed, in part, to point to these things that are more than just language and tokens.

> Obviously not true because of RL environments.

Right, AI generally can have much more complex world models than LLMs. An LLM can't even handle e.g. sensor data without significant architectural and training modification (https://news.ycombinator.com/item?id=46948266), at which point, it is no longer an LLM.


> When you visually simulate things or e.g. simulate how food will taste in your mind as you add different seasonings, you are modeling (part of) the world.

Modeling something as an action is not "having a world model". A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.

> We know LLMs can't be doing visuospatial reasoning using imagery, because they only work with text tokens.

All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.


> Modeling something as an action is not "having a world model".

It literally is; this is definitional. See, e.g., how these terms are used in the V-JEPA 2 paper (https://arxiv.org/pdf/2506.09985). EDIT: Maybe you are unaware of what the term means and how it is used; it does not mean “a model of all of reality”. We don’t have a single world model, but many world models that are used in different contexts.

> A model is a consistently existing thing, but humans don't construct consistently existing models because it'd be a waste of time. You don't need to know what's in your trash in order to take the trash bags out.

Both sentences are obviously just completely wrong here. I need to know what is in my trash, and how much, to decide if I need to take it out, and how heavy it is may change how I take it out too. We construct models all the time, some temporary and forgotten, some which we hold within us for life.

> All frontier LLMs are multimodal to some degree. ChatGPT thinking uses it the most.

LLMs by definition are not multimodal. Frontier models are multimodal, but only in a very weak and limited sense, as I address in e.g. other comments (https://news.ycombinator.com/item?id=46939091, https://news.ycombinator.com/item?id=46940666). For the most part, none of the text outputs you get from a frontier model are informed by or using any of the embeddings or semantics learned from images and video (in part due to lack of data and cost of processing visual data), and only certain tasks will trigger e.g. the underlying VLMs. This is not like humans, where we use visual reasoning and visual world models constantly (unless you are a wordcel).

And most VLM architectures are multi-modal in a very limited or simplistic way still, with lots of separately pre-trained backbones (https://huggingface.co/blog/vlms-2025). Frontier models are nowhere near being even close to multimodal in the way that human thinking and reasoning is.


"LLMs cannot backtrack". This is exactly wrong. LLMs always see everything in the past. In this sense they are more efficient than turing machines, because (assuming sufficiently large context length) every token sees ALL previous tokens. So, in principle, an LLM could write a bunch of exploratory shit, and then add a "tombstone" "token" that can selectively devalue things within a certain timeframe -- aka just de exploratory thngs (as judged by RoPE time), and thus "backtrack".

I put "token" in quotes because this would obviously not necessarily be an explicit token, but it would have to be learned group of tokens, for example. But who knows, if the thinking models have some weird pseudo-xml delimiters for thinking, it's not crazy to think that an LLM could shove this information in say the closer tag.


> "LLMs cannot backtrack". This is exactly wrong.

If it wasn't clear, I am talking about LLMs in use today, not ultimate capabilities. All commercial models are known (or believed) to be recursively applied transformers without e.g. backspace or "tombstone" tokens, like you are mentioning here.

But yes, absolutely LLMs might someday be able to backtrack, either literally during token generation if we allow e.g. backspace tokens (there was at least one paper that did this) or more broadly at the chain of thought level, with methods like you are mentioning.


a tombstone "token "doesnt have to be an actual token, nor does it have to be explicitly carved out into the tokenizer. it can be learned. unless you have looked into the activations of a SOTA llm you cant categorically say that one (or 80% of one, fir example) doesn't exist.

We CAN categorically say that no such token or cluster of tokens exists, because we know how LLMs and tokenizers work.

Current LLM implementations cannot delete output text, i.e. they cannot remove text from their context window. The recursive application is such that outputs are always expanding what is in the window, so there is no backtracking like humans can do, i.e. "this text was bad, ignore it and remove from context". That's part of why we got crazy loops / spirals like we did with the "show me the seahorse emoji" prompts.

Backtracking needs more than just a special token or cluster of tokens, but also for the LLM behaviour to be modified when it sees that token or token cluster. This must be manually coded in, it cannot be learned.


without claiming this is actually happening, it is certainly possible to synthetically create a token that ablates the values retrieved from queries in a certain relative time range (via transformations induced by e.g. RoPE encoding)

I've already tried to do what the article claims to be doing: handing off the context of the current session to another model. I tried various combinations of hooks, prompts and workarounds, but nothing worked like the first screenshot in the article implies ("You've hit your limit [...] Use an open source local LLM"). The best I could come up with is to watch for the high-usage warning and then ask Claude to create a HANDOFF.md with the current context. Then I could load that into another model. Anyone have any better solutions?


Could you give examples of excellent typing keyboards from China for $30-50? Every mechanical keyboard I've owned eventually suffered from key chatter or inconsistent actuation.


Just get a hall effect or TMR keyboard and all your problems with key chatter will go away. Also, I recommend you just build it yourself. It's a fun hobby and if you don't know how to make PCBs it's a great way to learn (keyboards are one of the easiest things to make from a PCB complexity standpoint).

Ever play "connect the dots" as a kid? That's what it's like making a keyboard PCB. It's the adult version of "connect the dots".

It's not a "rabbit hole", it's a pending addiction :D


From the article I like the characteristics of hall effect better than TMR (although one of the cons under HE, "Since the sensor is reading magnet position, any wobble in the switch can change the magnet’s alignment and affect the signal", is a bit troubling). There are indeed $30-50 ones on Amazon. Any particular brand recommendations?


If you notice these things at all, you are the target market for an analog keyboard.


Under the Cons section the article says:

> Full analog functionality often depends on proprietary software support (and not all boards execute it well).

Could you elaborate how that works? I'm on Linux. I find with Keychron I can visit the web-based tool to configure the keyboard, but if it's proprietary software I'm out of luck.


I use a Wooting which is largely open source on GitHub. Also their config tool is only needed to set up your keyboard, the config is saved onto the keyboard and persists across devices, OSes, etc.

Looks like their config software works on Linux if you set up udev to allow it to flash the keyboard: https://help.wooting.io/article/147-configuring-device-acces...


Great article, but doesn't address the fundamental issue: defining quality. Other than some objective metrics like code coverage, there is little agreement about what constitutes good code. The closest thing to a consensus might be the rules encoded in linters/formatters. Each Rubocop or eslint rule had to go through code review and public scrutiny to be included and maintained. Most often the rules are customized per project/team. Of course this runs into the same problem the article mentions: narrowness of vision. It seems the only way to achieve a high-minded ideal is the BDFL model of software development.


Unrelated to parent thread: Sorry to ping you here @Reedlaw but it appears the email on your personal site is rejecting incoming mail.

I'm trying to reach you regarding an older rubygems package you maintain. We're trying to offer you some $ to rename it because we maintain a large project with the same name in other languages and want to start offering it in ruby. I sent you a connect request + message on Linkedin from https://www.linkedin.com/in/nicksweeting/


Unfortunately, reading code before the merge commit is not always a firm part of human teamwork. Neither code reading nor test coverage by itself is sufficient to ensure quality.


Gergely Orosz (The Pragmatic Engineer) interviewed Yegge [1] and Kent Beck [2], both experienced engineers before vibe coding, and they express similar sentiments about how LLMs reinvigorated their enjoyment of programming. This introduction to Gas Town is very clear on its intended audience with plenty of warnings against overly eager adoption. I agree that using tools like this haphazardly could lead to disaster, but I would not dismiss the possibility that they could be used productively.

1. https://www.youtube.com/watch?v=TZE33qMYwsc

2. https://www.youtube.com/watch?v=aSXaxOdVtAQ


Anecdote, but some of the time, when I'm blasted after a day of thinking for my job, a design session randomly throwing shit at an LLM hits the spot. I usually make some meaningful progress on a pet project. I rarely let the LLM do much pure vibe coding. I iterate with several LLMs until it looks and feels right and then hack on it myself, or let the LLM do drudgery like refactoring or boilerplate to get me over the humps. In that sense I do strongly agree.


Beck was in Melbourne a few weeks ago, and his take on LLM usage was so far divorced from what Yegge is doing that their views on what LLMs are capable of in early 2026 are irreconcilable.


What does Beck think?


He was the keynote at YOW!, so I can't capture all the nuance, and I hope I'm not doing him a disservice with my interpretation, but the tl;dr is:

"LLMs drastically decrease the cost of experimenting during the very earliest phases of a project, like when you're trying to figure out if the thing is even worth building or a specific approach might yield improvements, but loses efficacy once you're past those stages. You can keep using LLMs sustainably with a very tight loop of telling it to do the thing the cleaning up the results immediately, via human judgement."

I.e, I don't think he can relate at all to the experience of letting them run wild and getting a good result.


That sounds nearly perfect for FSRS [1], the default spaced repetition algorithm used by Anki. FSRS estimates memory stability, defined as the time it takes for retrievability (the probability of recall) to decline from 100% to 90%. At the estimated 90% retrievability point, FSRS schedules a review, so naturally a mature deck of flashcards hovers between 90-100% retrievability.

1. https://expertium.github.io/Algorithm.html
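
For intuition, here is a rough sketch of the retrievability idea (simplified exponential form; FSRS itself fits its own forgetting curve with per-user parameters):

    def retrievability(days_elapsed, stability_days):
        # by construction, R is exactly 90% when days_elapsed == stability_days
        return 0.9 ** (days_elapsed / stability_days)

    print(retrievability(10, 10))  # 0.9   -> due for review under a 90% retention target
    print(retrievability(30, 10))  # ~0.73 -> overdue; FSRS would have scheduled a review sooner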


The first sentence is problematic:

> For decades, we’ve all known what “good code” looks like.

When relatively trivial concerns such as the ideal length of methods haven't achieved consensus, I doubt there can be any broadly accepted standard for software quality. There are plenty of metrics such as test coverage, but anyone with experience could tell you how easy it is to game those and that enforcing arbitrary standards can even cause harm.


I agree. Moreover, I submit that “good code” isn’t even a universal constant, but context-sensitive along several dimensions.


> When relatively trivial concerns such as the ideal length of methods haven't achieved consensus

Is the consensus not that there isn't one? Surely that's the only consensus to reach? I don't see how there could possibly be an "ideal length", whatever you pick it'd be much too dogmatic.


John Carmack's and Martin Fowler's coding style advice are diametrically opposed. Carmack advocates inlining complex code that is only used once. Fowler advocates extracting it with a good name to clarify intent. I'm not sure the two views can be reconciled except by noting that they address separate concerns. Carmack prioritizes visibility while Fowler prioritizes intent.
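
A toy contrast of the two styles (example is mine, not from either author):

    # Carmack-style: inline the once-used logic so the whole flow reads top to bottom
    def order_total_inline(order):
        total = sum(item["price"] * item["qty"] for item in order["items"])
        if order.get("coupon") == "SAVE10":
            total *= 0.9
        return round(total, 2)

    # Fowler-style: extract and name each step so the intent is explicit
    def order_total_extracted(order):
        return round(apply_discount(subtotal(order), order), 2)

    def subtotal(order):
        return sum(item["price"] * item["qty"] for item in order["items"])

    def apply_discount(total, order):
        return total * 0.9 if order.get("coupon") == "SAVE10" else total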


With all due respect to these two, who as programmers are on a whole different dimension than me... this seems like a case where either their words were taken out of context, or it's one of the millions of cases of brilliant people hyperfixating on their particular domain and mistakenly extrapolating it to everywhere. Their advice could well be right for the particular type of code each of them worked on!

But taking them as general rules for coding makes as much sense as applying advice for painting a bridge to painting the Mona Lisa. Seriously, try to come up with a single piece of advice about programming style that applies to every domain. The closest one I can think of is "give descriptive names to your variables", and even that doesn't apply to lots of code written to this very day. It's impossible.

Software in 2025 is far too varied for any of that to make sense, and it has been for many decades.

