This feels like a failure to learn the bitter lesson: You're just taking the tra...

mdp2021 · 2025-01-01T09:29:09 1735723749

It is explicitly stated in the paper that

> One may argue that LLMs are implicitly learning a hierarchical representation, but we stipulate that models with an explicit hierarchical architecture are better suited to create coherent long-form output

And the problem remains that (text surrounding the above):

> Despite the undeniable success of LLMs and continued progress, all current LLMs miss a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction. The human brain does not operate at the word level only. We usually have a top-down process to solve a complex task or compose a long document: we first plan at a higher level the overall structure, and then step-by-step, add details at lower levels of abstraction. [...] Imagine a researcher giving a fifteen-minute talk. In such a situation, researchers do not usually prepare detailed speeches by writing out every single word they will pronounce. Instead, they outline a flow of higher-level ideas they want to communicate. Should they give the same talk multiple times, the actual words being spoken may differ, the talk could even be given in different languages, but the flow of higher-level abstract ideas will remain the same. Similarly, when writing a research paper or essay on a specific topic, humans usually start by preparing an outline that structures the whole document into sections, which they then refine iteratively. Humans also detect and remember dependencies between the different parts of a longer document at an abstract level. If we expand on our previous research writing example, keeping track of dependencies means that we need to provide results for each of the experiment mentioned in the introduction. Finally, when processing and analyzing information, humans rarely consider every single word in a large document. Instead, we use a hierarchical approach: we remember which part of a long document we should search to find a specific piece of information. To the best of our knowledge, this explicit hierarchical structure of information processing and generation, at an abstract level, independent of any instantiation in a particular language or modality, cannot be found in any of the current LLMs

motoboi · 2025-01-01T12:52:39 1735735959

I suppose humans need high level concepts because we can only hold 7[] things in working memory. Computers don’t have that limitation.
Also, humans cannot iterate over thousands of possibilities in a second, like computers do.
And finally, animal brains are severely limited by heat dissipation and energy input flow.
Based on that, artificial intelligence may arise from unexpected simple strategies, given the fundamental differences in scale and structure from animal brains.

- where 7 is whatever number is the correct number nowadays.

dr_dshiv · 2025-01-01T14:00:14 1735740014

I just don’t understand that — I thought deep neural nets were inherently hierarchical. Or at least emergently hierarchical?

mdp2021 · 2025-01-01T16:31:46 1735749106

Neural Nets can be made to be hierarchical - I would say a most notable example is the Convolutional Neural Network so successfully promoted by Yann Le Cun.

But the issue with the LLMs architectures in place is with the idea of "predicting the next token", so strident with the exercise of intelligence - where we search instead for the "neighbouring fitting ideas".

So, "hierarchical" in this context is there to express that it is typical of natural intelligence to refine an idea - formulating an hypothesis and improving its form (hence its expression) step after step of pondering. The issue of transparency in current LLMs, and the idea of "predicting the next token", do not help in having the idea of typical natural intelligence mechanism and the tentative interpretation of LLM internals match.

nightski · 2025-01-01T17:26:37 1735752397

Is that true? There are many attention/mlp layers stacked on top of each other. Higher level layers aren't performing attention on input tokens, but instead on the output of the previous layer.

mdp2021 · 2025-01-01T17:38:57 1735753137

> Is that true

Well, if you are referring to «The issue of transparency in current LLMs», I have not read an essay that explains satisfactorily the inner process and world modelling inside LLMs. Some pieces say (guess?) that the engine has no idea what the whole concept in the reply would be before outputting all the tokens, others swear it seems impossible it has no such idea before formulation...

throwawaymaths · 2025-01-01T18:23:33 1735755813

there is a way that "predicting the next token" is ~append-only turing machine. Obviously the tokens we're using might be suboptimal for whatever goalpost "agi" is at any given time, but the structure/strategies of LLMs is probably not far from a really good one, modulo refactoring for efficiency like MAMBA (but still doing token stream prediction, esp. during inference)

motoboi · 2025-01-02T03:25:16 1735788316

Not necessarily.

For visual tasks, that is the state of the art, with visual features being "gouped" into more semantically relevant parts ("circles" grouped into "fluffy textures" grouped into "dog ears"). This hierarchy building behavior is baked into the model.

For transformers, not so much. Although each transformer block output serve as input for the next block, they can learn hierarchical relationship (in latent space, not in human language), but that is not backed nor enforced in the architecture.

anon373839 · 2025-01-01T09:08:02 1735722482

The bitter lesson isn’t a law of nature, though. And as GPT-style LLMs appear to be at the foot of a scaling wall, I personally think inductive bias is due for a comeback.

Der_Einzige · 2025-01-01T09:09:00 1735722540

Everyone keeps claiming this but we have zero evidence of any kind of scaling wall what-so-ever. Oh you mean data? Synthetic Data, Agents, and Digitization solve that.

anon373839 · 2025-01-01T09:31:17 1735723877

I disagree, but I also wasn’t referring to the exhaustion of training materials. I am referring to the fact that exponentially more compute is required to achieve linear gains in performance. At some point, it just won’t be feasible to do $50B training runs, you know?

throw5959 · 2025-01-01T10:05:38 1735725938

50B still seems reasonable compared to the revenue of the Big AI companies.

mentalgear · 2025-01-01T11:08:36 1735729716

what revenues? If by big AI companies you mean llm service providers (OpenAI, ...), their revenues are far from high or profitable. https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...

Maybe Nvidia, but they are a chip / hardware maker first. And even for them 50B training run with no exponential gains seems unreasonable.

Better to optimize the architecture / approach first, which also is what most companies are doing now before scaling out.

throw5959 · 2025-01-02T10:27:31 1735813651

It's not unusual to make infrastructure investments that will pay off in 30-50 years. I don't see why not an AI model - unless it's not true that we're at the end of scaling.

cubefox · 2025-01-01T09:54:07 1735725247

There were multiple reports confirming that OpenAI's Orion (planned to be GPT-5) yielded unexpectedly weak results.

pegasus · 2025-01-01T19:34:09 1735760049

And not just OpenAI is facing this problem. Anthropic and Google as well.

Der_Einzige · 2025-01-01T20:54:36 1735764876

So Deepseek V3 did nothing to show you how wrong this take is?

UltraSane · 2025-01-01T20:23:21 1735763001

And costs $500 million per training run.

UltraSane · 2025-01-01T20:22:18 1735762938

There seems to be a affordable scaling wall.

Jensson · 2025-01-01T12:01:41 1735732901

> You're just taking the translation to concepts that the LLM is certainly already doing and trying to make it explicitly forced.

That is what tokens are doing in the first place though, and you get better results with tokens instead of letters.

mdp2021 · 2025-01-01T12:09:55 1735733395

Well, individual letters in these languages in use* do not convey specific meaning, while individual tokens do - so, you cannot really construe a ladder that would go from letter to token, then from token to sentence.

This said, to research whether the search for concepts (in the solutions space) works better than the search for tokens seems absolutely dutiful, in absence of a solid theory that showed otherwise.

(*Sounds convey their own meaning e.g. in proto-Indo-European according to some interpretations, but that becomes too remote in the current descendants - you cannot reconstruct the implicit sound-token in words directly in English, just from the spelling.)

IanCal · 2025-01-01T12:30:29 1735734629

Is that true? I thought there was a desire to move towards byte level work rather than tokens, and that the benefits of tokens was more that you are reducing the context size for the same input.

fngjdflmdflg · 2025-01-01T18:02:43 1735754563

>there was a desire to move towards byte level work rather than tokens

Yeah, latest work on this is from Meta a last month.[0] It showed good results.

[0] https://ai.meta.com/research/publications/byte-latent-transf... (https://news.ycombinator.com/item?id=42415122)

mdp2021 · 2025-01-01T08:23:22 1735719802

That should be proven. The two approaches - predicting tokens vs predicting "sentences" - should be compared to see how much their output differ in terms of quality.

Edit2: ...and both (and their variants) be compared to other ideas such as "multi-token prediction"...

Edit: or, appropriateness of the approach should be demonstrated after acquired "transparency" of how the LLMs effectively internally work. I am not aware of studies that make the inner workings of LLMs adequately clear.

Edit3: Substantially, the architecture should be as solid as possible (and results should reflect that).

blackeyeblitzar · 2025-01-01T15:05:59 1735743959

Isn’t “sentence prediction” roughly the same as multi token prediction of sufficient length? In the end are we just talking about a change to hyper parameters or maybe a new hyper parameter that controls the granularity of “prediction length”?

mdp2021 · 2025-01-01T16:19:15 1735748355

> multi token prediction of sufficient length

Is multi token prediction the same as predicting the embedding of a complex token (the articulation of those input tokens in a sentence)?

blackeyeblitzar · 2025-01-01T18:54:40 1735757680

To be honest I don’t know. Maybe the only way to know is to build and measure all these variations.

macawfish · 2025-01-01T16:34:51 1735749291

At a performance boost of 10-100x :)