
I just don’t understand that — I thought deep neural nets were inherently hierarchical. Or at least emergently hierarchical?


Neural nets can be made hierarchical - probably the most notable example is the Convolutional Neural Network, so successfully promoted by Yann LeCun.
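
A toy sketch of what that hierarchy looks like (PyTorch, my own illustration rather than any specific published model): each conv stage consumes the previous stage's feature maps, so later layers respond to increasingly abstract features over larger receptive fields.

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1),   # edges, simple textures
                nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1),  # combinations of textures
                nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1),  # object parts
                nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    print(TinyCNN()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])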

But the issue with current LLM architectures is the idea of "predicting the next token", which sits at odds with the exercise of intelligence - where we search instead for the "neighbouring fitting ideas".

So, "hierarchical" in this context is there to express that it is typical of natural intelligence to refine an idea - formulating an hypothesis and improving its form (hence its expression) step after step of pondering. The issue of transparency in current LLMs, and the idea of "predicting the next token", do not help in having the idea of typical natural intelligence mechanism and the tentative interpretation of LLM internals match.


Is that true? There are many attention/MLP layers stacked on top of each other. Higher-level layers aren't performing attention on the input tokens, but on the output of the previous layer.
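
Concretely (PyTorch toy sketch, my own illustration, not any particular model's source): only the first block sees the token embeddings; every later block self-attends over the previous block's output.

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

        def forward(self, x):
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # attends over the previous layer's output
            return x + self.mlp(self.ln2(x))

    d = 64
    emb = nn.Embedding(1000, d)
    blocks = nn.ModuleList([Block(d) for _ in range(6)])

    x = emb(torch.randint(0, 1000, (1, 12)))  # only this step sees the input tokens
    for blk in blocks:                        # later blocks see only hidden states
        x = blk(x)
    print(x.shape)                            # torch.Size([1, 12, 64])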


> Is that true

Well, if you are referring to «The issue of transparency in current LLMs», I have not read an essay that satisfactorily explains the inner process and world modelling inside LLMs. Some pieces claim (guess?) that the engine has no idea what the overall concept of its reply will be before it has output all the tokens; others insist it seems impossible that it has no such idea before formulation...


There is a sense in which "predicting the next token" is roughly an append-only Turing machine. Obviously the tokens we're using might be suboptimal for wherever the "AGI" goalpost sits at any given time, but the structure/strategies of LLMs are probably not far from a really good one, modulo refactoring for efficiency like Mamba (but still doing token-stream prediction, especially during inference).
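
A concrete illustration of the append-only picture (assuming the Hugging Face transformers library and the public "gpt2" checkpoint are available; greedy decoding for simplicity): the context is only ever extended, never rewritten.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("The tape only grows:", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(20):
            logits = model(ids).logits[:, -1, :]           # score candidates for the next token
            next_id = logits.argmax(dim=-1, keepdim=True)  # greedy pick
            ids = torch.cat([ids, next_id], dim=1)         # append; nothing already written is erased
    print(tok.decode(ids[0]))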


Not necessarily.

For visual tasks, that is the state of the art, with visual features being grouped into more semantically relevant parts ("circles" grouped into "fluffy textures" grouped into "dog ears"). This hierarchy-building behavior is baked into the model.

For transformers, not so much. Although each transformer block's output serves as input for the next block, and they can learn hierarchical relationships (in latent space, not in human language), that is neither baked into nor enforced by the architecture.
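
A quick way to see that contrast in code (PyTorch sketch, my own illustration): a 3x3 convolution is architecturally forced to be local, so hierarchy emerges from stacking, while a single attention layer can already mix any two positions, so no coarse-to-fine hierarchy is imposed.

    import torch
    import torch.nn as nn

    # Conv: output pixel (0, 0) depends only on a local patch of the input.
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
    img = torch.zeros(1, 1, 8, 8, requires_grad=True)
    conv(img)[0, 0, 0, 0].backward()
    print((img.grad != 0).sum().item())  # 4 -- only the corner patch contributes

    # Attention: output position 0 already depends on every input position at layer 1.
    attn = nn.MultiheadAttention(16, 2, batch_first=True)
    seq = torch.randn(1, 8, 16, requires_grad=True)
    out, _ = attn(seq, seq, seq)
    out[0, 0].sum().backward()
    print((seq.grad.abs().sum(-1) != 0).sum().item())  # 8 -- all positions contribute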



