The lower-dimensional logits are discarded; the original high-dimensional latents are not.
But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer, not just the final one, with the most complex reasoning concentrated in the middle layers.
I don't think that's accurate. The logits are actually high-dimensional too (one value per vocabulary token), and they are the intermediate outputs used to sample tokens. The latent representations carry contextual information and are also high-dimensional, but they serve a different role: they feed into the logits.
The dimensionality, I suppose, depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection to go from latents to logits.
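To make that concrete, here’s a rough numpy sketch of that projection. The sizes are made up for illustration, not from any particular model:

```python
import numpy as np

# Hypothetical sizes, just for illustration (not from any particular model).
hidden_dim = 4096   # size of the latent (residual stream) vectors
vocab_size = 32000  # one logit per vocabulary token

rng = np.random.default_rng(0)

# The "LM head": a single linear projection from latent space to logit space.
W_unembed = rng.normal(scale=0.02, size=(hidden_dim, vocab_size))

# Final-layer latent for the last position in the sequence.
latent = rng.normal(size=hidden_dim)

# One matrix multiply, nothing else, turns the latent into logits.
logits = latent @ W_unembed            # shape: (vocab_size,)

# The sampler works on these logits; the latent itself stays inside the model.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```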
Reasoning is definitely not happening in the linear projection to logits, if that’s what you mean.
My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.
Either I'm wildly misunderstanding or that can't possibly be true: if you sample at high temperature and it chooses a very-low-probability token, it continues consistently with the chosen token, not with the more likely ones.
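Concretely, the picture I have in mind is the standard generation loop, something like this rough sketch (`model_forward` is just a stand-in for a real forward pass, not an actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.5):
    # Higher temperature flattens the distribution, so low-probability
    # tokens get chosen more often.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(model_forward, tokens, steps=10):
    # model_forward is a placeholder: it takes the token sequence so far
    # and returns next-token logits.
    for _ in range(steps):
        logits = model_forward(tokens)
        next_token = sample_next_token(logits)
        # Whatever the sampler picked, even a very unlikely token, gets
        # appended and becomes part of the context the model conditions on.
        tokens = tokens + [next_token]
    return tokens
```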
Attention computes a weighted average of all previous latents. So yes, it’s a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.
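A minimal single-head sketch of what I mean, ignoring multi-head splitting, positional encodings, and the output projection (the weight matrices and sizes are placeholders):

```python
import numpy as np

def single_head_attention(latents, Wq, Wk, Wv):
    # latents: (seq_len, hidden_dim), one row per position so far.
    Q, K, V = latents @ Wq, latents @ Wk, latents @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])

    # Causal mask: each position only attends to itself and earlier positions.
    seq_len = latents.shape[0]
    scores[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted average over (projections of) all earlier
    # latents, so the newest position's output mixes in a bit of every
    # previous one.
    return weights @ V

# Example with made-up sizes:
rng = np.random.default_rng(0)
seq_len, d = 5, 16
latents = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = single_head_attention(latents, Wq, Wk, Wv)   # shape: (5, 16)
```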