
Seems like a big deal. I feel like this could enable a new level of modularity and compatibility between publicly available weight sets, assuming they use similar channel dimensions. Maybe it also provides a nice formalism for thinking about fine-tuning, where you could adopt certain heuristics for adding/removing key-value pairs from the Pattention layers.

One interesting thing to note: it sounds like model scaling happens on the fly by adding key-value pairs as rows to the K and V matrices in the Pattention layer. That suggests that weights represented by tokens in the first rows may be more important than weights in later rows. There may be a lot you could do with that ordering of weights in terms of pruning and such.
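
Rough sketch of what I mean, in PyTorch. This is my own simplification, not the paper's exact formulation: the plain GELU scoring with no normalization, the zero-init for new rows, and the dimensions are all guesses on my part, just to show the "grow by appending parameter rows" idea.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Pattention(nn.Module):
        # Inputs attend over learned key/value parameter rows instead of
        # being multiplied by a fixed weight matrix.
        def __init__(self, d_in, d_out, n_param_tokens):
            super().__init__()
            self.key_params = nn.Parameter(0.02 * torch.randn(n_param_tokens, d_in))
            self.value_params = nn.Parameter(0.02 * torch.randn(n_param_tokens, d_out))

        def forward(self, x):                        # x: (batch, seq, d_in)
            scores = x @ self.key_params.t()         # (batch, seq, n_param_tokens)
            weights = F.gelu(scores)                 # stand-in for the paper's scoring function
            return weights @ self.value_params       # (batch, seq, d_out)

        @torch.no_grad()
        def grow(self, n_new):
            # Scale up by appending parameter rows. With zero-init the new
            # rows contribute nothing at first (gelu(0) = 0), so the layer
            # computes the same function at the moment of growth.
            d_in = self.key_params.shape[1]
            d_out = self.value_params.shape[1]
            self.key_params = nn.Parameter(
                torch.cat([self.key_params, torch.zeros(n_new, d_in)]))
            self.value_params = nn.Parameter(
                torch.cat([self.value_params, torch.zeros(n_new, d_out)]))

    layer = Pattention(d_in=64, d_out=64, n_param_tokens=128)
    x = torch.randn(2, 10, 64)
    y_before = layer(x)
    layer.grow(32)                                   # rows 128..159 are the "later" weights
    y_after = layer(x)
    print(torch.allclose(y_before, y_after))         # True: growth preserves the function here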



Unless I’m reading it wrong, I don’t think rows matter. Attention doesn’t take sequence position into account natively; that’s why positional encodings exist.


I'm talking about the rows in the new K and V matrices introduced by the paper, not rows in the input sequence. The ordering of those rows does matter, in the sense that rows further down were appended later in training, when new parameter tokens were added during scaling. So those newer parameters may represent knowledge that is less fundamental and more about fine-tuning on the training set.
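
Purely speculative, but if that ordering holds, pruning a scaled-up layer back down could be as crude as dropping the trailing parameter rows (toy numbers below, reusing the shapes from my sketch above):

    import torch

    # Toy stand-in for one Pattention layer's parameter tokens:
    # 160 rows = 128 original + 32 appended later during scaling.
    key_params = torch.randn(160, 64)
    value_params = torch.randn(160, 64)

    # Drop the later-added rows on the guess that they encode the least
    # fundamental knowledge; some recovery fine-tuning would likely be needed.
    key_params = key_params[:128]
    value_params = value_params[:128]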


But after adding new rows, I think the entire network is retrained.



