
Seems like a big deal. I feel like this could enable a new level of modularity and compatibility between publicly available weight sets, assuming they use similar channel dimensions. Maybe it also provides a nice formalism for thinking about fine-tuning, where you could adopt certain heuristics for adding/removing key-value pairs from the Pattention layers.

One interesting thing to note: it sounds like model scaling happens on the fly by adding key-value pairs as rows to the K and V matrices in the Pattention layer. That suggests that weights represented by tokens in the first rows may be more important than weights in later rows. There may be a lot you could do with that ordering of weights in terms of pruning and such.
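
Rough sketch of what I mean, in PyTorch. This is my own simplification, not the paper's exact formulation: the plain GELU scoring with no normalization, the zero-init for new rows, and the dimensions are all guesses on my part, just to show the "grow by appending parameter rows" idea.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Pattention(nn.Module):
        # Inputs attend over learned key/value parameter rows instead of
        # being multiplied by a fixed weight matrix.
        def __init__(self, d_in, d_out, n_param_tokens):
            super().__init__()
            self.key_params = nn.Parameter(0.02 * torch.randn(n_param_tokens, d_in))
            self.value_params = nn.Parameter(0.02 * torch.randn(n_param_tokens, d_out))

        def forward(self, x):                        # x: (batch, seq, d_in)
            scores = x @ self.key_params.t()         # (batch, seq, n_param_tokens)
            weights = F.gelu(scores)                 # stand-in for the paper's scoring function
            return weights @ self.value_params       # (batch, seq, d_out)

        @torch.no_grad()
        def grow(self, n_new):
            # Scale up by appending parameter rows. With zero-init the new
            # rows contribute nothing at first (gelu(0) = 0), so the layer
            # computes the same function at the moment of growth.
            d_in = self.key_params.shape[1]
            d_out = self.value_params.shape[1]
            self.key_params = nn.Parameter(
                torch.cat([self.key_params, torch.zeros(n_new, d_in)]))
            self.value_params = nn.Parameter(
                torch.cat([self.value_params, torch.zeros(n_new, d_out)]))

    layer = Pattention(d_in=64, d_out=64, n_param_tokens=128)
    x = torch.randn(2, 10, 64)
    y_before = layer(x)
    layer.grow(32)                                   # rows 128..159 are the "later" weights
    y_after = layer(x)
    print(torch.allclose(y_before, y_after))         # True: growth preserves the function here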



Unless I’m reading it wrong, I don’t think rows matter. Attention doesn’t take sequence position into account natively; that’s why positional encodings exist.


I'm talking about the rows in the new K and V matrices introduced by the paper, not rows in the input sequence. The ordering of those rows does matter, in the sense that rows further down were appended later in training, when new parameter tokens were added during scaling. So those newer parameters may represent knowledge that is less fundamental and more about fine-tuning on the training set.
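
Purely speculative, but if that ordering holds, pruning a scaled-up layer back down could be as crude as dropping the trailing parameter rows (toy numbers below, reusing the shapes from my sketch above):

    import torch

    # Toy stand-in for one Pattention layer's parameter tokens:
    # 160 rows = 128 original + 32 appended later during scaling.
    key_params = torch.randn(160, 64)
    value_params = torch.randn(160, 64)

    # Drop the later-added rows on the guess that they encode the least
    # fundamental knowledge; some recovery fine-tuning would likely be needed.
    key_params = key_params[:128]
    value_params = value_params[:128]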


But after adding new rows, I think the entire network is retrained.



