Hacker News

In Big-O notation, O(2n) = O(n): constant factors are dropped, so a 2x slowdown doesn't change the asymptotic class. Two times slower is actually not that much in practice, either. If this slowdown results in better inference in the same number of training rounds, or better-tuned weights with fewer redundant features, that can be a very worthwhile sacrifice.
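To make the constant-factor point concrete, here's a minimal sketch (operation counts are illustrative, not measurements from the paper):

```python
def ops_single_pass(n):
    # one pass over n items: n operations -> O(n)
    return n

def ops_double_pass(n):
    # two passes over n items: 2n operations -> O(2n) = O(n)
    return 2 * n

# The ratio is a fixed constant (2) at every input size, which is
# exactly what Big-O notation discards.
for n in (10, 100, 1000):
    assert ops_double_pass(n) / ops_single_pass(n) == 2.0
```

Both functions scale linearly; doubling n doubles both counts, so their growth rates are identical even though one is always 2x the other.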

It's also a complex optimization problem, not just about compute. Two times the parameters take more than two times the time to tune, and two times the working memory to train and use. There are also plenty of model training scenarios where data throughput between the dataset and memory is the real bottleneck.

So, though I agree it is indeed a downside, I think it's a worthwhile sacrifice if the results they show are reproducible.



Glad to see your ideas here. Could you clarify a point for me? The W matrix in the paper is d_model x 2d. Does this mean a differential attention model doubles the W matrix of a standard attention model, which is d_model x d? E.g., if Llama 3 has a W of 8192 x 1024, does the diffattn model of the same architecture have a W of 8192 x (1024 x 2)?
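The shape question above can be sketched as follows. This is a rough illustration of the reading in the comment (variable names and the 8192/1024 dimensions are taken from the question, not verified against any released checkpoint); differential attention projects to 2d so the result can be split into two query/key sets:

```python
import numpy as np

d_model, d = 8192, 1024
x = np.random.randn(4, d_model)              # 4 tokens of width d_model

W_std  = np.random.randn(d_model, d)         # standard attention: d_model x d
W_diff = np.random.randn(d_model, 2 * d)     # differential attention: d_model x 2d

q = x @ W_diff                               # (4, 2048)
q1, q2 = np.split(q, 2, axis=-1)             # two (4, 1024) halves, one per softmax map
```

So under this reading, yes: the projection matrix doubles in width, and the two halves feed the two attention maps whose difference is taken.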


The Big-O for any transformer's attention is quadratic in sequence length regardless
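A quick sketch of why a constant factor from a second attention map doesn't change that quadratic term (operation counts are schematic, not profiled):

```python
def attn_score_ops(n, num_maps=1):
    # one multiply-accumulate per (query, key) pair, per attention map:
    # n * n pairs -> O(n^2) in sequence length n
    return num_maps * n * n

# two maps = a constant factor of 2, not a new complexity class
assert attn_score_ops(2048, num_maps=2) == 2 * attn_score_ops(2048)

# doubling the sequence length quadruples the work either way
assert attn_score_ops(4096) == 4 * attn_score_ops(2048)
```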



