Hacker News

In Big-O notation, O(2n) = O(n): constant factors are dropped, so a 2x slowdown doesn't change the asymptotic class. Two times slower is actually not that much in practice, either. If this slowdown results in better inference in the same number of training rounds, or better-tuned weights with fewer redundant features, that can be a very worthwhile sacrifice.
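To make the constant-factor point concrete, here's a minimal sketch (operation counts are illustrative, not measurements from the paper):

```python
def ops_single_pass(n):
    # one pass over n items: n operations -> O(n)
    return n

def ops_double_pass(n):
    # two passes over n items: 2n operations -> O(2n) = O(n)
    return 2 * n

# The ratio is a fixed constant (2) at every input size, which is
# exactly what Big-O notation discards.
for n in (10, 100, 1000):
    assert ops_double_pass(n) / ops_single_pass(n) == 2.0
```

Both functions scale linearly; doubling n doubles both counts, so their growth rates are identical even though one is always 2x the other.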

It's also a complex optimization problem, not just about compute. Two times the parameters take more than two times the time to tune, and two times the working memory to train and use. There are also plenty of model training scenarios where data throughput between the dataset and memory is the real bottleneck.

So, though I agree it is indeed a downside, I think it's a worthwhile sacrifice if the results they show are reproducible.



Glad to see your ideas here. Could you clarify a point for me? The W matrix in the paper is d_model x 2d. Does this mean a differential attention model doubles the W matrix of a standard attention model, which is d_model x d? E.g., if Llama 3 has a W of 8192 x 1024, does the diffattn model of the same architecture have a W of 8192 x (1024 x 2)?
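The shape question above can be sketched as follows. This is a rough illustration of the reading in the comment (variable names and the 8192/1024 dimensions are taken from the question, not verified against any released checkpoint); differential attention projects to 2d so the result can be split into two query/key sets:

```python
import numpy as np

d_model, d = 8192, 1024
x = np.random.randn(4, d_model)              # 4 tokens of width d_model

W_std  = np.random.randn(d_model, d)         # standard attention: d_model x d
W_diff = np.random.randn(d_model, 2 * d)     # differential attention: d_model x 2d

q = x @ W_diff                               # (4, 2048)
q1, q2 = np.split(q, 2, axis=-1)             # two (4, 1024) halves, one per softmax map
```

So under this reading, yes: the projection matrix doubles in width, and the two halves feed the two attention maps whose difference is taken.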


The Big-O for any transformer's attention is quadratic in sequence length regardless
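A quick sketch of why a constant factor from a second attention map doesn't change that quadratic term (operation counts are schematic, not profiled):

```python
def attn_score_ops(n, num_maps=1):
    # one multiply-accumulate per (query, key) pair, per attention map:
    # n * n pairs -> O(n^2) in sequence length n
    return num_maps * n * n

# two maps = a constant factor of 2, not a new complexity class
assert attn_score_ops(2048, num_maps=2) == 2 * attn_score_ops(2048)

# doubling the sequence length quadruples the work either way
assert attn_score_ops(4096) == 4 * attn_score_ops(2048)
```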



