Q, K, V are not the same. In self-attention, they are all computed by separate linear transformations of the same input (i.e., the previous layer's output). In cross-attention, even that is not true: K and V are computed by linear transformations of whatever is being cross-attended, while Q is computed by a linear transformation of the input as before.
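A minimal sketch of the point, assuming PyTorch; the module name and shapes here are just for illustration. The same three learned projections W_q, W_k, W_v are used in both cases; the only difference is where K and V take their input from:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Three distinct weight matrices: Q, K, V are *not* the same,
        # even when they are applied to the same input tensor.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.d_model = d_model

    def forward(self, x, context=None):
        # Self-attention: Q, K, V all projected from x (previous layer's output).
        # Cross-attention: Q projected from x, K and V projected from `context`
        # (e.g. the encoder output).
        kv_source = x if context is None else context
        q = self.w_q(x)
        k = self.w_k(kv_source)
        v = self.w_v(kv_source)
        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5
        return F.softmax(scores, dim=-1) @ v

attn = SimpleAttention(d_model=64)
x = torch.randn(2, 10, 64)         # decoder-side input
enc = torch.randn(2, 7, 64)        # hypothetical encoder output
self_out = attn(x)                 # K, V from x
cross_out = attn(x, context=enc)   # K, V from enc, Q still from x
```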
Yeah, a common misconception: because the input is the same, people forget that there is a separate pre-attention linear transformation for each of Q, K, and V (that's the decoder-only case; obviously the source of K and V differs in an encoder-decoder setup).