It’s mostly a convention. In many deep learning frameworks (PyTorch, TensorFlow, etc.), inputs are stored with the “batch × length × hidden-dim” shape, effectively making the token embeddings row vectors. Multiplying “xW” is then the natural shape-wise operation. On the other hand, classical linear algebra references often treat vectors as column vectors and write “Wx.”