It’s mostly a convention. In many deep learning frameworks (PyTorch, TensorFlow,... | Hacker News

Hacker Newsnew | past | comments | ask | show | jobs | submit

		t55 7 months ago \| parent \| context \| favorite \| on: DeepSeek's multi-head latent attention and other K... It’s mostly a convention. In many deep learning frameworks (PyTorch, TensorFlow, etc.), inputs are stored with the “batch × length × hidden-dim” shape, effectively making the token embeddings row vectors. Multiplying “xW” is then the natural shape-wise operation. On the other hand, classical linear algebra references often treat vectors as column vectors and write “Wx.”

anvuong 7 months ago [–]

Isn't batch-first a Pytorch thing? I started with Tensorflow and it's batch-last.

t55 7 months ago | [–]

TFv1 or TFv2? AFAIK it's batch-first in TFv2

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact