
It’s mostly a convention. In most deep learning frameworks (PyTorch, TensorFlow, etc.), inputs are stored with the shape “batch × length × hidden-dim”, which effectively makes each token embedding a row vector. Multiplying “xW” is then the shape-compatible operation. Classical linear algebra references, on the other hand, usually treat vectors as column vectors and write “Wx”.
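Concretely, the two conventions compute the same thing up to a transpose. A minimal PyTorch sketch (the shapes and names here are illustrative, not from any particular codebase):

    import torch

    batch, length, d_in, d_out = 2, 5, 8, 4

    # Row-vector convention: activations are (batch, length, d_in),
    # so the weight multiplies on the right.
    x = torch.randn(batch, length, d_in)
    W = torch.randn(d_in, d_out)
    y_rows = x @ W                    # (batch, length, d_out)

    # Column-vector convention from linear algebra texts: y = Wx,
    # with x a single (d_in, 1) column. Same map, transposed layout.
    x_col = x[0, 0].unsqueeze(-1)     # (d_in, 1)
    y_col = W.T @ x_col               # (d_out, 1)

    # The two agree up to a transpose.
    assert torch.allclose(y_rows[0, 0], y_col.squeeze(-1))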


Isn't batch-first a PyTorch thing? I started with TensorFlow, and it's batch-last there.


TFv1 or TFv2? AFAIK it's batch-first in TFv2.
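For what it's worth, the convention isn't even uniform within PyTorch: nn.LSTM and friends default to time-major input, (seq_len, batch, feature), and only go batch-first if you pass batch_first=True. A quick sketch:

    import torch
    import torch.nn as nn

    seq_len, batch, d = 5, 2, 8

    # Default: time-major input, (seq_len, batch, feature).
    lstm = nn.LSTM(input_size=d, hidden_size=16)
    out, _ = lstm(torch.randn(seq_len, batch, d))        # (seq_len, batch, 16)

    # batch_first=True: (batch, seq_len, feature).
    lstm_bf = nn.LSTM(input_size=d, hidden_size=16, batch_first=True)
    out_bf, _ = lstm_bf(torch.randn(batch, seq_len, d))  # (batch, seq_len, 16)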



