Your training input has the shape (sequence length × batch size). If many of your samples are shorter than the sequence length, as is usually the case, the input will contain a lot of padding tokens, which is wasted compute.
To compensate for that, you can pack multiple examples into the same sequence. This is where EOS and BOS come in: they indicate to the model that the packed parts of the sequence are not related.
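As a rough illustration, here is a minimal packing sketch in Python. The BOS, EOS, and padding token ids below are placeholders rather than values from any particular tokenizer, and real training code would typically also build attention masks so packed examples cannot attend to each other.

```python
from typing import List

BOS_ID = 1   # assumed beginning-of-sequence token id (placeholder)
EOS_ID = 2   # assumed end-of-sequence token id (placeholder)
PAD_ID = 0   # assumed padding token id (placeholder)


def pack_examples(examples: List[List[int]], seq_len: int) -> List[List[int]]:
    """Greedily pack tokenized examples into fixed-length sequences.

    Each example is wrapped as BOS ... EOS so the model can tell where one
    example ends and the next, unrelated one begins. Only partially filled
    sequences receive padding.
    """
    sequences: List[List[int]] = []
    current: List[int] = []

    for example in examples:
        wrapped = [BOS_ID] + example + [EOS_ID]
        # Examples longer than seq_len are truncated to keep the sketch simple.
        wrapped = wrapped[:seq_len]
        # If the next example does not fit, flush the current sequence first.
        if len(current) + len(wrapped) > seq_len:
            sequences.append(current + [PAD_ID] * (seq_len - len(current)))
            current = []
        current.extend(wrapped)

    if current:
        sequences.append(current + [PAD_ID] * (seq_len - len(current)))
    return sequences


if __name__ == "__main__":
    # Three short examples packed into sequences of length 12: far less padding
    # than giving each example its own fully padded row.
    for seq in pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], seq_len=12):
        print(seq)
```

With packing, the amount of padding depends only on how well examples fill each sequence, not on the length of the shortest example, which is where the compute savings come from.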