Your training input has the shape (sequence length × batch size). If many of your samples are shorter than the sequence length, as is usually the case, the input will contain a lot of padding tokens, which is wasted compute.
To compensate for that, you can pack multiple examples into the same sequence. This is where EOS and BOS come in: they indicate to the model that the packed parts of the sequence are not related.
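As a rough illustration, here is a minimal packing sketch in Python. The BOS, EOS, and padding token ids below are placeholders rather than values from any particular tokenizer, and real training code would typically also build attention masks so packed examples cannot attend to each other.

```python
from typing import List

BOS_ID = 1   # assumed beginning-of-sequence token id (placeholder)
EOS_ID = 2   # assumed end-of-sequence token id (placeholder)
PAD_ID = 0   # assumed padding token id (placeholder)


def pack_examples(examples: List[List[int]], seq_len: int) -> List[List[int]]:
    """Greedily pack tokenized examples into fixed-length sequences.

    Each example is wrapped as BOS ... EOS so the model can tell where one
    example ends and the next, unrelated one begins. Only partially filled
    sequences receive padding.
    """
    sequences: List[List[int]] = []
    current: List[int] = []

    for example in examples:
        wrapped = [BOS_ID] + example + [EOS_ID]
        # Examples longer than seq_len are truncated to keep the sketch simple.
        wrapped = wrapped[:seq_len]
        # If the next example does not fit, flush the current sequence first.
        if len(current) + len(wrapped) > seq_len:
            sequences.append(current + [PAD_ID] * (seq_len - len(current)))
            current = []
        current.extend(wrapped)

    if current:
        sequences.append(current + [PAD_ID] * (seq_len - len(current)))
    return sequences


if __name__ == "__main__":
    # Three short examples packed into sequences of length 12: far less padding
    # than giving each example its own fully padded row.
    for seq in pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], seq_len=12):
        print(seq)
```

With packing, the amount of padding depends only on how well examples fill each sequence, not on the length of the shortest example, which is where the compute savings come from.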