
This is a faithful reproduction of the original Transformer paper [1], except that these days we use trainable parameters for the positional embedding, whereas the paper used a fixed sine/cosine calculation.
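For anyone curious what that difference looks like in code, here is a minimal sketch of both approaches. The sizes (block_size, n_embd) are made up for illustration and aren't taken from the repo:

    import math
    import torch
    import torch.nn as nn

    block_size, n_embd = 1024, 768  # assumed sizes, purely illustrative

    # Learned positional embedding (the modern GPT-style approach):
    # one trainable vector per position, looked up by index.
    wpe = nn.Embedding(block_size, n_embd)
    pos = torch.arange(block_size)
    learned = wpe(pos)                        # (block_size, n_embd), trainable

    # Fixed sinusoidal encoding from the paper:
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pe = torch.zeros(block_size, n_embd)
    position = torch.arange(block_size, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, n_embd, 2).float()
                         * (-math.log(10000.0) / n_embd))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dims get sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dims get cosine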

Figure 1 in the paper is implemented in the forward() method of the GPT class in model.py. Here are the rough steps (a small PyTorch sketch of the whole pass follows the list):

1. Tokens are embedded using an nn.Embedding layer.
2. Positions are embedded using a second nn.Embedding layer.
3. The two embedding values are added to make the input x.
4. A sequence of N transformer blocks is then executed. This is the grey Transformer block in Figure 1, and it is where all the magic happens, chiefly in the self-attention calculation. You can see this in the forward() method of the CausalSelfAttention class.
5. A regular nn.Linear layer projects the final hidden states back to vocabulary-sized logits.
6. Finally, the loss is computed with F.cross_entropy, which applies softmax to the logits internally (shown as a separate softmax step in the figure); at inference time the output token probabilities come from that same softmax.
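Here is a rough, self-contained PyTorch sketch of those six steps. The class and attribute names (TinyGPT, wte, wpe, Block, lm_head) are my own shorthand, not necessarily what model.py uses, and it is stripped down (no dropout, no weight init, no weight tying):

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalSelfAttention(nn.Module):
        # Minimal causal self-attention: the heart of step 4.
        def __init__(self, n_embd, n_head, block_size):
            super().__init__()
            assert n_embd % n_head == 0
            self.n_head = n_head
            self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # Q, K, V in one projection
            self.c_proj = nn.Linear(n_embd, n_embd)
            # lower-triangular mask: a token may only attend to earlier tokens
            self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            q, k, v = self.c_attn(x).split(C, dim=2)
            q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))   # scaled dot product
            att = att.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)  # weighted sum of values
            return self.c_proj(y)

    class Block(nn.Module):
        # Transformer block: attention + MLP, each with a residual connection (pre-norm).
        def __init__(self, n_embd, n_head, block_size):
            super().__init__()
            self.ln1 = nn.LayerNorm(n_embd)
            self.attn = CausalSelfAttention(n_embd, n_head, block_size)
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))

        def forward(self, x):
            x = x + self.attn(self.ln1(x))
            x = x + self.mlp(self.ln2(x))
            return x

    class TinyGPT(nn.Module):
        def __init__(self, vocab_size, block_size, n_embd=64, n_head=4, n_layer=2):
            super().__init__()
            self.wte = nn.Embedding(vocab_size, n_embd)   # step 1: token embedding
            self.wpe = nn.Embedding(block_size, n_embd)   # step 2: positional embedding
            self.blocks = nn.ModuleList(
                Block(n_embd, n_head, block_size) for _ in range(n_layer))  # step 4
            self.lm_head = nn.Linear(n_embd, vocab_size)  # step 5: project to logits

        def forward(self, idx, targets=None):
            B, T = idx.shape
            pos = torch.arange(T, device=idx.device)
            x = self.wte(idx) + self.wpe(pos)             # step 3: add the embeddings
            for block in self.blocks:                     # step 4: N transformer blocks
                x = block(x)
            logits = self.lm_head(x)                      # step 5
            loss = None
            if targets is not None:
                # step 6: cross_entropy applies softmax internally; at inference time
                # probabilities would come from F.softmax(logits, dim=-1)
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            return logits, loss

You can smoke-test it with something like:

    model = TinyGPT(vocab_size=65, block_size=128)
    idx = torch.randint(0, 65, (1, 16))
    logits, loss = model(idx, targets=idx)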

I hope this helps a little. Please feel free to suggest improvements and additions.

[1] https://arxiv.org/pdf/1706.03762


