GPT looks a lot like an IIR filter that transforms a sequence of vectors. Edit: IIR filters are linear functions of N past inputs and N past outputs - the latter gives them "memory" and non-trivial signal-processing abilities. GPT is mostly linear and uses 8192 past inputs and outputs. I'd be tempted to introduce a 3rd sequence - an "internal buffer" of 8192 tokens - that GPT updates even with null inputs, a process that would correspond to "thinking".
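To make the IIR part of the analogy concrete, here's a minimal sketch (plain Python/NumPy, toy names and coefficients of my own choosing) of an order-N difference equation: each output is a linear combination of the last N inputs and the last N outputs, and the feedback term is what gives the filter its "memory".

```python
import numpy as np

def iir_step(x_hist, y_hist, b, a):
    """One step of an order-N IIR filter:
    y[n] = sum_k b[k]*x[n-k] + sum_k a[k]*y[n-k]
    x_hist, y_hist: most-recent-first buffers of the last N inputs/outputs."""
    return np.dot(b, x_hist) + np.dot(a, y_hist)

# Toy run: impulse response of a small order-4 filter.
N = 4
b = np.array([0.5, 0.25, 0.125, 0.0625])  # feed-forward (past inputs)
a = np.array([0.3, 0.0, 0.0, 0.0])        # feedback (past outputs -> "memory")
x_hist = np.zeros(N)
y_hist = np.zeros(N)
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:
    x_hist = np.roll(x_hist, 1); x_hist[0] = x   # push newest input
    y = iir_step(x_hist, y_hist, b, a)
    y_hist = np.roll(y_hist, 1); y_hist[0] = y   # push newest output
    print(round(y, 4))
```

In the analogy, the fixed b/a coefficients are replaced by GPT's (mostly linear) attention over a context of 8192 past tokens of inputs and outputs, and the proposed "internal buffer" would be a third history the model keeps updating even when the input x is null.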