I wasn't disputing that it's feed-forward. I just meant that stacked transformer layers can be thought of as iterative refinement of the intermediate activations: because of the residual connections, each layer computes an additive update to the same hidden state rather than replacing it wholesale. That's not the same as an autoregressive process that receives previous outputs as inputs, but it's far more expressive than a single transformer layer.
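
To make the "iterative refinement" reading concrete, here's a minimal PyTorch sketch (the sizes and module choice are just illustrative, not anyone's actual model): the stack repeatedly updates one hidden state, but nothing ever loops an output back in as a new input.

```python
import torch
import torch.nn as nn

d_model, n_heads, depth = 64, 4, 6

# A stack of distinct layers, each with residual connections inside.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    for _ in range(depth)
)

x = torch.randn(1, 10, d_model)  # (batch, seq_len, d_model)

h = x
for layer in layers:
    # Each layer internally computes h + attn(h) and h + mlp(h),
    # i.e. an additive update to the same hidden state.
    h = layer(h)

# Contrast with autoregression: here the loop is over *layers* within one
# feed-forward pass; an autoregressive loop would feed h's decoded output
# back in as the next input token. The stack above never does that.
```

So depth plays the role of refinement steps on the activations, which is why a deep stack is more expressive than one layer even though the whole thing remains a single feed-forward pass.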