
I wouldn't say we can't, but it would be much harder. The models were trained and optimized for next-token prediction. It is possible to chop the output layers off and replace them with something else; this is often how more open models like BERT are adapted to tasks such as classification and sentiment analysis. But pulling semantically meaningful information out of the internal states of the model is tricky, because nothing in the architecture or training objective forces it to develop internal representations that are particularly interpretable or that map cleanly onto some other task.
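For a concrete sense of what "chopping the output layers off" looks like in practice, here's a minimal sketch using the Hugging Face transformers library. It loads a pretrained BERT encoder and attaches a fresh classification head in place of the original pretraining head; the model name and number of labels are just placeholders for illustration, and the head is useless until fine-tuned.

  import torch
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  # Loads the pretrained encoder and attaches a randomly initialized linear
  # classification head; the original masked-LM head is discarded.
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=2
  )

  inputs = tokenizer("This movie was great!", return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits  # shape (1, 2); meaningless until fine-tuned
  predicted_class = logits.argmax(dim=-1).item()

The point is that the new head only sees the encoder's final hidden states; whether those states carry information useful for your task is an empirical question, not something the pretraining objective guarantees.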

Those internal representations may not be all that stable across model versions, either. I would not just assume that knowing how to interpret the attention heads in the base model of GPT-4 tells you anything about what the corresponding attention heads are doing in GPT-4t or GPT-4o.


