It depends wildly (really, that wildly) on what exactly you're doing with them.
One of the biggest problems with practical applications of generative AI right now is that it's basically impossible to tell which models are really good at which things without trying that specific task. There are some generalizations (e.g. you can measure more abstract metrics like capacity for spatial reasoning, and they do affect performance in ways you'd expect), but there's far more uncertainty than those metrics capture.
This is also why many people get so upset when companies retire models. Even if the replacement is seemingly better on the metrics, it's not a given that it's better at your specific thing. Or it may be better, but only if you write a completely different prompt, and, again, the only way to discover that magic correct prompt is through experimentation. Which is why it feels less like engineering and more like shamanism a lot of the time.
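To make the "just try it on your task" point concrete, here's a minimal sketch of the kind of throwaway eval harness I mean: a handful of your own task examples, a pass/fail check per example, and a loop over models and prompt variants. The model names, prompt templates, and the `call_model` stub are all placeholders (not any particular vendor's API), and the checks would obviously be whatever matters for your task.

```python
# Minimal per-task eval sketch. Everything here is a placeholder:
# swap call_model for your real client, and the examples/checks
# for the things you actually care about.

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in: replace with a real API call.
    return f"[{model}] response to: {prompt[:40]}"

TASK_EXAMPLES = [
    # (input, a check the output must satisfy for *your* task)
    ("Summarize: ...", lambda out: len(out) < 400),
    ("Extract the year from: ...", lambda out: "2024" in out),
]

PROMPTS = {
    "terse": "Answer directly.\n\n{task}",
    "step_by_step": "Think step by step, then answer.\n\n{task}",
}

def score(model: str, template: str) -> float:
    passed = 0
    for task, check in TASK_EXAMPLES:
        out = call_model(model, template.format(task=task))
        passed += bool(check(out))
    return passed / len(TASK_EXAMPLES)

for model in ("old-model", "new-model"):  # hypothetical model names
    for name, template in PROMPTS.items():
        print(f"{model} / {name}: {score(model, template):.0%}")
```

Nothing fancy, but a table like that for your own tasks tells you more than any public benchmark will, and it's the only way to find out whether the "better" replacement model actually is better for you.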