Depends on the loss function. A softmax final activation fed into a cross-entropy loss (or KL divergence) gives probability-like predictions. That's a very common setup, but there are many others that don't have this property; I figure that's what you mean by "almost always". You can also do variational inference, where the network predicts a distribution rather than a point estimate (usually a Gaussian, so two output values per prediction, with something like a sigmoid or softplus keeping the scale positive) and trains against a distributional loss such as Wasserstein; among other things, this lets you get confidence intervals out of the model.
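A minimal sketch of both setups, assuming PyTorch (the comment doesn't name a framework); the `GaussianHead` class and layer sizes are made up for illustration, and the distributional loss itself is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Setup 1: softmax + cross entropy -> probability-like outputs ---
classifier = nn.Linear(16, 3)           # toy classifier: 16 features -> 3 classes
x = torch.randn(4, 16)                  # batch of 4 examples
targets = torch.tensor([0, 2, 1, 0])

logits = classifier(x)
# cross_entropy applies log-softmax internally, so training pushes
# softmax(logits) to behave like a class-probability estimate.
loss = F.cross_entropy(logits, targets)
probs = F.softmax(logits, dim=-1)       # each row sums to 1

# --- Setup 2: predict the parameters of a distribution per example ---
# Two outputs per prediction (mean and scale), so each prediction is a
# Gaussian rather than a point estimate.
class GaussianHead(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.mean = nn.Linear(in_features, 1)
        self.log_std = nn.Linear(in_features, 1)  # exponentiated to keep the scale positive

    def forward(self, x):
        return self.mean(x), self.log_std(x).exp()

head = GaussianHead(16)
mu, sigma = head(x)
# A rough 95% confidence interval for each prediction:
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
```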