Depends on the loss function. A softmax final activation fed into a cross-entropy loss (or KL divergence) gives probability-like predictions. That's a very common setup, but there are many others that don't have this property; I figure that's what you mean by "almost always". You can also do variational inference, where the network predicts a distribution rather than a point estimate (usually a Gaussian, so two output values per prediction, with something like a sigmoid or softplus keeping the scale positive) and trains against a distributional loss such as Wasserstein; among other things, this lets you get confidence intervals out of the model.
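A minimal sketch of both setups, assuming PyTorch (the comment doesn't name a framework); the `GaussianHead` class and layer sizes are made up for illustration, and the distributional loss itself is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Setup 1: softmax + cross entropy -> probability-like outputs ---
classifier = nn.Linear(16, 3)           # toy classifier: 16 features -> 3 classes
x = torch.randn(4, 16)                  # batch of 4 examples
targets = torch.tensor([0, 2, 1, 0])

logits = classifier(x)
# cross_entropy applies log-softmax internally, so training pushes
# softmax(logits) to behave like a class-probability estimate.
loss = F.cross_entropy(logits, targets)
probs = F.softmax(logits, dim=-1)       # each row sums to 1

# --- Setup 2: predict the parameters of a distribution per example ---
# Two outputs per prediction (mean and scale), so each prediction is a
# Gaussian rather than a point estimate.
class GaussianHead(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.mean = nn.Linear(in_features, 1)
        self.log_std = nn.Linear(in_features, 1)  # exponentiated to keep the scale positive

    def forward(self, x):
        return self.mean(x), self.log_std(x).exp()

head = GaussianHead(16)
mu, sigma = head(x)
# A rough 95% confidence interval for each prediction:
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
```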