>(Energy minimization is a very old idea. LeCun has been pushing it for years, and it's far less controversial these days. Back when everyone wanted a probabilistic interpretation of neural models, the sticking point was that computing the normalization term (the partition function) is expensive. Energy minimization basically said: skip that, set up a sensible loss and minimize it.)
Ehhhh, energy-based models aren't trained by just minimizing a simple loss averaged over the training data. Classic EBMs (e.g., RBMs) are trained via contrastive divergence, precisely because the gradient of the partition function requires samples from the model itself.
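For anyone who wants the reason in symbols, here's a rough sketch of the maximum-likelihood gradient that contrastive divergence approximates (standard EBM notation, not from the thread). The intractable term is exactly the partition-function gradient the parent comment mentions.

```latex
% Maximum-likelihood gradient for an EBM with density
% p_\theta(x) = e^{-E_\theta(x)} / Z(\theta):
\nabla_\theta \log p_\theta(x)
  = -\nabla_\theta E_\theta(x)
  + \mathbb{E}_{x' \sim p_\theta}\!\left[ \nabla_\theta E_\theta(x') \right]
% The second term is the gradient of -\log Z(\theta). It is an expectation
% under the model's own distribution, so it can't be computed as a simple
% average over the training set. CD-k approximates it with samples obtained
% by running k MCMC steps (Gibbs sampling for RBMs) initialized at the
% training point x.
```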