In the late 90s/early 2000s, the mainstream thinking around numerical optimization was that it was easy-ish when the problem was linear, and that if you had to rely on nonlinear optimization you were basically lost. People did EM (expectation-maximization, an earlier subgenre of what is now called Bayesian learning) but knew that it was sensitive to initialization and that they probably didn't hit a good enough maximum. Neural networks at the time were basically a parlor trick - you could make them do little tricks, but almost everything we have now, including lots of compute, good initialization, regularization techniques, and pretraining, was absent in the late 90s.
Then, in the mid-to-late 2000s, the mainstream method was convex optimization: you had a proof that there was a single global optimum, and a wide range of optimization methods were guaranteed to reach it from essentially any initialization point. Simultaneously, the theory underlying SVMs and CRFs was developed - showing that you could actually do a large variety of things and still use these easy, dependable optimization techniques. And people hammered home the need for regularization.
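To make that guarantee concrete, here is a minimal sketch (mine, not from the original post; the toy data, regularization strength, and step sizes are made up): an L2-regularized linear SVM trained with plain (sub)gradient descent lands at essentially the same objective value no matter how it is initialized, precisely because the objective is convex.

```python
# Minimal illustrative sketch: L2-regularized linear SVM (a convex objective)
# trained by subgradient descent from two very different initializations.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # toy features
y = np.sign(X @ np.array([1., -2., 0.5, 0., 3.]))   # toy labels in {-1, +1}

def svm_objective(w, lam=0.1):
    # mean hinge loss + L2 penalty
    return np.mean(np.maximum(0.0, 1 - y * (X @ w))) + lam * w @ w

def train(w, lam=0.1, lr=0.01, steps=5000):
    for _ in range(steps):
        margins = 1 - y * (X @ w)
        active = (margins > 0).astype(float)
        # subgradient of the hinge term plus gradient of the L2 term
        grad = -(active * y) @ X / len(y) + 2 * lam * w
        w = w - lr * grad
    return w

w1 = train(rng.normal(size=5))          # small random start
w2 = train(100 * rng.normal(size=5))    # wildly scaled start
print(svm_objective(w1), svm_objective(w2))  # (almost) identical values
```

The point is not the particular solver - it is that with a convex loss you can be sloppy about where you start and still trust where you end up.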
In the late 2000s to early 2010s, several things again came together: one was the discovery of dropout and the understanding that it acts as a regularization technique; another was the development of good initializers that made it possible to train deeper networks. Add to that improved compute power - including CUDA, which turned hardware originally built to speed up graphics and texture computation into the general-purpose GPU computing we know today.
All this enabled a rediscovery of NN learning, which could take off where linear learning methods (SVMs, CRFs) had plateaued. Often a DNN did what the earlier linear classifier did but could learn features on top of that - so it could be seen as finding a strictly better solution.
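Both of those ingredients - dropout and good initializers - are now one-liners in modern frameworks. A minimal sketch, assuming PyTorch (the layer sizes and dropout rate are illustrative, not from the post):

```python
# Sketch: dropout as a regularizer and a ReLU-appropriate (Kaiming/He)
# initializer, as packaged in a modern framework.
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self, d_in=100, d_hidden=256, d_out=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p=0.5),          # dropout as regularization
            nn.Linear(d_hidden, d_out),
        )
        # "Good initializers": Kaiming/He init suited to ReLU layers.
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

model = SmallNet()
model.train()   # dropout active during training
model.eval()    # dropout disabled at evaluation time
```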
But the lack of a global optimum means that - even with good initializers and regularization packaged into the NN modules of modern DNN software - the whole thing is way more finicky than CRFs ever were. (It would be wrong to say that CRFs are trivial to implement or never finicky, just as many well-understood NN architectures have a good out-of-the-box experience in TF/PyTorch etc. - so take this as a general statement that may not hold in every case.)