This can also be seen as the original source of patch-based denoising and related methods. In the end it's about capturing the scaling properties and self-similarity of natural images. This is also, for example, why wavelets were so effective as a basis.
David Mumford particularly did some great work on this sort of thing a couple of decades ago, along with many others. I hope when people are rushing around trying to apply convolutional nets to everything they aren't losing these insights.
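For anyone who hasn't touched the older methods: the wavelet point is basically that natural images are sparse in a multiscale basis, so you can denoise just by shrinking the small coefficients. Here's a minimal sketch of that idea using PyWavelets; the wavelet choice, the universal-threshold rule, and the function name are only illustrative, not anything from a particular paper:

```python
import numpy as np
import pywt

def wavelet_denoise(img, wavelet="db4", level=3, sigma=None):
    # Decompose the image into a multiscale wavelet representation.
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    # If no noise level is given, estimate it from the finest-scale
    # diagonal detail coefficients with the robust median estimator.
    if sigma is None:
        sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745
    # Universal threshold; soft-threshold every detail subband,
    # leaving the coarse approximation untouched.
    thresh = sigma * np.sqrt(2 * np.log(img.size))
    shrunk = [coeffs[0]] + [
        tuple(pywt.threshold(c, thresh, mode="soft") for c in detail)
        for detail in coeffs[1:]
    ]
    # Reconstruct from the shrunk coefficients and crop to the input size
    # (reconstruction can be a pixel larger for odd dimensions).
    out = pywt.waverec2(shrunk, wavelet)
    return out[: img.shape[0], : img.shape[1]]
```

The shrinkage step is where the self-similarity assumption lives: small coefficients at fine scales are presumed to be noise rather than structure.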
The benefit of CNNs is like the benefit of SVMs -- they generalize all the great old techniques so you don't have to understand them all; you just throw more CPU at the optimization problem.
I don't think that's true particularly in this case.
This paper is pointing out that you can encode a structural prior in a CNN - but knowing the "great old techniques" will help you design the right network architecture to do that.
SVMs were a surprise when they came out, not so much a generalization as a challenge (at least at first).
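To make the "structural prior in a CNN" point concrete, here's a rough sketch of the kind of experiment I take the paper to be describing: fit an untrained conv net to a single noisy image and stop early, so the architecture itself (rather than any training data) supplies the prior. The architecture, input size, step count, and learning rate below are made up for illustration and aren't taken from the paper:

```python
import torch
import torch.nn as nn

def fit_structural_prior(noisy, steps=2000, lr=0.01):
    # noisy: (1, C, H, W) tensor holding the corrupted image.
    # A small, randomly initialized conv net; its inductive bias toward
    # smooth, locally correlated outputs is the "structural prior".
    net = nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, noisy.shape[1], 3, padding=1),
    )
    # Fixed random code fed to the network throughout optimization.
    z = torch.randn(1, 32, noisy.shape[2], noisy.shape[3])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        out = net(z)
        # Fit the single noisy image; stopping early keeps the net from
        # memorizing the noise, so the output is effectively denoised.
        loss = ((out - noisy) ** 2).mean()
        loss.backward()
        opt.step()
    return net(z).detach()
```

The point being: which architecture you pick (depth, filter sizes, up/downsampling) is exactly where knowledge of the old multiscale/self-similarity ideas pays off.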