Sure, but n-gram feature extraction is what, five lines of code? It's a trivial transform compared to SIFT.
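To make the "five lines" point concrete, here is a rough sketch of word-level n-gram extraction in plain Python. The names `ngrams` and `ngram_counts` are made up for illustration, and the lowercase-and-split tokenizer is an assumption; a real pipeline would probably tokenize more carefully.

```python
from collections import Counter

def ngrams(tokens, n):
    # Every contiguous run of n tokens, as a tuple (empty list if too short).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n=2):
    # Hypothetical helper: naive tokenization (lowercase + whitespace split),
    # then count how often each n-gram occurs.
    return Counter(ngrams(text.lower().split(), n))
```

Calling `ngram_counts("the cat sat on the cat", n=2)` gives you a bag-of-bigrams you can feed straight to a classifier, which is about all the "feature engineering" there is.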
If you don't do SIFT manually prior to classification, then your NN has to learn something "similar" on its own in order to work, which is why it needs to be deep.