Thanks for the detailed comment! It's interesting that simple and classical tech...

syllogism · on July 9, 2016

Imagine looking at an image as a "bag of pixels": shuffle them all up, and look at the resulting image. What do you see? Nothing useful, right?

Now look at a bag-of-words view of a movie review:

    set(['and', 'predecessor,', 'immersive;', 'script,', 'is', 'an', 'engaging', 'as', 'home', 'still', 'its', 'film', 'identity.', 'puts', 'dazzling', 'issues', 'visually', 'colorful', 'While', 'not', 'spin', 'on', 'of', 'while', 'the', 'predictable,'])

It's not certain, but there's definitely a lot more information there.

The predict loop of a linear model works like this (written with sparse vectors, implemented as dictionaries):

    def predict(classes, weights, features):
        scores = {clas: 0 for clas in classes}
        for feature in features:
            for clas, weight in weights[feature].items():
                scores[clas] += weight
        return max(scores, key=lambda clas: scores[clas])

This function is the same for Naive Bayes, Maximum Entropy, linear-kernel SVM, Averaged Perceptron...etc.

All you get to do is attach a weight to each feature, which you'll sum. You can make the features larger or smaller slices of the input, but that's all the structure you get.

Note that linear models have massive disadvantages for speech recognition, too. Linear models don't work very well if you have an analog signal.

rcpt · on July 10, 2016

I think bag of pixels would be more analogous to a bag of characters. A bag of words is more like a bag of SIFT features.

msandford · on July 10, 2016

Sure but n-gram feature extraction is what, five lines of code? It's a trivial transform compared to SIFT.

If you don't do SIFT manually prior to classification then your NN has to evolve something "similar" in order to work. Which is why it needs to be deep.

Chris2048 · on July 11, 2016

"This movie make all other movies look awful - do not fail to see, missing it would be a crime"

Is this a negative, or a positive review?

awful -1 fail -1 miss[ing] -1

Seems pretty negative to me...

You can build a fairly accurate nudity filter by detecting % skin-tone pixels, but that extra mile to distinguish bikini/beach pics from nudes is the real crux.

Smerity · on July 9, 2016

That's a really good question and why the initial promise of deep learning was best shown in vision. It is only recently that deep learning in text has given enough performance gains to justify the (large) computational overhead.

For many languages, and many tasks performed on these languages, features have already been discretely provided. If you take English, split on the space character, and then throw those into a standard classifier (Naive Bayes, SVM, logistic regression), you're likely to get a reasonable result. If you add n-grams, so a little bit of collocation information, you tend to do even better. For most tasks, you may even make the simplifying assumption that features don't even interact at all. Even naive use of these methods will get you 90% of the way in many NLP tasks with very little computational overhead (as is evidence in this article).

For other languages, generating those features is a more complicated matter. Chinese is written without spaces between words for example. Extracting such discrete features from there requires an extra step in the pipeline. The way in which you split the characters can also depend on context, requiring an interaction between these components. This interaction is very iffy when it's in a pipeline and you don't have gradients flowing end to end, as is the case with most deep learning systems.

Images are far more like the latter case. There are few discrete features that can be easily extracted from images. Images are also scarily high dimensional - the number of pixels, the number of colours, the interaction between pixels, etc. Using human generated methods of feature extraction on this is fraught with a lot of complicated hand tuning and a lot of misses.

tldr; Text, by virtue of being human generated, is far more structured and far more amenable to easily extracting discrete representative features. For images, we just never really found a highly effective way of producing discrete representative features by hand. Deep learning, by virtue of providing the feature extraction in an automated manner directed by the loss on the task, has helped solve a major issue that plagued computer vision.

taeric · on July 10, 2016

It isn't that different than some tricks in image recognition. Look at basic training schemes for digit recognition. A feature you can start with is basically how many pixels are used.

The basic idea is how can you start reducing the pattern space. So, for images, don't consider all colors and only consider brightness.