mattwilson's comments

mattwilson · on Sept 9, 2015

The Factorization Machine term can be computed in O(kn) time, where k is a hyper-parameter of the model and n is the number of features. The features are embedded in a k-dimensional space, so there is nothing infinite-dimensional going on with FM.
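To make the O(kn) claim concrete, here is a minimal NumPy sketch. It assumes the usual FM setup, where the pairwise term is the sum over i < j of <v_i, v_j> x_i x_j and each feature i has an embedding v_i stored as row i of a matrix V. The names `fm_pairwise_naive`, `fm_pairwise_fast`, `x` and `V` are made up for illustration and are not from any particular library.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """Sum over all pairs i < j of <v_i, v_j> * x_i * x_j. Costs O(k n^2)."""
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity in O(k n), using
    0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]."""
    xv = V.T @ x                  # shape (k,): sum_i v_if * x_i for each factor f
    x2v2 = (V ** 2).T @ (x ** 2)  # shape (k,): sum_i v_if^2 * x_i^2 for each factor f
    return 0.5 * float(np.sum(xv ** 2 - x2v2))

# Illustrative random data (assumed sizes: n = 50 features, k = 4 factors).
rng = np.random.default_rng(0)
n, k = 50, 4
x = rng.normal(size=n)
V = rng.normal(size=(n, k))

print(np.isclose(fm_pairwise_naive(x, V), fm_pairwise_fast(x, V)))
```

The fast version only ever touches each entry of V a constant number of times, which is where the O(kn) comes from. With sparse x you would only loop over the non-zero features, so in practice it is even cheaper.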

mattwilson · on Sept 9, 2015

Without factorization machines, you would have O(n^2) quadratic weights w_{i,j} to learn. In a setting like online advertising, where you might have millions of features (i.e. n is large), this is bad for two reasons. First, it makes training your model very slow. Second, you won't have enough observations to learn all the weights w_{i,j} well. So using all quadratic weights in the online advertising setting is not feasible. But we still want to include feature interaction information. Enter FM.
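To put rough numbers on this, compare the two parametrizations side by side. The notation (w_0, w_i, v_i, x_i) and the values n = 10^6, k = 10 are assumptions picked for illustration, not figures from a real system.

```latex
% Full quadratic model: one free weight per feature pair
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{i,j}\, x_i x_j ,
\qquad \#\{w_{i,j}\} = \frac{n(n-1)}{2}

% Factorization machine: w_{i,j} is replaced by an inner product of k-dimensional embeddings
w_{i,j} \approx \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f}\, v_{j,f} ,
\qquad \#\{v_{i,f}\} = nk

% Illustrative sizes: n = 10^6, k = 10
\frac{n(n-1)}{2} \approx 5 \times 10^{11}
\qquad \text{vs.} \qquad
nk = 10^{7}
```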

To answer your question, FM is good for two main reasons. First, as a result of some mathematical magic (an algebraic rearrangement of the interaction term), the cost per example is O(kn), i.e. linear in the number of features, just like a model without interaction terms, only larger by a factor of k. Second, each of the n features gets embedded in an inner product space, with similar features ending up close to one another in some sense. This allows us to make a decent estimate of the interaction between two features even if they rarely appear together in the training data.
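For the curious, the "magic" is just expanding a square. In the sketch below, v_i is the assumed name for the k-dimensional embedding of feature i and x_i is its value.

```latex
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j
  = \frac{1}{2} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \langle v_i, v_j \rangle\, x_i x_j
      - \sum_{i=1}^{n} \langle v_i, v_i \rangle\, x_i^2 \right)
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^{2}
      - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2} \right]
```

Each inner sum over i costs O(n) and there are k of them, so the whole term is O(kn) instead of O(kn^2).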