
The news isn't that there is a correlation but that there is such a strong correlation. There are a bunch of specific techniques Google could be using and it looks likely that this is close to what they actually use.

They also use a lot of other ranking factors beyond just the words on the page so seeing such a high correlation from a "bag of words" model is pretty interesting (to me at least).



The correlation really isn't all that high.

If I've read the graph right, it's about 0.33. For a Pearson (product-moment) correlation coefficient, that would mean that about 10% of the variance in Google rankings (0.33 squared is roughly 0.11) is explained by a linear regression on LDA scores. They've actually used the Spearman (ranking-based) correlation coefficient, which is equivalent to ranking all the values of each variable from 1..N and then computing the Pearson correlation coefficient for the ranks. So, kinda-sorta with lots of handwaving, that means that about 10% of the ordering of the Google rankings is explained by the LDA scores.
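To make the arithmetic concrete, here is a minimal sketch in plain Python of Spearman computed as "Pearson on the ranks", plus the squared-correlation step. The helper names (`pearson`, `ranks`, `spearman`) and the sample numbers are made up for illustration; they are not from the original study, and the 0.33 is only my reading of the graph.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: any monotone relationship gives Spearman = 1,
# even when the relationship is not linear.
scores = [1, 2, 3, 4]
squared = [1, 4, 9, 16]
rho = spearman(scores, squared)

# Share of variance "explained" by a correlation of 0.33 (value read off the graph).
explained = 0.33 ** 2
print(f"rho = {rho:.2f}, explained = {explained:.2f}")
```

The `explained` figure is where the "about 10%" comes from; for Spearman it describes the ranks rather than the raw values, hence the handwaving.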

Clearly that's a lot better than for the other scoring methods they mentioned, and that probably indicates that Google are doing something a bit like LDA (but this will be true for any approach that takes note of synonyms, and it's hardly news that Google do that). But it doesn't, e.g., suggest that PageRank and other things based on link structure aren't extremely important to Google's rankings.


Google talks about using over 200 ranking factors so I think a correlation this high with a single factor is actually quite interesting. Especially given that we are seeing a correlation with a model that is undoubtedly more naive than what Google is actually using.

It is also fascinating to see this much correlation with an on-page factor which is entirely in the webmaster's control. Previously the highest correlations had been with link metrics.

YMMV but this research is interesting to me as someone who works in this field.


This post should be upvoted a zillion times. The correlations they report are really quite low and as such their claims are really quite bogus.


I don't know if you saw my comment (looks like we posted at similar times). I'd be interested in your thoughts on it: given how many ranking factors there are, this is quite a high correlation for a single one (I think).


(Thanks for the reply)

Did you test non-LDA methods? Because to me it looked like a correlation between a set of related words and ranking for a topic related to those words.

Without testing non-LDA methods, I can't see what you've proved.


Yes. The chart in the post shows how low the correlation is for tf-idf. I believe the original also showed similarly poor results for LSI etc.



