Hacker News

That's somewhat similar to the second mini-project of my NLP course (which I delivered just a week ago).

What we had to do was, first, gather a large corpus (in our case, of Portuguese). Then, compute the frequencies of each unigram and bigram. Then, process those frequencies and output the top 10 and bottom 10 collocations of the corpus. Collocation definition from Wikipedia: "Within the area of corpus linguistics, collocation defines a sequence of words or terms that co-occur more often than would be expected by chance".

The method used to find the collocations was based on a chi-square test: bigrams with the highest chi-square values were the most likely collocations (provided the value exceeded the critical value needed to reject the null hypothesis that the two words are independent).

So I actually have Ruby code that I could adapt with only minor changes for this, but there are probably cleverer methods.
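For the curious, the core of the approach looks roughly like the sketch below (this isn't my project code, just a minimal reconstruction; it assumes a whitespace-tokenized corpus, no smoothing, and the 0.05 critical value of 3.841 for one degree of freedom):

```ruby
# Pearson's chi-square critical value at p = 0.05 with 1 degree of freedom.
CRITICAL_VALUE = 3.841

# Returns [[w1, w2], chi2] pairs for bigrams scoring above the critical
# value, sorted from most to least collocation-like.
def collocations(text)
  tokens   = text.downcase.split
  unigrams = Hash.new(0)
  bigrams  = Hash.new(0)
  tokens.each { |t| unigrams[t] += 1 }
  tokens.each_cons(2) { |a, b| bigrams[[a, b]] += 1 }

  n = bigrams.values.sum.to_f
  scored = bigrams.map do |(w1, w2), o11|
    # 2x2 contingency table for the bigram (w1, w2):
    o12 = unigrams[w1] - o11      # w1 followed by something else
    o21 = unigrams[w2] - o11      # something else followed by w2
    o22 = n - o11 - o12 - o21     # neither word in that position
    num = n * (o11 * o22 - o12 * o21)**2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    chi2 = den.zero? ? 0.0 : num / den
    [[w1, w2], chi2]
  end
  scored.select { |_, chi2| chi2 > CRITICAL_VALUE }
        .sort_by { |_, chi2| -chi2 }
end
```

The unigram counts stand in for "occurrences of w1 as the first word" and "w2 as the second word", which is off by at most one at the corpus boundaries; fine for a sketch, and standard in practice.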


