From the website, here are some example "documents":
roger heavens trinity cricketers willie sugg early cambridge cricket giles phillips cambridge university cricket
roger hornetttom pridmore 020 7399 4270 collins stewartalan geeves 020 7523 8800 buchanan communicationscharles ryland /isabel podda 020 7466 5000
roger hubbold aig .1 force-field navigation
Now, what is the point in trying to generate any kind of "meaning" from those documents if they consist of completely meaningless gibberish?
As I was reading this challenge, I immediately thought of spam filtering / youtube comment classification ("smartness" classification) / etc. as a potentially useful application of this technology.
For example if each "document" is a youtube comment, then in theory you could write an algorithm to examine each comment and output a "smartness guess" for each. Then you (as in, you personally, by hand) would look at the results and specify your own "smartness rating" for a few comments. Then you'd run an algorithm to look at the difference between your specified "smartness rating" and the "smartness guess". Then, using that difference, it would tweak the settings in the original algorithm until it outputs a "smartness guess" that more closely fits your "smartness rating". If you repeat that process enough times, and your original algorithm has enough modifiers to tweak, then you might wind up with an algorithm that can make a pretty good guess about whether any given youtube comment is retarded or not.
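Something like this rough Python sketch of that tweak-the-settings loop (the features, comments, and numbers here are made up, just to show the shape of it):

# Rough sketch of the tuning loop described above; the features and learning
# rate are made-up placeholders, not a real "smartness" model.

def features(comment):
    words = comment.split()
    return [
        len(words),                                            # comment length
        sum(w.isupper() for w in words) / max(len(words), 1),  # SHOUTING ratio
        comment.count("!"),                                    # exclamation marks
    ]

def smartness_guess(comment, weights):
    return sum(w * f for w, f in zip(weights, features(comment)))

def tune(hand_rated, weights, lr=0.001, rounds=500):
    """hand_rated: list of (comment, your hand-assigned smartness rating) pairs."""
    for _ in range(rounds):
        for comment, rating in hand_rated:
            error = smartness_guess(comment, weights) - rating
            # tweak each setting a little in the direction that shrinks the error
            weights = [w - lr * error * f for w, f in zip(weights, features(comment))]
    return weights

weights = tune([("GREAT VIDEO!!!!!", 1.0),
                ("Interesting point about the sensor noise.", 8.0)],
               weights=[0.0, 0.0, 0.0])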
And that's just one example of a practical application for this kind of thing.
That said, if the input "documents" are completely and utterly meaningless, then there does not seem to be any point in trying to build "meaning" from those inputs. (Garbage in, garbage out.)
It's possible he's randomly generated the docs or something. Which, I agree, is a bit lame. It's nice to get optical confirmation that what you're doing is correct, and to get some real insight into language.
Much better would have been to take real sentences from Wikipedia.
No. Grandparent has picked out a few pathological examples out of many good ones.
Here is a uniform, random sample of documents:
faststretch details
honourable judge
oriental museum http
london wc 1n 3 bg pdsg seminar
1857-8 indian mutiny
vera bell
katherine r.b
rare books collection
novy arbat street
universitys case
The data set consists of the unique terms that occur in a crawl of .uk.
Here is how I constructed the data set:
I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn treebank word tokenizer, and performed term extraction with topia.termextract, discarding terms that were single words.
I then lowercased the terms, sorted them, and uniqued them, to give the dataset.
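If anyone wants to reproduce just that last normalisation step, it boils down to something like this (a sketch only; it assumes the splitta / Penn tokenizer / topia.termextract stages have already written one candidate term per line to terms.txt):

# Sketch of the final normalisation stage only: lowercase, drop single-word
# terms, unique, sort. The upstream extraction (splitta, Penn treebank
# tokenizer, topia.termextract) is assumed to have produced terms.txt already.

def build_dataset(in_path="terms.txt", out_path="dataset.txt"):
    terms = set()
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            term = line.strip().lower()      # lowercase
            if len(term.split()) > 1:        # discard single-word terms
                terms.add(term)              # unique
    with open(out_path, "w", encoding="utf-8") as out:
        for term in sorted(terms):           # sort
            out.write(term + "\n")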
That's somewhat similar to the second mini-project of my NLP course (which I delivered just about a week ago).
What we had to do was, first, gather a large corpus (in this case, of Portuguese). Then, produce the frequencies of each unigram and bigram. Then, process those frequencies and output the top 10 and bottom 10 collocations of the corpus. Collocation definition from Wikipedia: "Within the area of corpus linguistics, collocation defines a sequence of words or terms that co-occur more often than would be expected by chance".
The method used to calculate the collocations was based on a chi-square test. The values with higher chi-square were the most probable collocations (if above the critical value that would reject the null hypothesis).
So, I actually have Ruby code done that I could change just a little bit for this, but there are probably more clever methods.
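Not the Ruby, but the core of it in Python looks roughly like this (a sketch of the chi-square ranking over bigrams, using the usual 2x2 contingency table for each bigram):

from collections import Counter

def chi_square_collocations(tokens, top=10, min_count=5):
    """Rank bigrams by the chi-square statistic of their 2x2 contingency table."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    scored = []
    for (w1, w2), o11 in bigrams.items():
        if o11 < min_count:
            continue
        o12 = unigrams[w1] - o11           # w1 followed by something else
        o21 = unigrams[w2] - o11           # something else followed by w2
        o22 = n - o11 - o12 - o21          # neither word involved
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        if den:
            scored.append((num / den, (w1, w2)))
    scored.sort(reverse=True)
    # 3.841 is the critical value at p = 0.05 with one degree of freedom
    best = [s for s in scored if s[0] > 3.841][:top]
    return best, scored[-top:]

top10, bottom10 = chi_square_collocations(
    open("corpus.txt", encoding="utf-8").read().lower().split())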
This sounds interesting, but the main challenge here seems to be to define the question: what is a semantically related word?
I can think of three classes - variations (singular/plural), synonyms (thesaurus), contextual (occur in the same sentence/paragraph).
Once the definition of "semantically related" is established, it should be easy to find a clever algorithm to produce results. However, out of context the problem is a little bit boring - in order to find a good solution I'd like to know the real world problem the list of "semantically related" words is going to solve.
There are two aspects of the problem that I consider particularly interesting: a) defining the problem, because being able to define the problem and justify your answer is half the puzzle in a lot of NLP work. b) scaling to a corpus of this size (vocabulary of 1M), because this scale is tricky but useful in many web problems.
Semioticians typically distinguish between paradigmatic and syntagmatic axes of semantic relatedness. Paradigmatic means that two words occur with similar other words, e.g. these two words typically have the same word immediately to their left, like "blue" and "azure". Syntagmatic means that the words typically co-occur in usage, like "blue" and "sky". Check out the image on this page for another illustration of these axes: http://www.aber.ac.uk/media/Documents/S4B/sem03.html
Regardless of whether you choose to do a paradigmatic or syntagmatic analysis, it's interesting to see how you motivate your approach and if you can scale it to 1M different vocabulary words.
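To make the two axes concrete, here is a tiny Python sketch of both, under toy definitions of my own choosing (syntagmatic = co-occurs in the same document, paradigmatic = shares immediate left neighbours):

from collections import Counter, defaultdict

def syntagmatic_pairs(docs):
    """Words are related if they co-occur in the same document."""
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                pair_counts[(a, b)] += 1
    return pair_counts

def paradigmatic_pairs(docs):
    """Words are related if they appear after the same immediate left neighbour."""
    left_contexts = defaultdict(set)
    for doc in docs:
        tokens = doc.split()
        for left, word in zip(tokens, tokens[1:]):
            left_contexts[word].add(left)
    sims = Counter()
    words = sorted(left_contexts)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            shared = len(left_contexts[a] & left_contexts[b])
            if shared:
                sims[(a, b)] = shared
    return sims

docs = ["the blue sky", "the azure sky", "blue sky today"]
print(syntagmatic_pairs(docs).most_common(3))   # ('blue', 'sky') co-occur a lot
print(paradigmatic_pairs(docs).most_common(3))  # ('azure', 'blue') share the left word 'the'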
I think the fairer comparison would be to show THE SAME random subset for each entry, i.e. the same X words for all result sets.
Otherwise, it might happen that a superior method just gets shown words that don't even have good results, while an inferior one gets words that are better covered.
I think the challenge implies that the only data you can use is the provided dataset. Otherwise it just makes no sense: if you pull the similarity data from somewhere else, then where's the challenge?
Somebody else in this thread brought up performing some kind of k-gram analysis and building a "thesaurus" of sorts from that. While that can be really good for vector space style document matching, if you try and actually "read" the results, you can get some weirdness.
Hmm. Without any thought given to computability:
Rearrange the words of every document into alphabetical order, then sort the whole list of documents into alphabetical order. Now, start reading the documents, and for each one make it a tree with each word a leaf, BUT if there is already a tree with a fragment of the same phrase, then that tree is linked to as a node of the new tree.
So for the document 'christmas day' the tree
       d1
      /  \
christmas  day
is formed. But for the document 'christmas day special', the tree
      d2
     /  \
    d1   special
is formed. Note: word order doesn't matter. For any given document, how will it know if there's already a tree that uses some of the same words? When a tree is created from a document, give it a hash for the full set of words, plus hashes for sub-strings. These can be stored in a big lookup table, so new documents know where to look for relevant trees (or use whatever solution is well established for this sort of problem). We might also need to algorithmically redefine older trees as we go along, i.e. divide a tree into smaller phrases.
For each leaf, let it keep track of what trees it's part of (including grandparents and above). Then, to find the top-k, for each leaf object (if it's in the vocabulary), print out all the other leaves of all the trees it's part of. Like:
if str(leaf) == 'day':
    for tree in leaf.trees:
        print(tree.leaves)
Then do a simple frequency count to find the top-k associated words.
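A flattened Python sketch of that lookup (it skips the tree-merging and sub-phrase-hashing part and keeps only what the query needs: each leaf remembers which trees it belongs to, then a frequency count over co-leaves):

from collections import Counter, defaultdict

tree_leaves = {}                 # tree id -> the words (leaves) in that tree
leaf_trees = defaultdict(set)    # word    -> ids of the trees it is part of

def add_document(doc):
    words = frozenset(doc.lower().split())
    tree_id = hash(words)        # stand-in for the sub-phrase hash lookup table
    tree_leaves[tree_id] = words
    for word in words:
        leaf_trees[word].add(tree_id)

def top_k(word, k=10):
    counts = Counter()
    for tree_id in leaf_trees[word]:
        for other in tree_leaves[tree_id]:
            if other != word:
                counts[other] += 1
    return [w for w, _ in counts.most_common(k)]

for doc in ["christmas day", "christmas day special", "boxing day"]:
    add_document(doc)
print(top_k("day"))              # 'christmas' comes out on top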
1. Get a few million tweets from the Twitter API (each tweet is exactly like what the article describes as a 'document').
2. After ignoring common words (e.g. 'a', 'an', 'the' ...) in a tweet, start assigning some kind of relevance rank to all the other word pairs found in the tweet.
E.g. it's likely that 'Cricket' and 'Sachin' (or 'NBA' and <top NBA player>) will each increase the other's relevance rank with respect to the other.
3. Process all the tweets like this, while maintaining the output of step 2 in a suitable data structure. You would also need to start dropping word pairs (to avoid the problem of storing all million-choose-2 pairs!) based on some logic/heuristic.
4. If we have good logic for keeping storage reasonably small (by avoiding the million-choose-2 explosion), then at the end of processing we have a simple lookup over a million words, where each word has its 'top' max_allowed(k) semantically related words.
PS: The most complex piece in the approach is coming up with a solution for dropping word pairs (in step 3); a rough sketch of one such heuristic follows.
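Rough Python sketch of steps 2-4 (the stopword list and the pruning rule are placeholders of mine, just to make the shape concrete):

from collections import Counter
from itertools import combinations

STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}   # step 2: ignore common words
MAX_PAIRS = 1_000_000                                           # step 3: cap on stored pairs

pair_counts = Counter()

def process_tweet(tweet):
    words = sorted({w for w in tweet.lower().split() if w not in STOPWORDS})
    for pair in combinations(words, 2):
        pair_counts[pair] += 1
    # placeholder pruning heuristic: once the table gets too big,
    # drop the pairs that have only been seen once so far
    if len(pair_counts) > MAX_PAIRS:
        for pair, count in list(pair_counts.items()):
            if count == 1:
                del pair_counts[pair]

def top_k(word, k=10):                      # step 4: the final lookup
    related = Counter()
    for (a, b), count in pair_counts.items():
        if word == a:
            related[b] += count
        elif word == b:
            related[a] += count
    return related.most_common(k)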
I have a preliminary implementation of this, geared for map/reduce-style parallelism in Perl, up at http://github.com/spencertipping/metaoptimize-challenge (in the fcm directory). It may be a start to solving step 3 -- using reduce-by-two on each step and dropping low-relevance words (I think this is valid, though I'm not 100% certain).
Hi!
Thanks for reading my suggested approach. I can understand how map-reduce can be used to process things faster (by processing them in parallel and later using reduce to aggregate the results?).
But I can't see how the reduce step (of map-reduce) can help with dropping word pairs (i.e. in step 3).
I think it's an imperfect solution, though it might still be solid. It depends on your reduction strategy (basically breaking associativity). If you do a fold-left, then you'll accumulate such high relevance that you'll quickly start discarding every word in the right-hand data set. On the other hand, if you binary-split the folding process you're more likely to be OK.
I used a binary split and due to the sparseness of the data (and the uninsightfulness of my algorithm) didn't run into too many spurious drop cases. But the implementation I posted is very basic and shows my lack of background in NLP-related pursuits -- I'm lucky it did anything useful at all :)
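In Python terms it looks something like this (the keep_fraction number is arbitrary; the point is only that the drop threshold scales with the merged table, which is what makes a plain left fold dangerous):

from collections import Counter

def merge(left, right, keep_fraction=0.001):
    """Merge two co-occurrence tables, dropping entries that are tiny relative
    to the merged total. In a left fold the accumulator's total keeps growing,
    so almost everything arriving from the right-hand table gets dropped;
    a binary split keeps the two sides comparable in size."""
    combined = left + right
    threshold = keep_fraction * sum(combined.values())
    return Counter({pair: c for pair, c in combined.items() if c >= threshold})

def reduce_binary(tables):
    """Pairwise (binary-split) reduction instead of a left fold."""
    while len(tables) > 1:
        tables = [merge(tables[i], tables[i + 1]) if i + 1 < len(tables) else tables[i]
                  for i in range(0, len(tables), 2)]
    return tables[0]

# for contrast, the left fold would be: functools.reduce(merge, tables)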
I will not bother with this challenge, since there is no clear good solution. I would find it more interesting to work with semantically rich documents, where your work can be tested to measure the quality of your solution. That is, good work must always gain enough recognition to be pursued.
If I had enough time for the problem, I would first use a morphological and syntactic analyzer to try (at least with high probability) to determine the morphological structure of the words, and I would tag them with their syntactic category (part of speech). So I would treat, for example, 'work' and 'worked' as the same word. On the other hand, I would treat 'can:aux' and 'can:noun' as different words.
Then I would completely eliminate all the words which are not nouns, adjectives, or verbs (the, a, can:aux, may, etc.). These are 'asemantical' words.
Maybe also eliminate extremely rare words (with 1 or 2 occurrences).
After this preprocessing comes the semantic relatedness analysis.
A very primitive solution:
----
Compute the ten best B for A where the following is maximal:
f(AB) / (f(A) * f(B))
f can be the frequency. f(AB) can be the frequency of occurring together in a document or within a distance (like 5 words). But you can fine-tune it: weight it by the actual distance.
You don't have to maintain the full matrix. As you go through the documents just maintain max 100 words for each word for which this measure is maximal, and in the end just take the 10 maximal from the 100.
----
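For concreteness, a rough Python sketch of that primitive scheme (I prune each word's candidate list by raw co-count rather than by the ratio, just to keep the sketch short):

from collections import Counter, defaultdict
from itertools import combinations

MAX_CANDIDATES = 100   # the per-word cap, so the full matrix is never stored

word_freq = Counter()
pair_freq = defaultdict(Counter)   # pair_freq[a][b] = times a and b co-occur in a document

def add_document(words):
    words = set(words)
    word_freq.update(words)
    for a, b in combinations(sorted(words), 2):
        pair_freq[a][b] += 1
        pair_freq[b][a] += 1
    for w in words:
        # keep only the strongest candidates for each word
        if len(pair_freq[w]) > 2 * MAX_CANDIDATES:
            pair_freq[w] = Counter(dict(pair_freq[w].most_common(MAX_CANDIDATES)))

def top_10(a):
    score = lambda b: pair_freq[a][b] / (word_freq[a] * word_freq[b])   # f(AB) / (f(A) * f(B))
    return sorted(pair_freq[a], key=score, reverse=True)[:10]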
A more sophisticated solution for the semantic relatedness analysis:
Let's put all the possible words onto a 2-dimensional map with totally random x, y coordinates. Now, as you go through the documents, imagine that words which occur near each other attract each other on the map, so their distance is made *= 0.8. They gravitate. Imagine that the mass of a word is its frequency: low-frequency words move towards the high-frequency word a lot, and the high-frequency word moves just a little bit. Just like gravitation. Of course, from time to time we change the scale: we blow the whole map up a bit so that everything does not gravitate into one point. This algorithm is a 'global algorithm' in the sense that it takes into account indirect things like synonyms, which are semantically related even though they don't necessarily occur in the same sentence. In the end, determining the 10 nearest neighbours fast is easy: you can use a simple grid as 'space partitioning'.
Edit: A 2-dimensional map is not necessarily optimal; it might be too tight. I would also experiment with setting the map dimension to 3, 4, 5, etc.
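A toy Python sketch of the gravitation idea, for whoever wants to play with it (2-D only, all word pairs within a document counted as "near each other", brute-force nearest neighbours instead of a grid, and the pull/rescale constants are just guesses):

import random
from collections import Counter

def gravity_map(docs, passes=10, pull=0.8, rescale=1.05):
    """docs: lists of (preprocessed) tokens. Returns word -> [x, y]."""
    freq = Counter(w for doc in docs for w in doc)
    pos = {w: [random.random(), random.random()] for w in freq}
    for _ in range(passes):
        for doc in docs:
            for i, a in enumerate(doc):
                for b in doc[i + 1:]:
                    if a == b:
                        continue
                    # the rarer word does most of the moving ("gravitation")
                    share_a = freq[b] / (freq[a] + freq[b])
                    share_b = freq[a] / (freq[a] + freq[b])
                    for d in (0, 1):
                        gap = pos[b][d] - pos[a][d]
                        pos[a][d] += (1 - pull) * gap * share_a   # distance ends up *= 0.8
                        pos[b][d] -= (1 - pull) * gap * share_b
        for p in pos.values():                # blow the map up from time to time
            p[0] *= rescale
            p[1] *= rescale
    return pos

def nearest(pos, word, k=10):
    """Brute force; a grid (space partitioning) would make this fast."""
    x, y = pos[word]
    return sorted((w for w in pos if w != word),
                  key=lambda w: (pos[w][0] - x) ** 2 + (pos[w][1] - y) ** 2)[:k]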
A couple of approaches I can think of from papers which, combined in some way, would probably give decent results: matching left and right fragments of length n, adding POS tags to the matching fragments, and using WordNet relation closeness for terms that exist in WordNet.
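The WordNet part is easy enough to prototype; a small sketch using NLTK's WordNet interface (my choice of toolkit, not something the challenge prescribes; requires nltk.download('wordnet') first):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def wordnet_closeness(term_a, term_b):
    """Best path similarity between any senses of the two terms (0..1),
    or None if either term is not in WordNet."""
    synsets_a = wn.synsets(term_a.replace(" ", "_"))
    synsets_b = wn.synsets(term_b.replace(" ", "_"))
    scores = [a.path_similarity(b) for a in synsets_a for b in synsets_b]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None

print(wordnet_closeness("museum", "gallery"))   # reasonably close
print(wordnet_closeness("museum", "cricket"))   # much less so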
I encourage anyone to use Extractiv (http://www.extractiv.com) for this challenge. We're rolling out an On-Demand (local document analysis similar to Calais) Semantic Conversion service that could be helpful.
Also, the pricing page says users need to buy a plus or premium subscription for access to Semantic On-Demand, but confusingly says the basic plan includes 'On-Demand access to 1,000 documents / day' - I presume you meant to emphasize the unlimited usage available with the premium plans.