From the website, here are some example "documents":
roger heavens trinity cricketers willie sugg early cambridge cricket giles phillips cambridge university cricket
roger hornetttom pridmore 020 7399 4270 collins stewartalan geeves 020 7523 8800 buchanan communicationscharles ryland /isabel podda 020 7466 5000
roger hubbold aig .1 force-field navigation
Now, what is the point in trying to generate any kind of "meaning" from those documents if they consist of completely meaningless gibberish?
As I was reading this challenge, I immediately thought of spam filtering / YouTube comment classification ("smartness" classification) / etc. as a potentially useful application of this technology.
For example, if each "document" is a YouTube comment, then in theory you could write an algorithm to examine each comment and output a "smartness guess" for each. Then you (as in, you personally, by hand) would look at the results and assign your own "smartness rating" to a few comments. Then you'd run an algorithm that looks at the difference between your "smartness rating" and the "smartness guess", and, using that difference, tweaks the settings in the original algorithm until it outputs a "smartness guess" that more closely fits your "smartness rating". If you repeat that process enough times, and your original algorithm has enough knobs to tweak, then you might wind up with an algorithm that can make a pretty good guess about whether any given YouTube comment is worth reading or not.
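What's being described here is basically a supervised learning loop. A minimal sketch of the tweak-the-settings idea, as a toy bag-of-words model with a perceptron-style update (all the names and example comments below are hypothetical, not anyone's actual system):

    from collections import defaultdict

    def featurize(comment):
        # Bag-of-words features: each lowercase token counts once.
        return set(comment.lower().split())

    def smartness_guess(weights, comment):
        # Score a comment by summing the weights of its tokens.
        return sum(weights[tok] for tok in featurize(comment))

    def train(labeled, rounds=50, step=0.1):
        # Repeatedly nudge token weights toward the hand-made ratings.
        weights = defaultdict(float)
        for _ in range(rounds):
            for comment, rating in labeled:  # rating: your by-hand score, e.g. -1.0 or +1.0
                diff = rating - smartness_guess(weights, comment)
                for tok in featurize(comment):
                    weights[tok] += step * diff  # tweak the settings toward your rating
        return weights

    labeled = [("first post lol", -1.0),
               ("insightful point about tokenization", 1.0)]
    w = train(labeled)
    print(smartness_guess(w, "lol what"))  # should lean negative

With more labels, more rounds, and more features than single tokens, the same loop is essentially how real spam filters get trained.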
And that's just one example of a practical application for this kind of thing.
That said, if the input "documents" are completely and utterly meaningless, then there does not seem to be any point in trying to build "meaning" from those inputs. (Garbage in, garbage out.)
It's possible he's randomly generated the docs or something. Which, I agree, is a bit lame. It's nice to get visual confirmation that what you're doing is correct, and to get some real insight into language.
Much better would have been to take real sentences from Wikipedia.
No. Grandparent has picked a few pathological examples out of many good ones.
Here is a uniform, random sample of documents:
faststretch details
honourable judge
oriental museum http
london wc 1n 3 bg pdsg seminar
1857-8 indian mutiny
vera bell
katherine r.b
rare books collection
novy arbat street
universitys case
The data set consists of the unique terms that occur in a crawl of the .uk domain.
Here is how I constructed the data set:
I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn Treebank word tokenizer, and performed term extraction with topia.termextract, discarding terms that were single words.
I then lowercased the terms, sorted them, and uniqued them, to give the dataset.
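Roughly, the per-sentence step looks like this (a sketch, not the exact script: splitta runs from the command line, so the sentence splitting is assumed to have happened upstream, and NLTK's TreebankWordTokenizer stands in for the Penn Treebank tokenizer):

    import re
    from nltk.tokenize import TreebankWordTokenizer
    from topia.termextract import extract

    tokenizer = TreebankWordTokenizer()
    extractor = extract.TermExtractor()

    def terms_from_sentence(sentence):
        # Remove "funny" (non-printable-ASCII) characters.
        cleaned = re.sub(r"[^\x20-\x7e]", " ", sentence)
        tokens = tokenizer.tokenize(cleaned)
        # topia.termextract yields (term, occurrences, word_count) triples;
        # keep only the multi-word terms.
        terms = extractor(" ".join(tokens))
        return [term for term, _, word_count in terms if word_count > 1]

    def build_dataset(sentences):
        # Lowercase, then the equivalent of sort | uniq.
        unique_terms = set()
        for sentence in sentences:
            unique_terms.update(t.lower() for t in terms_from_sentence(sentence))
        return sorted(unique_terms)

    print(build_dataset(["The Oriental Museum holds a rare books collection."]))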