
This doesn't make much sense....

From the website, here are some example "documents":

  roger heavens trinity cricketers willie sugg early cambridge cricket giles phillips cambridge university cricket

  roger hornetttom pridmore 020 7399 4270 collins stewartalan geeves 020 7523 8800 buchanan communicationscharles ryland /isabel podda 020 7466 5000

  roger hubbold aig .1 force-field navigation
Now, what is the point in trying to generate any kind of "meaning" from those documents if they consist of completely meaningless gibberish?

As I was reading this challenge, I immediately thought of spam filtering, YouTube comment classification ("smartness" classification), and the like as potentially useful applications of this technology.

For example, if each "document" is a YouTube comment, then in theory you could write an algorithm that examines each comment and outputs a "smartness guess" for it. Then you (as in, you personally, by hand) would look at the results and assign your own "smartness rating" to a few comments. Then you'd run an algorithm that looks at the difference between your specified "smartness rating" and the "smartness guess", and, using that difference, tweaks the settings of the original algorithm until its guesses fit your ratings more closely. If you repeat that process enough times, and your original algorithm has enough knobs to tweak, you might wind up with an algorithm that can make a pretty good guess about whether any given YouTube comment is smart or not.
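
As a rough sketch of that tweaking loop (every name here is made up for illustration, and it assumes the simplest possible model: one weight per word, nudged by gradient descent toward the hand labels):

  import math
  from collections import defaultdict

  def smartness_guess(weights, comment):
      # Sum the per-word weights, then squash to a guess between 0 and 1.
      score = sum(weights[w] for w in comment.lower().split())
      return 1.0 / (1.0 + math.exp(-score))

  def fit(comments, ratings, steps=200, lr=0.1):
      # Repeatedly nudge each word's weight to shrink the gap between
      # the algorithm's guess and the hand-assigned rating (0 or 1).
      weights = defaultdict(float)
      for _ in range(steps):
          for comment, rating in zip(comments, ratings):
              diff = smartness_guess(weights, comment) - rating
              for w in comment.lower().split():
                  weights[w] -= lr * diff
      return weights

Given enough hand-rated comments and enough passes, the weights settle on something that guesses plausibly on comments it has never seen.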

And that's just one example of a practical application for this kind of thing.

That said, if the input "documents" are completely and utterly meaningless, then there does not seem to be any point in trying to build "meaning" from those inputs. (Garbage in, garbage out.)



> Now, what is the point in trying to generate any kind of "meaning" from those documents if each document consists of completely meaningless gibberish?

A lot of data with a small amount of noise can be much more informative than a small amount of "perfect" data.

If only 1-5% of your examples are noise, you can still extract meaning and find patterns.

Also, that's not meaningless gibberish; there is still signal in those examples.


Semantically related != meaningful.

Also, they do note that "'Semantically related' is purposefully left vague."


I'm sorry, but "Semantics" is "the study of meaning" (at least according to Wikipedia). How is it possible that "semantically related != meaningful"?


You left out a word - 'related'. Consider:

Tennis Playpen

Semantically related, yes; meaningful, no.


Then what is the point?

Or rather... could I make a YouTube comment classifier and enter it?


It's possible he randomly generated the docs or something, which, I agree, is a bit lame. It's nice to get visual confirmation that what you're doing is correct, and to get some real insight into language.

It would have been much better to take real sentences from Wikipedia.


No. Grandparent has picked out a few pathological examples out of many good ones.

Here is a uniform, random sample of documents:

  faststretch details
  honourable judge
  oriental museum http
  london wc 1n 3 bg pdsg seminar
  1857-8 indian mutiny
  vera bell
  katherine r.b
  rare books collection
  novy arbat street
  universitys case
The data set is unique terms that occur in a crawl of .uk. Here is how I constructed the data set:

I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn treebank word tokenizer, and performed term extraction with topia.termextract, discarding terms that were single words. I then lowercased the terms, sorted them, and uniqued them to give the dataset.
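
For the final lowercase/sort/unique steps, here is a minimal sketch (assuming a hypothetical terms.txt with one extracted term per line; the crawl, splitta, tokenizer, and topia.termextract stages are left out):

  # Keep multi-word terms only, lowercased and deduplicated,
  # then emit them in sorted order.
  terms = set()
  with open("terms.txt") as f:
      for line in f:
          term = line.strip().lower()
          if len(term.split()) > 1:  # discard single-word terms
              terms.add(term)
  for term in sorted(terms):
      print(term)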



