
This doesn't make much sense....

From the website, here are some example "documents":

  roger heavens trinity cricketers willie sugg early cambridge cricket giles phillips cambridge university cricket

  roger hornetttom pridmore 020 7399 4270 collins stewartalan geeves 020 7523 8800 buchanan communicationscharles ryland /isabel podda 020 7466 5000

  roger hubbold aig .1 force-field navigation
Now, what is the point in trying to generate any kind of "meaning" from those documents if they consist of completely meaningless gibberish?

As I was reading this challenge, I immediately thought of spam filtering, YouTube comment classification ("smartness" classification), and the like as potentially useful applications of this technology.

For example, if each "document" is a YouTube comment, then in theory you could write an algorithm that examines each comment and outputs a "smartness guess" for it. Then you (as in, you personally, by hand) would look at the results and assign your own "smartness rating" to a few comments. Then you'd run an algorithm that looks at the difference between your specified "smartness rating" and the "smartness guess", and, using that difference, tweaks the settings of the original algorithm until its guesses fit your ratings more closely. If you repeat that process enough times, and your original algorithm has enough knobs to tweak, you might wind up with an algorithm that can make a pretty good guess about whether any given YouTube comment is smart or not.
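
As a rough sketch of that tweaking loop (every name here is made up for illustration, and it assumes the simplest possible model: one weight per word, nudged by gradient descent toward the hand labels):

  import math
  from collections import defaultdict

  def smartness_guess(weights, comment):
      # Sum the per-word weights, then squash to a guess between 0 and 1.
      score = sum(weights[w] for w in comment.lower().split())
      return 1.0 / (1.0 + math.exp(-score))

  def fit(comments, ratings, steps=200, lr=0.1):
      # Repeatedly nudge each word's weight to shrink the gap between
      # the algorithm's guess and the hand-assigned rating (0 or 1).
      weights = defaultdict(float)
      for _ in range(steps):
          for comment, rating in zip(comments, ratings):
              diff = smartness_guess(weights, comment) - rating
              for w in comment.lower().split():
                  weights[w] -= lr * diff
      return weights

Given enough hand-rated comments and enough passes, the weights settle on something that guesses plausibly on comments it has never seen.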

And that's just one example of a practical application for this kind of thing.

That said, if the input "documents" are completely and utterly meaningless, then there does not seem to be any point in trying to build "meaning" from those inputs. (Garbage in, garbage out.)



> Now, what is the point in trying to generate any kind of "meaning" from those documents if each document consists of completely meaningless gibberish?

A lot of data with a small amount of noise can be much more informative than a small amount of "perfect" data.

If only 1-5% of your examples are noise, you can still extract meaning and find patterns.

Also, that's not meaningless gibberish; there is still signal in those examples.


Semantically related != meaningful.

Also, they do note that "'Semantically related' is purposefully left vague."


I'm sorry, but "Semantics" is "the study of meaning" (at least according to Wikipedia). How is it possible that "semantically related != meaningful"?


You left out a word - 'related'. Consider:

Tennis Playpen

Semantically related, yes; meaningful, no.


Then what is the point?

Or rather... could I make a YouTube comment classifier and enter it?


It's possible he randomly generated the docs or something, which, I agree, is a bit lame. It's nice to get visual confirmation that what you're doing is correct, and to get some real insight into language.

It would have been much better to take real sentences from Wikipedia.


No. Grandparent has picked out a few pathological examples out of many good ones.

Here is a uniform, random sample of documents:

  faststretch details
  honourable judge
  oriental museum http
  london wc 1n 3 bg pdsg seminar
  1857-8 indian mutiny
  vera bell
  katherine r.b
  rare books collection
  novy arbat street
  universitys case
The data set is unique terms that occur in a crawl of .uk. Here is how I constructed the data set:

I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn treebank word tokenizer, and performed term extraction with topia.termextract, discarding terms that were single words. I then lowercased the terms, sorted them, and uniqued them to give the dataset.
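
For the final lowercase/sort/unique steps, here is a minimal sketch (assuming a hypothetical terms.txt with one extracted term per line; the crawl, splitta, tokenizer, and topia.termextract stages are left out):

  # Keep multi-word terms only, lowercased and deduplicated,
  # then emit them in sorted order.
  terms = set()
  with open("terms.txt") as f:
      for line in f:
          term = line.strip().lower()
          if len(term.split()) > 1:  # discard single-word terms
              terms.add(term)
  for term in sorted(terms):
      print(term)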



