NLP Challenge: Find semantically related terms over a large vocabulary (1M)? (metaoptimize.com)
89 points by bravura on Nov 6, 2010 | 35 comments


This doesn't make much sense....

From the website, here are some example "documents":

  roger heavens trinity cricketers willie sugg early cambridge cricket giles phillips cambridge university cricket

  roger hornetttom pridmore 020 7399 4270 collins stewartalan geeves 020 7523 8800 buchanan communicationscharles ryland /isabel podda 020 7466 5000

  roger hubbold aig .1 force-field navigation
Now, what is the point in trying to generate any kind of "meaning" from those documents if they consist of completely meaningless gibberish?

As I was reading this challenge, I immediately thought of spam filtering / youtube comment classification ("smartness" classification) / etc. as potentially useful applications of this technology.

For example if each "document" is a youtube comment, then in theory you could write an algorithm to examine each comment and output a "smartness guess" for each. Then you (as in, you personally, by hand) would look at the results and specify your own "smartness rating" for a few comments. Then you'd run an algorithm to look at the difference between your specified "smartness rating" and the "smartness guess". Then, using that difference, it would tweak the settings in the original algorithm until it outputs a "smartness guess" that more closely fits your "smartness rating". If you repeat that process enough times, and your original algorithm has enough modifiers to tweak, then you might wind up with an algorithm that can make a pretty good guess about whether any given youtube comment is retarded or not.
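As a sketch of that label-guess-tweak loop (my own toy example, assuming scikit-learn is available; the comments, the "smartness" labels, and the bag-of-words + logistic regression choice are all invented for illustration):

  # Hypothetical sketch of the label-and-retrain loop described above,
  # using scikit-learn; the comments and labels are made-up examples.
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.linear_model import LogisticRegression

  comments = ["first!!!", "great explanation of the fourier transform",
              "lol u suck", "the citation at 2:31 is wrong, see Knuth vol 1"]
  smartness = [0, 1, 0, 1]          # hand-assigned "smartness ratings"

  vec = CountVectorizer()
  X = vec.fit_transform(comments)   # bag-of-words features
  clf = LogisticRegression().fit(X, smartness)

  # "smartness guess" for an unseen comment
  guess = clf.predict_proba(vec.transform(["nice proof, but step 3 is off"]))[0, 1]
  print(guess)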

And that's just one example of a practical application for this kind of thing.

That said, if the input "documents" are completely and utterly meaningless, then there does not seem to be any point in trying to build "meaning" from those inputs. (Garbage in, garbage out.)


Now, what is the point in trying to generate any kind of "meaning" from those documents if each document consists of completely meaningless gibberish?

A lot of data with a small amount of noise can be much more informative than a small amount of "perfect" data.

If 1-5% of your examples are noise, you can still extract meaning, and find patterns.

Also, that's not meaningless gibberish, there is still signal in those examples.


Semantically related != meaningful.

Also, they do note that, "'Semantically related' is purposefully left vague."


I'm sorry, but "Semantics" is "the study of meaning" (at least according to Wikipedia). How is it possible that "Semantically related != meaningful"?


You left out a word - 'related'. Consider:

Tennis Playpen

Semantically related yes, meaningful no.


Then what is the point?

Or rather... Could I make a youtube comment classifier and enter it?


It's possible he's randomly generated the docs or something. Which, I agree, is a bit lame. It's nice to get optical confirmation that what you're doing is correct, and to get some real insight into language.

Much better would have been to take real sentences from Wikipedia.


No. Grandparent has picked out a few pathological examples out of many good ones.

Here is a uniform, random sample of documents:

  faststretch details
  honourable judge
  oriental museum http
  london wc 1n 3 bg pdsg seminar
  1857-8 indian mutiny
  vera bell
  katherine r.b
  rare books collection
  novy arbat street
  universitys case
The data set is unique terms that occur in a crawl of .uk. Here is how I constructed the data set:

I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn treebank word tokenizer, and performed term extraction with topia.termextract, discarding terms that were single words. I then lowercased the terms, sorted them, and uniqued them, to give the dataset.


That's somewhat similar to my second mini-project of my NLP course (that I just delivered like one week ago).

What we had to do was, first, gather a large corpus (in this case, of Portuguese). Then produce the frequencies of each unigram and bigram. Then process those frequencies and output the top 10 and bottom 10 collocations of the corpus. Collocation definition from Wikipedia: "Within the area of corpus linguistics, collocation defines a sequence of words or terms that co-occur more often than would be expected by chance".

The method used to calculate the collocations was based on a chi-square test. The pairs with the highest chi-square values were the most probable collocations (if above the critical value needed to reject the null hypothesis).

So, I actually have Ruby code done that I could change just a little bit for this, but there are probably more clever methods.
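(Not the Ruby code mentioned above, just a minimal Python sketch of that chi-square collocation scoring on an invented toy corpus, using the standard 2x2 contingency-table formula; 3.841 is the p = 0.05 critical value at 1 degree of freedom:)

  # Toy sketch of chi-square collocation scoring (invented corpus, standard
  # 2x2 contingency-table formula; 3.841 is the p=0.05 critical value, 1 df).
  from collections import Counter

  tokens = "the new york times reported that new york rents rose".split()
  bigrams = list(zip(tokens, tokens[1:]))
  uni, bi, N = Counter(tokens), Counter(bigrams), len(bigrams)

  def chi_square(w1, w2):
      o11 = bi[(w1, w2)]
      o12 = sum(c for (a, b), c in bi.items() if a == w1) - o11
      o21 = sum(c for (a, b), c in bi.items() if b == w2) - o11
      o22 = N - o11 - o12 - o21
      num = N * (o11 * o22 - o12 * o21) ** 2
      den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
      return num / den if den else 0.0

  scores = sorted(((chi_square(a, b), (a, b)) for (a, b) in bi), reverse=True)
  print(scores[:3])   # ('new', 'york') scores highest, well above 3.841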


This sounds interesting but the main challenge here seems to be to define the question: What is a semantically related word. I can think of three classes - variations (singular/plural), synonyms (thesaurus), contextual (occur in the same sentence/paragraph). Once the definition of "semantically related" is established, it should be easy to find a clever algorithm to produce results. However, out of context the problem is a little bit boring - in order to find a good solution I'd like to know the real world problem the list of "semantically related" words is going to solve.


There are two aspects of the problem that I consider particularly interesting: a) defining the problem, because being able to define the prob­lem and jus­tify your answer is half the puzzle in a lot of NLP work. b) scaling to a corpus of this size (vocabulary of 1M), because this scale is tricky but useful in many web problems.

Semioticians typically distinguish between paradigmatic and syntagmatic axes of semantic relatedness. Paradigmatic means that two words occur with similar other words, e.g. these two words typically have the same word immediately to their left, like "blue" and "azure". Syntagmatic means that the words typically co-occur in usage, like "blue" and "sky". Check out the image on this page for another illustration of these axes: http://www.aber.ac.uk/media/Documents/S4B/sem03.html

Regardless of whether you choose to do a paradigmatic or syntagmatic analysis, it's interesting to see how you motivate your approach and if you can scale it to 1M different vocabulary words.
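A toy illustration of the two axes (the mini-corpus, the windowing, and the cosine-of-context-vectors choice are all my own assumptions, not part of the challenge): syntagmatic relatedness is just co-occurrence within a document, paradigmatic relatedness is similarity of the contexts two words appear in, even if they never co-occur.

  # Toy sketch: syntagmatic = co-occur in a document,
  # paradigmatic = occur in similar contexts (cosine of context vectors).
  from collections import Counter, defaultdict
  from math import sqrt

  docs = ["the blue sky", "the azure sky", "a blue car", "an azure dress"]
  contexts = defaultdict(Counter)   # word -> counts of neighbouring words
  cooc = Counter()                  # unordered co-occurrence within a doc

  for doc in docs:
      words = doc.split()
      for i, w in enumerate(words):
          for j, v in enumerate(words):
              if i == j:
                  continue
              contexts[w][v] += 1
              if i < j:
                  cooc[frozenset((w, v))] += 1

  def paradigmatic(a, b):           # similar neighbours, need not co-occur
      ca, cb = contexts[a], contexts[b]
      dot = sum(ca[k] * cb[k] for k in ca)
      return dot / (sqrt(sum(v * v for v in ca.values())) *
                    sqrt(sum(v * v for v in cb.values())))

  print(cooc[frozenset(("blue", "sky"))])   # 1: syntagmatic, they co-occur
  print(paradigmatic("blue", "azure"))      # 0.5: similar contexts, never co-occur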


scaling to a corpus of this size ... it's interesting to see how you motivate your approach and if you can scale it to 1M different vocabulary words

It looks like the sparsity of the matrix is going to be a much bigger challenge than the scale.

I understand about the focus being primarily on the approach, that makes sense; how are you intending to evaluate the results files?


It looks like the sparsity of the matrix is going to be a much bigger challenge than the scale.

Sparsity is good. The sparsity is the only reason that you can keep a matrix with this many dimensions in memory.
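For instance (a sketch, assuming scipy is available): a term-term co-occurrence matrix over a 1M-word vocabulary has about 10^12 cells, but stored sparsely you only pay for the pairs you actually observe.

  # Sketch: a 1M x 1M co-occurrence matrix is hopelessly dense (~10^12 cells),
  # but a sparse representation only stores observed pairs.
  import numpy as np
  from scipy.sparse import csr_matrix

  vocab_size = 1_000_000
  # pretend we observed these (row, col, count) co-occurrences
  rows = np.array([3, 3, 17, 42])
  cols = np.array([17, 42, 3, 3])
  counts = np.array([5.0, 2.0, 5.0, 2.0])

  cooc = csr_matrix((counts, (rows, cols)), shape=(vocab_size, vocab_size))
  print(cooc.nnz)                     # memory scales with nonzeros, not 10^12
  print(cooc[3].toarray().argmax())   # most co-occurring term for word 3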

I understand about the focus being primarily on the approach, that makes sense; how are you intending to evaluate the results files?

For a random subset of vocab words, I will post each entry's 10 related terms. I'll then ask people to vote blind.


I think the fairer comparison would be to show THE SAME random subset for each entry, i.e. the same X words for all result sets.

Otherwise, it might happen that a superior entry gets judged on words that don't even have good related terms, while an inferior entry gets better-covered words.


Yes, it will be the same random subset for all participants. Sorry if that wasn't clear.


If the documents they provide are meaningful texts, the related terms should probably be contextual (such a corpus would be mostly useless for finding synonyms).


> What is a semantically related word?

Right, I was going to say, just get a dump of wordnet and build a thesaurus graph...


I think the challenge implies that the only data you can use is the provided dataset. Otherwise it just makes no sense: if you pull the similarity data from somewhere else, then where's the challenge?


That's why I'm confused.

Somebody else in this thread brought up performing some kind of k-gram analysis and building a "thesaurus" of sorts from that. While that can be really good for vector space style document matching, if you try and actually "read" the results, you can get some weirdness.

  The duck died.
  The car died.
Ergo duck <semantically equivalent> car.


I checked in a simple co-occurrence frequency based implementation for this here https://github.com/tgflynn/NLP-Challenge with a short discussion on my blog: http://cogniception.com/wp/.



Hmm. Without any thought given to computability: Rearrange the words of every document into alphabetical order, then sort the whole list of documents into alphabetical order. Now, start reading the documents, and for each one make it a tree with each word a leaf, BUT if there is already a tree with a fragment of the same phrase, then that tree is linked to as a node of the new tree.

So for the document 'christmas day' the tree

          d1
         /  \
  christmas  day

is formed. But for the document 'christmas day special', the tree

        d2
       /  \
     d1    special

is formed. Note: word order doesn't matter. For any given document, how will it know if there's already a tree that uses some of the same words? When a tree is created from a document, give it a hash for the full set of words, plus hashes for sub-strings. These can be stored in a big lookup table, so new documents know where to look for relevant trees (or just use whatever solution is well established for this sort of problem). Might also need to algorithmically redefine older trees as we go along, i.e. divide a tree into smaller phrases.

For each leaf, let it keep track of which trees it's part of (including grandparents and above). Then to find the top-k, for each leaf object (if it's in the vocabulary), print out all the other leaves of all the trees it's part of. Like:

  if str(leaf) == 'day':
      for tree in leaf.trees:
          print(tree.leaves)

Then do a simple frequency count to find the top-k associated words.

(sorry for the n00b answer... IANACS)
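(A much simpler structure gets at the same final frequency count: an inverted index from word to document ids, then counting the other words in those documents. This flattens away the trees, so it is only a sketch of the last step, with invented toy documents:)

  # Sketch of just the final counting step: an inverted index from word to
  # the documents containing it, then a frequency count over co-members.
  from collections import Counter, defaultdict

  docs = ["christmas day", "christmas day special", "boxing day"]
  index = defaultdict(set)                 # word -> set of doc ids
  for doc_id, doc in enumerate(docs):
      for word in doc.split():
          index[word].add(doc_id)

  def top_k(word, k=10):
      counts = Counter()
      for doc_id in index[word]:
          counts.update(w for w in docs[doc_id].split() if w != word)
      return counts.most_common(k)

  print(top_k("day"))   # [('christmas', 2), ('special', 1), ('boxing', 1)]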


One way of doing this:

1. Get some million tweets from twitter API (each tweet is exactly like what the article describes as a 'document')

2. After ignoring common words (e.g. 'a', 'an', 'the', ...) in a tweet, start assigning some kind of relevance rank to all the other word pairs found in the tweet, e.g. it's likely that 'Cricket' and 'Sachin' (or 'NBA' and <top NBA player>) will each increase the other's relevance rank.

3. Process all the tweets like this, while maintaining the output of step 2 in the most suitable data structure. You would also need to start dropping word pairs (to avoid the problem of storing C(1 million, 2) pairs!) based on some logic/heuristic.

4. If we have good logic for keeping storage reasonably small (by avoiding the C(1 million, 2) explosion), then what we have at the end of processing is a simple lookup of a million words, where each word has its 'top' max_allowed(k) semantically related words.

PS: The most complex piece of the approach is coming up with a solution for dropping word pairs (in step 3).
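One possible heuristic for step 3 (my assumption, not the only way): keep only a bounded number of counters per word and evict the weakest when the budget is exceeded, in the spirit of lossy-counting / space-saving stream algorithms. The budget, stopword list, and tweet are invented:

  # Sketch of one way to bound storage in step 3: keep at most MAX_PAIRS
  # co-occurrence counters per word and evict the weakest when full.
  from collections import Counter, defaultdict
  from itertools import combinations

  MAX_PAIRS = 1000                       # budget per word (tunable)
  STOPWORDS = {"a", "an", "the"}
  pairs = defaultdict(Counter)           # word -> Counter of co-occurring words

  def add_tweet(text):
      words = {w for w in text.lower().split() if w not in STOPWORDS}
      for a, b in combinations(sorted(words), 2):
          for x, y in ((a, b), (b, a)):
              pairs[x][y] += 1
              if len(pairs[x]) > MAX_PAIRS:                  # over budget:
                  weakest = min(pairs[x], key=pairs[x].get)  # drop weakest
                  del pairs[x][weakest]

  add_tweet("Sachin wins the match for India cricket")
  print(pairs["cricket"].most_common(3))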


I have a preliminary implementation of this, geared for map/reduce-style parallelism in Perl, up at http://github.com/spencertipping/metaoptimize-challenge (in the fcm directory). It may be a start to solving step 3 -- using reduce-by-two on each step and dropping low-relevance words (I think this is valid, though I'm not 100% certain).


Hi! Thanks for reading my suggested approach. I can understand how map-reduce can be used to process things faster (by processing them in parallel and later using reduce to aggregate the results?).

But I can't imagine how the reduce step (of map-reduce) can help with dropping word pairs (i.e. in step 3).


I think it's an imperfect solution, though it might still be solid. It depends on your reduction strategy (basically breaking associativity). If you do a fold-left, then you'll accumulate such high relevance that you'll quickly start discarding every word in the right-hand data set. On the other hand, if you binary-split the folding process you're more likely to be OK.

I used a binary split and due to the sparseness of the data (and the uninsightfulness of my algorithm) didn't run into too many spurious drop cases. But the implementation I posted is very basic and shows my lack of background in NLP-related pursuits -- I'm lucky it did anything useful at all :)
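Roughly, the difference between the two reduction strategies (a sketch with made-up counts and a crude top-k pruning rule, not the code in the repo): pairwise / binary-split merging keeps the two sides comparable in size before pruning, whereas a left fold keeps merging a huge accumulator against one small chunk.

  # Sketch (not the repo's code): binary-split reduction of partial
  # co-occurrence counts, pruning low counts after each merge.
  from collections import Counter

  def merge(a, b, keep=1000):
      m = a + b                                   # Counter addition sums counts
      return Counter(dict(m.most_common(keep)))   # prune to the top `keep`

  def binary_reduce(parts, keep=1000):
      while len(parts) > 1:
          parts = [merge(parts[i], parts[i + 1], keep) if i + 1 < len(parts)
                   else parts[i] for i in range(0, len(parts), 2)]
      return parts[0]

  chunks = [Counter({("blue", "sky"): 3, ("roger", "heavens"): 1}),
            Counter({("blue", "sky"): 2, ("oriental", "museum"): 1})]
  print(binary_reduce(chunks).most_common(2))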


I will not bother with this challenge, since there is no clear good solution. I would find it more interesting to work with semantically rich documents, where your work can be tested to measure the quality of your solution. That is, good work must always gain enough recognition to be pursued.


My first idea would be the following:

If I had enough time for the problem, I would first use a morphological and syntactic analyzer to try (at least with high probability) to determine the morphological structure of the words, and I would tag them with their syntactic category (part of speech). So I would treat, for example, 'work' and 'worked' as the same word. On the other hand, I would treat 'can:aux' and 'can:noun' as different words.

Now I would completely eliminate all the words which are not nouns, adjectives, or verbs (the, a, can:aux, may, etc.). These are 'asemantical' words.

Maybe also eliminate extremely rare words (with 1 or 2 occurrences).

After this preprocessing comes the semantic relatedness analysis.

A very primitive solution:

----

Compute the ten best B for A where the following is maximal:

f(AB) / (f(A) * f(B))

f can be the frequency. f(AB) can be the frequency of occurring together in a document or within a distance (like 5 words). But you can fine-tune it: weight it by the actual distance.

You don't have to maintain the full matrix. As you go through the documents, just maintain at most 100 words for each word for which this measure is maximal, and at the end take the 10 best of those 100.
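A sketch of that measure on a toy corpus (the 5-word window and the invented documents, built from the sample terms posted above, are my choices):

  # Sketch of the f(AB) / (f(A) * f(B)) measure with a 5-word window,
  # on an invented toy corpus.
  from collections import Counter

  docs = ["rare books collection", "rare books librarian",
          "oriental museum collection", "honourable judge"]
  f = Counter()
  fAB = Counter()
  for doc in docs:
      words = doc.split()
      f.update(words)
      for i, a in enumerate(words):
          for b in words[i + 1:i + 6]:          # within a distance of 5
              fAB[frozenset((a, b))] += 1

  def score(a, b):
      return fAB[frozenset((a, b))] / (f[a] * f[b])

  word = "books"
  best = sorted((w for w in f if w != word), key=lambda w: score(word, w),
                reverse=True)
  print([(w, round(score(word, w), 3)) for w in best[:3]])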

----

A more sophisticated solution for the semantic relatedness analysis:

Let's put all the possible words onto a 2-dimensional map with totally random x,y coordinates. Now, as you go through the documents, imagine that the words which occur near each other attract each other on the map, so their distance is made *= 0.8. They gravitate. Imagine that the mass of a word is its frequency: the distance shrinks by the low-frequency word moving towards the high-frequency word a lot, while the high-frequency word moves just a little bit. Just like gravitation. Of course, from time to time we change the scale: we blow the whole map up a bit so that everything doesn't gravitate into one point.

This algorithm is a 'global' algorithm in the sense that it takes into account indirect things like synonyms, which are semantically really related even though they don't actually occur in the same sentence. In the end, determining the 10 nearest neighbours fast is easy: you can use a simple grid as 'space partitioning'.

Edit: A 2-dimensional map is not necessarily optimal; it might be too tight. I would also experiment with setting the map dimension to 3, 4, 5, etc.
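A very rough numpy sketch of the gravitation idea (the two 0.2 steps combine to the *= 0.8 shrink from above; the toy vocabulary, frequencies, step count, and rescaling schedule are all made up):

  # Very rough sketch of the "gravitation" idea with numpy: co-occurring
  # words pull each other together, weighted by frequency, with periodic
  # rescaling so the map doesn't collapse to a point.
  import numpy as np

  rng = np.random.default_rng(0)
  vocab = ["blue", "azure", "sky", "cricket", "cambridge"]
  freq = np.array([10.0, 3.0, 12.0, 5.0, 7.0])
  pos = rng.uniform(-1, 1, size=(len(vocab), 2))     # random 2-D coordinates

  cooccurring_pairs = [(0, 2), (1, 2), (3, 4)]       # observed in documents

  for step in range(100):
      for a, b in cooccurring_pairs:
          delta = pos[b] - pos[a]
          total = freq[a] + freq[b]
          pos[a] += 0.2 * (freq[b] / total) * delta  # lighter word moves more
          pos[b] -= 0.2 * (freq[a] / total) * delta
      if step % 10 == 9:
          pos *= 1.05                                # blow the map up a bit

  # nearest neighbours of "blue" by Euclidean distance on the map
  d = np.linalg.norm(pos - pos[0], axis=1)
  print([vocab[i] for i in np.argsort(d)[1:3]])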


try it and see :D

I think I'm going to be attempting a solution, too (though it's not the same as yours)


A couple of approaches that I can think of from papers, which, combined in some way, would probably give decent results: matching left and right fragments of length n, adding POS tagging to the matching fragments, and using WordNet relation closeness for terms that exist in WordNet.
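For the WordNet part, something like NLTK's path similarity is an obvious starting point (a sketch; taking the max over all synset pairs is my choice, and it assumes the WordNet corpus has been downloaded):

  # Sketch of "wordnet relation closeness" via NLTK's WordNet interface
  # (max path similarity over all synset pairs; needs nltk.download('wordnet')).
  from nltk.corpus import wordnet as wn

  def wn_closeness(a, b):
      scores = [s1.path_similarity(s2)
                for s1 in wn.synsets(a) for s2 in wn.synsets(b)]
      scores = [s for s in scores if s is not None]
      return max(scores) if scores else 0.0

  print(wn_closeness("cricket", "sport"))     # related terms
  print(wn_closeness("cricket", "seminar"))   # less related terms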


I encourage anyone to use Extractiv (http://www.extractiv.com) for this challenge. We're rolling out an On-Demand (local document analysis similar to Calais) Semantic Conversion service that could be helpful.


Slow down the rollout a little: your 'take a tour' page inline link to Semantic On Demand 404s, as does the inline link from your pricing page:

http://www.extractiv.com/tour-semantic-on-deman.html - missing letter in '...deman.html'

http://www.extractiv.com/tour-semantic-on-demand - missing '.html'

Also, the pricing page says users need to buy a plus or premium subscription for access to Semantic On-Demand, but confusingly says the basic plan includes 'On-Demand access to 1,000 documents / day' - I presume you meant to emphasize the unlimited usage available with the premium plans.


Also, the example from here, http://www.extractiv.com/tour-semantic-on-demand.html, does not work.

I second this; it seems interesting (and I would be really interested in seeing what you produce as "relations").


PS - I forgot to say that it looks extremely interesting from what I've read so far.


"Extractiv lets you transform unstructured web content into highly-structured semantic data."

Suggestion: Can you put sample output on your homepage, so we can see what sort of semantic analysis you perform?

p.s. this tool doesn't have anything to do with the task.



