
This sounds interesting, but the main challenge here seems to be defining the question: what is a semantically related word? I can think of three classes - variations (singular/plural), synonyms (thesaurus), contextual (occurring in the same sentence/paragraph). Once the definition of "semantically related" is established, it should be easy to find a clever algorithm to produce results. However, out of context the problem is a little boring - to find a good solution I'd like to know what real-world problem the list of "semantically related" words is going to solve.


There are two aspects of the problem that I consider particularly interesting: a) defining the problem, because being able to define the problem and justify your answer is half the puzzle in a lot of NLP work. b) scaling to a corpus of this size (vocabulary of 1M), because this scale is tricky but useful in many web problems.

Semioticians typically distinguish between paradigmatic and syntagmatic axes of semantic relatedness. Paradigmatic means that two words occur with similar other words, e.g. these two words typically have the same word immediately to their left, like "blue" and "azure". Syntagmatic means that the words typically co-occur in usage, like "blue" and "sky". Check out the image on this page for another illustration of these axes: http://www.aber.ac.uk/media/Documents/S4B/sem03.html

Regardless of whether you choose to do a paradigmatic or syntagmatic analysis, it's interesting to see how you motivate your approach and if you can scale it to 1M different vocabulary words.
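
To make the two axes concrete, here's a toy sketch in plain Python (all names are mine, not from the challenge): syntagmatic relatedness counts same-sentence co-occurrence, paradigmatic relatedness compares left/right neighbour profiles.

  from collections import Counter, defaultdict
  from math import sqrt

  sentences = [["the", "blue", "sky"], ["the", "azure", "sky"],
               ["blue", "water"], ["azure", "water"]]

  # Syntagmatic: how often two words appear in the same sentence.
  cooc = Counter()
  for s in sentences:
      for i, w in enumerate(s):
          for v in s[i + 1:]:
              cooc[frozenset((w, v))] += 1

  # Paradigmatic: how similar two words' neighbour profiles are.
  contexts = defaultdict(Counter)
  for s in sentences:
      for i, w in enumerate(s):
          if i > 0:
              contexts[w]["L:" + s[i - 1]] += 1
          if i < len(s) - 1:
              contexts[w]["R:" + s[i + 1]] += 1

  def cosine(a, b):
      num = sum(a[k] * b[k] for k in a)
      den = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
      return num / den if den else 0.0

  print(cooc[frozenset(("blue", "sky"))])             # syntagmatic: co-occur in usage
  print(cosine(contexts["blue"], contexts["azure"]))  # paradigmatic: fill the same slots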


scaling to a corpus of this size ... it's interesting to see how you motivate your approach and if you can scale it to 1M different vocabulary words

It looks like the sparsity of the matrix is going to be a much bigger challenge than the scale.

I understand about the focus being primarily on the approach, that makes sense; how are you intending to evaluate the results files?


It looks like the sparsity of the matrix is going to be a much bigger challenge than the scale.

Sparsity is good: it's the only reason you can keep a matrix with this many dimensions in memory.
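
To put rough numbers on it (a back-of-the-envelope sketch; the 1M vocabulary is from the challenge, the ~100 nonzeros per word is my guess):

  from scipy.sparse import lil_matrix

  V = 1_000_000                     # vocabulary size from the challenge
  M = lil_matrix((V, V), dtype="float32")
  M[0, 42] += 1.0                   # only observed co-occurrences are stored

  dense_bytes = V * V * 4           # ~4 TB if stored densely
  sparse_bytes = V * 100 * (4 + 4)  # rough: 4-byte value + 4-byte index per entry
  print(dense_bytes / 1e12, "TB dense vs", sparse_bytes / 1e9, "GB sparse")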

I understand about the focus being primarily on the approach, that makes sense; how are you intending to evaluate the results files?

For each submission, I'll post the entry's 10 related terms for a random subset of vocab words, then ask people to vote blind.


I think a fairer comparison would be to show THE SAME random subset for each entry, i.e. the same X words for all result sets.

Otherwise, a superior entry might be judged on words that simply have no good related terms, while an inferior one happens to get better-covered words.


Yes, it will be the same random subset for all participants. Sorry if that wasn't clear.
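
In case it helps anyone replicating this at home, fixing the seed is an easy way to guarantee that (a sketch; vocab.txt and the submission format are my assumptions, not the actual challenge files):

  import random

  vocab = open("vocab.txt").read().split()  # hypothetical vocabulary file
  rng = random.Random(1234)                 # fixed seed => same words for every entry
  probe_words = rng.sample(vocab, 50)

  # Print each submission's 10 related terms for the *same* probe words.
  def show(submission):  # submission: dict mapping word -> list of 10 terms
      for w in probe_words:
          print(w, "->", submission.get(w, []))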


If the documents they provide are meaningful texts, the related words will probably turn out to be contextual (such a corpus would be mostly useless for finding synonyms).


> What is a semantically related word?

Right, I was going to say: just get a dump of WordNet and build a thesaurus graph...


I think the challenge implies that the only data you can use is the provided dataset. Otherwise it just makes no sense - if you pull the similarity data from somewhere else, then where's the challenge?


That's why I'm confused.

Somebody else in this thread brought up performing some kind of k-gram analysis and building a "thesaurus" of sorts from that. While that can be really good for vector-space-style document matching, if you try to actually "read" the results, you can get some weirdness.

  The duck died.
  The car died.
Ergo duck <semantically equivalent> car.
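
You can reproduce that weirdness in a few lines (a toy sketch of my own, using context overlap as the similarity measure):

  sents = [["the", "duck", "died"], ["the", "car", "died"]]
  ctx = {}
  for s in sents:
      for i, w in enumerate(s):
          ctx.setdefault(w, set()).update(s[:i] + s[i + 1:])

  # "duck" and "car" share every context word, so any context-overlap
  # measure (Jaccard here) calls them a perfect match.
  a, b = ctx["duck"], ctx["car"]
  print(len(a & b) / len(a | b))  # 1.0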



