
For me the key point to understanding what's going on (assuming I got it) is that the hidden layer "has" to produce similar representations for words that appear in the same contexts, so that the output layer can predict them.
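Concretely, something like this (a toy sketch with gensim; the corpus is made up, and I've picked the hyperparameters arbitrarily) shows words that fill the same slot ending up with nearby vectors:

    from gensim.models import Word2Vec

    # Toy corpus: "rim" and "goalpost" occur in the same slots, so the
    # hidden layer is pushed to give them similar representations in
    # order to predict the shared surrounding words.
    sentences = [
        ["the", "ball", "hit", "the", "rim"],
        ["the", "ball", "hit", "the", "goalpost"],
        ["the", "player", "aimed", "at", "the", "rim"],
        ["the", "player", "aimed", "at", "the", "goalpost"],
    ] * 50  # repeat so the toy model has enough examples

    model = Word2Vec(sentences, vector_size=20, window=2,
                     min_count=1, sg=1, epochs=50)

    # Expect a fairly high cosine similarity between the two words.
    print(model.wv.similarity("rim", "goalpost"))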

The intuition behind doc2vec is a bit harder to grasp. I understand the role of the "paragraph word": it provides context for the prediction, so for "the ball hit the ---" in a basketball text the classifier would predict "rim", and in a football one "goalpost" (simplifying). But I still don't get why similar texts end up with similar latent representations.
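To make the question concrete, here is the kind of behaviour I mean (again a toy sketch with gensim; the documents and tags are invented), where documents using similar words in similar contexts come out with nearby paragraph vectors:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Two "basketball" documents and two "football" documents. Each
    # paragraph vector is trained to help predict the words of its own
    # document alongside the word vectors.
    docs = [
        TaggedDocument(["the", "ball", "hit", "the", "rim"], ["bball_1"]),
        TaggedDocument(["he", "dunked", "over", "the", "rim"], ["bball_2"]),
        TaggedDocument(["the", "ball", "hit", "the", "goalpost"], ["foot_1"]),
        TaggedDocument(["he", "kicked", "it", "past", "the", "goalpost"], ["foot_2"]),
    ]

    model = Doc2Vec(docs, vector_size=20, window=2, min_count=1, epochs=100)

    # The two basketball documents tend to be closer to each other
    # than to the football ones -- this is the part I'd like explained.
    print(model.dv.similarity("bball_1", "bball_2"))
    print(model.dv.similarity("bball_1", "foot_1"))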


