
I find this similar to what relation vectors do in word2vec: you can add a vector of "X of" and often get the correct answer. It could be that the principle is still the same, and transformers "just" build a better mapping of entities into the embedding space?
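A toy numpy sketch of that relation-vector arithmetic (the vectors below are made-up values chosen so the analogy works, not learned word2vec embeddings): subtract one entity from its related entity to get the "X of" offset, add it to a new entity, and look up the nearest neighbor by cosine similarity.

```python
import numpy as np

# Hypothetical 3-d "embeddings", hand-picked purely for illustration.
emb = {
    "france": np.array([1.0, 0.0, 0.2]),
    "paris":  np.array([1.0, 1.0, 0.2]),
    "italy":  np.array([0.0, 1.0, 0.8]),
    "rome":   np.array([0.0, 2.0, 0.8]),
}

def nearest(v, exclude=()):
    """Word whose embedding has the highest cosine similarity to v."""
    best, best_sim = None, -2.0
    for w, e in emb.items():
        if w in exclude:
            continue
        sim = v @ e / (np.linalg.norm(v) * np.linalg.norm(e))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Extract the "capital of" offset from one pair, apply it to another:
capital_of = emb["paris"] - emb["france"]
print(nearest(emb["italy"] + capital_of, exclude=("italy",)))  # → rome
```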


I think so. It’s hard for me to believe that the decision surfaces inside those models are really curved enough (like the folds of your brain) to take full advantage of FP32 precision inside the vectors: that is, I just don’t believe it is

  x = 0 means “fly”
  x = 0.01 means “drive”
  x = 0.02 means “purple”
but rather more like

  x < 1.5 means “cold”
  x > 1.5 means “hot”
which is one reason why quantization (often down to 1 bit) works. It is also one reason why you can often get great results by feeding text or images through a BERT- or CLIP-type model and then applying classical ML models that frequently involve linear decision surfaces.
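A quick sketch of why 1-bit quantization can work if the decision surfaces are coarse rather than finely curved (toy random data, not a real model): binarize vectors to their signs and check that cosine similarity on the 1-bit vectors still tracks the full-precision similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # high dimension, as in typical embedding models

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

full, onebit = [], []
for target in np.linspace(0.0, 0.95, 40):
    u = rng.standard_normal(d)
    w = rng.standard_normal(d)
    # Build a second vector with roughly `target` cosine similarity to u.
    v = target * u + np.sqrt(1 - target**2) * w
    full.append(cos(u, v))
    # 1-bit quantization: keep only the sign of each coordinate.
    onebit.append(cos(np.sign(u), np.sign(v)))

r = np.corrcoef(full, onebit)[0, 1]
print(round(r, 3))  # high correlation: the sign bits keep the coarse geometry
```

The sign bits throw away everything the FP32 mantissa encodes, yet similarity rankings survive, which is consistent with the "x < 1.5 vs. x > 1.5" picture above.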


Are you conflating nonlinear embedding spaces with the physical curvature of the cerebellum? I don't think there's a direct mapping.


My mental picture is that violently curved decision surfaces could look like the convolutions of the brain even though they have nothing to do with how the brain actually works.

I think of how t-SNE and similar algorithms sometimes produce projections that look like that (maybe that’s just what you get when you have to bend something complicated to fit into a 2-D space), and frequently show cusps that look to me like a sign of trouble. (It took me a while in my PhD work to realize how Poincaré sections from 4 or 6 dimensions can look messed up when a part of the energy surface tilts perpendicular to the projection surface.)

I still find it hard to believe that dense vectors are the right way to deal with text, despite the fact that they work so well. Images are one thing, because changing one pixel a little doesn’t change the meaning of an image, but changing a single character can completely change the meaning of a text. There’s also the reality that if you randomly stick tokens together you get something meaningless, so it seems almost all of the representation space covers ill-formed texts and only a low-dimensional manifold holds the well-formed ones. Then the decision surfaces really do have to be nonlinear and crumpled, but I think there’s definitely a limit on how crumpled those surfaces can be.


This is interesting. It makes me think of an "immersion"[0], as in a generalization of the concept of "embedding" in differential geometry.

I share your uneasiness about mapping words to vectors and agree that it feels as if we're shoehorning some more complex space into a computationally convenient one.

[0] https://en.wikipedia.org/wiki/Immersion_(mathematics)



