The main thing these comments leave me confused about is that, as far as I understand, the Sora model (and many others like it) performs the diffusion process in latent space and then translates that to pixels.
So it's strange to me to claim that it doesn't have an abstract representation.
But maybe the latent space of a diffusion-VQVAE pipeline is fundamentally different from JEPA's; I haven't read the relevant papers for that. Curious whether someone could explain if they are different ideas of representation.
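For whoever answers: here's my rough mental model of the two objectives as toy code. Every module below is an illustrative stand-in, not Sora's or V-JEPA's actual architecture, and the JEPA half is my best guess from secondhand descriptions, so correct me where I'm wrong.

```python
# Toy contrast of the two training objectives; all modules are illustrative stand-ins.
import torch
import torch.nn as nn

enc = nn.Linear(64, 16)        # stand-in VAE/VQ-VAE encoder: pixels -> latent
denoiser = nn.Linear(16, 16)   # stand-in diffusion denoiser (timestep omitted for brevity)
predictor = nn.Linear(16, 16)  # stand-in JEPA predictor: context latent -> target latent

x_ctx = torch.randn(8, 64)     # "context" view (visible patches / earlier frames)
x_tgt = torch.randn(8, 64)     # "target" view (masked patches / later frames)

# Latent diffusion: learn to denoise in latent space; a separate decoder maps latents back to pixels.
z = enc(x_tgt)
noise = torch.randn_like(z)
diffusion_loss = ((denoiser(z + noise) - noise) ** 2).mean()   # predict the injected noise

# JEPA-style: predict the *embedding* of the target from the context; no pixel loss anywhere.
with torch.no_grad():
    z_tgt = enc(x_tgt)          # target encoder (an EMA copy in the actual papers, I believe)
jepa_loss = ((predictor(enc(x_ctx)) - z_tgt) ** 2).mean()

print(float(diffusion_loss), float(jepa_loss))
```

So both have a latent, but only one pipeline ever has to reproduce pixels from it; whether that makes them "different ideas of representation" is exactly what I'm asking.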
The claim, with admittedly limited consensus, is that a difference in degree becomes a difference in kind: we’ve known “forever in AI years” that restricting the inputs, the loss, and the parameter dimensionality matters roughly as much as the underlying architecture for the properties those latent geometries exhibit. It’s even kind of a meme: the fingers problem (which neither Sora nor V-JEPA partisans seem to be claiming is “solved”).
I’m a little oversubscribed at the moment, so I haven’t downloaded the weights and played around with V-JEPA. The fact that I’m pointing out that I could should make it pretty clear which way I lean on this: everyone wants to make money, let’s be real, but the “infinite, government-enforced monopoly via OpenPhilanthropy bribery” play that some seem resigned to isn’t going to fly, so let’s at least show how the sausage is made.
It’s fairly uncontroversial that things in the neighborhood of a “bottlenecked” VAE are often forced to exploit structure if it’s there.
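A throwaway illustration of that point, tied to no particular paper: data with two real degrees of freedom, squeezed through a two-unit bottleneck, reconstructs well precisely because the encoder has no choice but to find those two factors.

```python
# Minimal bottleneck demo: 20-D observations that actually live on a 2-D subspace.
import torch
import torch.nn as nn

factors = torch.randn(256, 2)            # the true low-dimensional structure
mixing = torch.randn(2, 20)
x = factors @ mixing                     # observed 20-D data

autoenc = nn.Sequential(nn.Linear(20, 2), nn.Linear(2, 20))  # 2-unit bottleneck
opt = torch.optim.Adam(autoenc.parameters(), lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    loss = ((autoenc(x) - x) ** 2).mean()   # reconstruct through the bottleneck
    loss.backward()
    opt.step()

print(loss.item())   # low only because the encoder recovered the two underlying factors
```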
This (claimed) result is about a way to exploit more of that structure/symmetry economy (with a greedy-ish optimizer) to pull the latent representations into a higher “effective regime” than has yet been demonstrated, with excellent properties around machine economics.
Representation learning isn’t new (though LeCun is an acknowledged pioneer in it), and constraints as a powerful tool aren’t new either (causal masking in attention architectures shouldn't ruffle many feathers).
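For anyone who hasn’t seen it spelled out, that constraint is a few lines; toy numbers below:

```python
# Causal masking: each position may attend only to itself and earlier positions.
import torch

T = 5
scores = torch.randn(T, T)                        # raw attention scores (toy values)
mask = torch.ones(T, T).triu(diagonal=1).bool()   # True above the diagonal = "the future"
attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(attn)   # upper triangle is exactly zero; each row sums to 1 over past/self
```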
Self/semi/unsupervised learning isn’t novel either. But likewise, they aren’t equivalent synonyms: Mikolov, Dean, et al. were distinguishing between Continuous Bag of Words and skip-gram in word2vec back in 2013.
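Concretely, that distinction is just which direction the window predicts; a toy sentence, no embeddings or training involved:

```python
# CBOW vs skip-gram: same window, opposite direction of prediction.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 1

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))      # CBOW: context words -> center word
    for c in context:
        skipgram_pairs.append((center, c))    # skip-gram: center word -> each context word

print(cbow_pairs[1])        # (['the', 'sat'], 'cat')
print(skipgram_pairs[:4])   # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```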
But it does cool shit (I like the I-JEPA reconstructions as a go-to slide-deck raster), and the weights are available.