I suspect it will take more like 10 years, at least, to produce convincing video. The underlying technology isn't too far off, but the compute requirements are pretty extreme without some clever optimization, and a lot of careful stitching work is needed too.
You need models that can take a description of a scene and produce a storyboard-like series of low-res images (and maybe vice versa). Then you need a model that can infer semantics and movement in logical ways between those panels, generating images to fill in the gaps. Then you need lots and lots of cleanup and resolution enhancement, both of individual frames and of the transitions between neighboring frames, without introducing all kinds of weird, fuzzy, shifting, dream-like artifacts.
...Then you've got to somehow add audio that ALSO understands semantics in the same way as the storyboard. Maybe something that can generate an audio clip to go with the storyboard, and then fill in the gaps based on the generated video. Making those match seems like a really hard, but not impossible, problem. In the short term, syncing a bunch of moaning to mouth movements at appropriate times seems feasible, though.
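The staged pipeline above (storyboard → in-between motion → cleanup and upscaling) could be sketched roughly like this. Everything here is a hypothetical stub; in a real system each function would be a trained model, and the `Frame` type would carry pixels rather than just metadata:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    t: float          # timestamp in seconds
    resolution: int   # square resolution, for simplicity

def storyboard(description: str, panels: int = 4) -> list[Frame]:
    """Stage 1 (stub): scene description -> sparse low-res keyframe panels."""
    return [Frame(t=i * 2.0, resolution=64) for i in range(panels)]

def interpolate(keyframes: list[Frame], fps: int = 8) -> list[Frame]:
    """Stage 2 (stub): infer motion between panels, filling the gaps at low res."""
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        n = int((b.t - a.t) * fps)
        frames += [Frame(t=a.t + k / fps, resolution=64) for k in range(n)]
    return frames + [keyframes[-1]]

def enhance(frames: list[Frame], target: int = 512) -> list[Frame]:
    """Stage 3 (stub): per-frame and cross-frame cleanup / super-resolution."""
    return [Frame(t=f.t, resolution=target) for f in frames]

def generate(description: str) -> list[Frame]:
    """Compose the three stages into one text-to-video pass."""
    return enhance(interpolate(storyboard(description)))

video = generate("a scene description")
```

The point of the sketch is the composition: each stage only has to solve a narrower problem (layout, motion, fidelity) than end-to-end text-to-video would.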
Although I expect fairly high-quality text-to-image porn is only a few months to a year away.
The technology is there; someone just needs to pay to train the model... and the cost of compute is what, like 300 grand? A few hundred grand more should buy enough engineering to apply existing techniques. Call it $1 million in total costs, which seems like a small enough bet for a product that could get plenty of members paying a monthly fee.
At the rate AI image generation is going, I highly doubt it'll take another 10 years. It was only 10 years ago that AlexNet came onto the scene and blew away the image-recognition contests.
- we will surely see sequential image synthesis by then
- we will surely see audio synthesis matched to on-screen motion
- we will surely see single image to 3D reconstruction
- we will surely see haptic feedback and VR progress
- we will win.