It just plain isn't possible if you mean a prompt the size of what most people have been using lately, in the couple-hundred-character range. By sheer information theory, the number of possible interpretations of "a zoom in on a happy dog catching a frisbee" means you cannot pick out one particular clip from that set with just that much text. You will need vastly more content: information about the breed, information about the frisbee, information about the background, information about timing, information about framing, information about lighting, and so on and so forth. Right now the AIs can't handle that, which is to say, even if you sit there and type a prompt containing all that information, the model is going to be forced to ignore most of it. Under the hood, given the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree it can even represent such a thing.
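To make the mismatch concrete, here is a rough back-of-envelope comparison; every number in it is an illustrative assumption chosen only to show the orders of magnitude, not a measurement.

```python
# Back-of-envelope: bits of specification in a short prompt vs. bits of
# "stuff that has to be decided" in a short clip. All numbers are guesses.

prompt_chars = 200
bits_per_char = 1.5                  # rough entropy of English per character
prompt_bits = prompt_chars * bits_per_char        # ~300 bits

seconds, fps = 5, 24
meaningful_bits_per_frame = 10_000   # breed, pose, framing, lighting, ...
clip_bits = seconds * fps * meaningful_bits_per_frame

print(f"prompt specifies ~{prompt_bits:.0f} bits")
print(f"clip contains   ~{clip_bits:,} bits of decisions")
print(f"the model invents the remaining ~{clip_bits - prompt_bits:,.0f} bits")
```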
This isn't a matter of human-level AI or superhuman-level AI; it's just straight-up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that will make the scene work, but expecting it to fill the gaps the way you "want", when you gave it no indication of what that is, is expecting literal magic.
Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie starring a precocious 22-year-old elf in a far-future Ganymede colony with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to do a lot of internal communication in some internal language to hold the result together between scenes, because the amount of English text it takes to keep things coherent between scenes is not entirely dissimilar in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human-language mapping.
And this applies to language / code outputs as well.
The number of times I’ve had engineers at my company type out five sentences and then expect a complete React web app.
But what I’ve found in practice is that using LLMs to generate the prompt from low-effort human input (e.g. thumbs up/down, multiple choice, etc.) is quite useful. It generates walls of text, but with metaprompting, that’s kind of the point. With this, I’ve definitely been able to get high ROI out of LLMs. I suspect the same would work for vision output.
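For what it's worth, a minimal sketch of that kind of metaprompting loop might look like the following; `llm` is a hypothetical placeholder for whatever text model you actually call, and the human feedback is deliberately cheap (a yes/no plus one short note).

```python
# Minimal metaprompting sketch: the LLM writes and revises the wall-of-text
# prompt; the human only supplies low-effort feedback each round.

def llm(prompt: str) -> str:
    # Hypothetical stand-in; plug in your provider's completion call here.
    raise NotImplementedError

def metaprompt_loop(goal: str, rounds: int = 3) -> str:
    detailed = llm(
        "Expand this into a detailed video prompt covering subject, breed, "
        f"framing, lighting, background, and timing:\n{goal}"
    )
    for _ in range(rounds):
        print(detailed)
        if input("good enough? [y/n] ").strip().lower() == "y":
            break
        note = input("what's off (one short phrase)? ")
        detailed = llm(
            "Revise this video prompt. Keep what works, fix only this "
            f"complaint: {note}\n\n{detailed}"
        )
    return detailed
```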
I'm not sure, but I think you're saying what I'm thinking.
Stick the video you want to replicate into o1 and ask for a descriptive prompt to generate a video with the same style and content. Take that prompt and put it into Sora. Iterate with human- and o1-generated critical responses.
I suspect you can get close pretty quickly, but I don't know the cost. I'm also suspicious that they might have put in "safeguards" to prevent some high profile/embarrassing rip-offs.
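A sketch of that loop, with every call being a hypothetical placeholder (none of these are real APIs; they stand in for a multimodal describer, a video generator, and a critic).

```python
# Replicate-by-description loop: describe the target, generate, compare,
# fold the critique back into the prompt, repeat. All three helpers are
# hypothetical placeholders to be wired up to whatever models you use.

def describe_video(path: str) -> str:
    raise NotImplementedError  # e.g. a multimodal model writing a dense prompt

def generate_video(prompt: str) -> str:
    raise NotImplementedError  # e.g. a text-to-video model; returns a clip path

def critique(target: str, candidate: str) -> str:
    raise NotImplementedError  # e.g. "wrong breed, lighting too warm"

def replicate(target_clip: str, iterations: int = 5) -> str:
    prompt = describe_video(target_clip)
    candidate = generate_video(prompt)
    for _ in range(iterations):
        diff = critique(target_clip, candidate)
        if diff.strip().lower() == "close enough":
            break
        prompt += f"\n\nAdjust: {diff}"
        candidate = generate_video(prompt)
    return candidate
```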
> Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible.
Why snippets? Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence.
This almost certainly won’t work. Feel free to feed any of the hundreds of existing film scripts to a model and test how coherent the result is. My guess: not at all.
> The clips on the Sora site today would have been utterly astonishing ten years ago.
Yeah, and Apollo 11 would have been utterly astonishing a decade before it occurred. And, yet, if you tried to project out from it to what further frontiers manned spaceflight would reach in the following decades, you’d…probably grossly overestimate what actually occurred.
> Long term progress can be surprising.
Sure, it can be surprising for optimists as well as naysayers; as a good rule of thumb, every curve that looks exponential in an early phase ends up being at best logistic.
In the long run we are all dead. Saying that technology will be better in the future is almost eye-roll worthy. The real task is predicting what future technology will be, and when it will arrive.
Ask anyone with a chronic illness about the future and they'll tell you we're about 5 years off a cure. They've been saying that for decades. Who knows where the future advancements will be.
The Blair Witch Project was a (surprise) creative masterpiece. It worked with very limited technology to create a very clever plot, paired with amazing marketing. The combination was something the world hadn’t seen before. It took some creative geniuses to piece The Blair Witch Project together.
Generative AI will never produce an experience like that. I know never is a long time, but I’m still gonna call it. You simply can’t produce such a fresh idea by gathering a bunch of data and interpolating.
Maybe someday AI will be good enough to create shorter or longer videos with some dialogue and even a coherent story (though I doubt it), but it won’t be fresh or creative. And we humans will at best enjoy it for its stupidity or sloppiness, not for its cleverness or artistry.
Why does the idea need to be generated by AI? Let people generate the ideas, the AI will help execute. I think soon (3-5 years) a determined person with no video skills will be able to put together a compelling movie (maybe a short). And that is massive. AI doesn’t have to do everything. Like all tech, it’s a productivity tool.
This is the at-first-fun-but-now-frustrating infinite goal move. "AI (a stand in for literally anything) will do (anything) soon." -> "It won't do (thing), it's too complex." -> "Who said AI will do (thing)?"
I'm suspicious of most claims of AI growth, but I think screenwriting is an area where there's real potential. There are many screenplays out there, many movie plots are very similar to each other, and human raters could help with training. And it's worth noting that the top four highest grossing movies right now are all sequels or film adaptations. It's not a huge leap to imagine an LLM in the future that's been trained on movie writing being able to create a movie script when given the Wicked musical.
https://www.imdb.com/chart/boxoffice/
The 2023 Writers Guild of America strike was in part to prevent screenplays being written entirely by generative AI.
So no, I don’t think this will happen either. Authors may use AI themselves as one tool in their toolbox as they write their script, but we will not see entire production screenplays written by generative AI and set for theatrical release. The industry will simply not allow that to happen. At most you can have AI write a screenplay for your own amusement, not for publication.
I'm thinking more of a Gibsonian 'Garage Kubrick': a solitary auteur (or small team) that produces the film alone, perhaps without even touching a camera, generating all the footage using AI (in the novel the auteur creates all the footage through photo/found-footage manipulation, or at least that's all we see in the text). The script will probably be human-written; I'm not talking about an AI producing a film from scratch, rather a film being produced using AI to create all the visuals and audio.
That is a far more reasonable prediction, but I don’t even see this future. This kind of “film making” will at best be something generated for the amusement of the creator (think: give me a specific episode of Star Trek where Picard ...) or as prototypes or concepts for films yet to be shot with actual actors. And it certainly won’t be in theaters, not in 5 years, or ever.
Generative AI will not be able to approach the artistry of your average actor (not even a bad actor); it won’t be able to match the lighting or the score to the mood (unless you carefully craft that in your prompt). It won’t get creative with the camera angles (again, unless you specifically prompt for a specific angle) or the cuts. And it probably won’t stay consistent with any of these, or know to break the consistency at the right moments, like an artist could.
If you manage to prompt the generative AI to create a full feature film with excellent acting, the correct lighting for the mood, a consistent tone with editing to match, etc., you have probably spent much more time and money crafting the prompt than would otherwise have gone into simply hiring the crew to create your movie. And the AI movie will still contain slop and be so visibly bad that it is guaranteed not to make it into theaters.
Now if you hired that crew to make the movie instead, that crew might use AI as a tool to enhance their artistry, but you still need your specialized artists to use that tool correctly. That movie might make it to the theaters.
The Blair Witch Project looked like shit, 'the cinematography doesn't approach a true director of photography', the actors were shit... etc. Given the right script and concept it can be amazing, and the imperfection of AI can become part of the aesthetic.
It was still a creative stroke of genius. The shit acting and the shit cinematography were preceded by a brilliant marketing campaign that made you expect this lack of skill from the filmmakers.
In music you also have plenty of artists that have no clue how to play their instruments, or progress their songs, but the music is nonetheless amazing.
Skill is not the only quality of art. A brilliant artist works with their limitations to produce work which is better than the sum of its parts. It will take AI the luck of ten billion universes before it produces anything like that.
So what you are saying is some aspects of movie making will use AI as parts of their jobs. That is very realistic and probably already happening.
Saying that large video models will be in theaters sounds like a completely different and much more ambitious prediction. I interpreted it as large video models producing whole movies on their own from a script of prompts: a single film maker with only a large video model and some prompts making the movie. Such films will never be in the theater, unless made by some grifter, and then it is certain to be a flop.
You should watch how movies are made sometime. How a script is developed. How changes to it are made. How storyboards are created. How actors are screened for roles. How locations are scouted, booked, and changed. How the gazillion of different departments end up affecting how a movie looks, is produced, made, and in which direction it goes (the wardrobe alone, and its availability and deadlines will have a huge impact on the movie).
What does "EXT. NIGHT" mean in a script? Is it cloudy? Rainy? Well lit? What are camera locations? Is the scene important for the context of the movie? What are characters wearing? What are they looking at?
What do actors actually do? How do they actually behave?
Here are a few examples of script vs. screen.
Here's a well described script of Whiplash. Tell me the one hundred million things happening on screen that are not in the script: https://www.youtube.com/watch?v=kunUvYIJtHM
Or here's the Joker interrogation from The Dark Knight. Same million different things, including actors (or the director) ignoring instructions in the script: https://www.youtube.com/watch?v=rqQdEh0hUsc
I can't upvote this enough. This topic in the media space has generated a huge amount of naive speculation that amounts to "how hard could it be to do <thing i know nothing about>?"
> "how hard could it be to do <thing i know nothing about>?"
This is most Hacker News comments summarized lmao. It's kinda my favorite thing about this place: just open any thread and you immediately see so many people rushing to say "well just do X or Y" or "actually it's X or Y and not Z like the experts claim". Love it.
At the same time I am curious in the "that person has too many fingers" sense at what a system trained on tens of thousands of movies plus scripts plus subtitles plus metadata etc. would generate.
I thought about it for a bit and I would want to watch a computer generated Sharknado 7 or Hallmark Christmas movie.
Of course normally other people contribute to a movie after the writer. My comment mentioned three of the important roles. This whole thread is about tech that automates away those roles. That's the whole point.
Let's pick something concrete. It's a medieval script, and it opens with two knights fighting. Later in the script we learn their characters, historic counterparts, etc. So your LLM can match "nefarious villain" to some kind of embedding, and has doubtless trained on countless images of a knight.
But the result is not naively going to understand the level of reality the script is going for - how closely to stick to historic parallels, how much to go fantastical with the depiction. The way we light and shoot the fight and how it coheres with the themes of the scene, the way we're supposed to understand the characters in the context of the scene and the overall story, the references the scene may be making to the genre or even specific other films etc.
This is just barely scraping the surface of the beginnings of thinking about mise en scene, blocking, framing etc. You can't skip these parts - and they're just as much of a challenge as temporal coherence, or performance generation or any of the other hard 'technical issues' that these models have shown no capacity to solve. They're decisions that have to be made to make a film coherent at all - not yet good or tasteful or creative or whatever.
Put another way: you'd need AGI to comprehend a script at the level of depth required to do the job of any HOD on any film. Such a thing is doubtless possible, but it's not going to be shortcut naively the way generating an image is, because it requires understanding in context, which is precisely what LLMs lack.
> but the result is not naively going to understand the level of reality the script is going for…
We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner brothers cartoon, or hyper realistic works today. So does lighting instructions, color palettes, on and on.
These future models will not be large language models, they will be multi-modal. Large movie models if you like. They will have tons of context about how scenes within movies cohere, just as LLMs do within documents today.
So, we went from "just hand off the movie script to an automated director/DP/editor", and now we're rapidly approaching:
- you have to provide correct detailed instructions on lighting
- you have to provide correct detailed instructions on props
- you have to provide correct detailed instructions on clothing
- you have to provide correct detailed instructions on camera position and movement
- you have to provide correct detailed instructions on blocking
- you have to provide correct detailed instructions on editing
- you have to provide correct detailed instructions on music
- you have to provide correct detailed instructions on sound effects
- you have to provide correct detailed instructions on...
- ...
- repeat that for literally every single scene in the movie (up to 200 in extreme cases)
There's a reason I provided a few links for you to look at. I highly recommend the talk by Annie Atkins. Watch it, then open any movie script, and try to find any of the things she is talking about there (you can find actual movie scripts here: https://imsdb.com)
There are two reasons to be hopeful about it, though. AI/LLMs are very good at filling in all those little details so humans can cherry-pick the parts that they like. I think that's where the real value is for the masses: once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like. Sort of like SegmentAnything and masking in inpainting, but for the rest of the scene assembly. The other reason is that the models can probably be architected to figure out environment/character/light/etc. embeddings and use those to build up other coherent scenes, like we use language embeddings for semantic similarity (a toy sketch of that idea is below).
That's how I've been using the image generators - lots of experimentation and throwing out the stuff that doesn't work. Then once I've got enough good generated images collected out of the tons of garbage, I fine tune a model and create a workflow that more consistently gives me those styles.
Now the models and UX to do this at a cinematic quality are probably 5-10 years away for video (and the studios are probably the only ones with the data to do it), but I'm relatively bullish on AI in cinema. I don't think AI will be doing everything end to end, but it might be a shortcut for people who can write a script and figure out the UX to execute the rest of the creative process by trial and error.
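Here is the toy sketch mentioned above, at the text level only: score how consistent consecutive scene descriptions are and flag likely continuity breaks. The sentence-transformers model is just a stand-in for the environment/character/light embeddings a real video model would need, and the 0.5 threshold is an arbitrary heuristic.

```python
# Continuity check via embeddings: adjacent scenes in the same location
# should embed close together; a big similarity drop hints at a break.
# sentence-transformers on prose is only a proxy for visual embeddings.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

scenes = [
    "Night. A cramped trailer office, plywood panelling, a gunmetal desk.",
    "Night. The same trailer office, papers now scattered across the desk.",
    "Bright noon exterior, a sunlit beach with palm trees.",
]

emb = model.encode(scenes, convert_to_tensor=True)
for i in range(len(scenes) - 1):
    sim = util.cos_sim(emb[i], emb[i + 1]).item()
    status = "ok" if sim > 0.5 else "possible continuity break"  # crude threshold
    print(f"scene {i} -> {i + 1}: similarity {sim:.2f} ({status})")
```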
> AI/LLMs are very good at filling in all those little details so humans can cherry pick the parts that they like.
Where did you find AI/ML that is good at filling in actual required and consistent details?
I beg of you to watch Annie Atkins' presentation I linked: https://www.youtube.com/watch?v=SzGvEYSzHf4 and tell me how much intervention would AI/ML need to create all that, and be consistent throughout the movie?
> once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like.
Define "coherent scene" and "explore". A scene must be both coherent and consistent, and conform to the overall style of the movie and...
Even such a simple thing as shot/reverse shot requires about a million various details and can be shot in a million different ways. Here's an exploration of just shot/reverse shot: https://www.youtube.com/watch?v=5UE3jz_O_EM
All those are coherent scenes, but the coherence comes from a million decisions: from lighting, camera position, lens choice, wardrobe, what surrounds the characters, what's happening in the background, makeup... There's no coherence without all these choices made beforehand.
Around 4:00 mark: "Think about how well you know this woman just from her clothes, and workspace". Now watch that scene. And then read its description in the script https://imsdb.com/scripts/No-Country-for-Old-Men.html:
--- start quote ---
Chigurh enters. Old plywood paneling, gunmetal desk, litter
of papers. A window air-conditioner works hard.
A fifty-year-old woman with a cast-iron hairdo sits behind
the desk.
--- end quote ---
And right after that there's a section on the rhythm of editing. Another piece in the puzzle of coherence in a scene.
> Then once I've got enough good generated images collected out of the tons of garbage, I fine tune a model and create a workflow that more consistently gives me those styles.
That’s the same thing with digital art, even with the most effortless one (matte painting), there’s a plethora of decisions to make and techniques to use to have a coherent result. There’s a reason people go to school or trained themselves for years to get the needed expertise. If it was just data, someone would have written a guide that others would mindlessly follow.
Not sure why you jumped there. I was thinking more like ‘make it look like Bladerunner if Kurosawa directed it, with a score like Zimmer.’
You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.
Now, I don’t expect we’ll get the best movies this way. But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.
Your original claim: "Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence."
Two comments later it's this: "We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner brothers cartoon, or hyper realistic works today. So does lighting instructions, color palettes, on and on."
I just re-wrote this with respect to movies.
> I was thinking more like ‘make it look like Bladerunner if Kurosawa directed it, with a score like Zimmer.’
Because, as we all know, every single movie by Kurosawa is the same, as is every single score by Hans Zimmer, so it's ridiculously easy to recreate any movie in that style, with that music.
> You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.
Yes, and Midjourney today really sucks at:
- being consistent
- creating proper consistent details
A general prompt will give you a general result that is usually very far from what you actually have in mind.
And yes, you will have to prescribe a lot of small things if you want your movie to be consistent. And for your movie to make any sense.
Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, which sound effects to make, just from the script you hand in?
You can start with a very simple scene I referenced in my original reply: two people talking at the table in Whiplash.
> But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.
Even those movies have more details and more care than you can get out of AIs (now, or in the foreseeable future).
> Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, which sound effects to make, just from the script you hand in?
I think you're still assuming I always want to choose those things. That's why we're talking past each other. A good movie making model would choose for me unless I give explicit directions. Today we don't see long-range coherence in the results of movie (or game engine) models, but the range is increasing, and I'm willing to bet we will see movie-length coherence in the next decade or so.
By the way, I also bet that if I pasted exactly the No Country for Old Men script scene description from up this thread into Midjourney today it would produce at least some compelling images with decent choices of wardrobe, lighting, set dressing, camera angle, exposure, etc etc. That's what these models do, because they're extrapolating and interpolating between the billion images they've seen that contained these human choices.
AFAIK Midjourney produces single images, so the relevant scope of consistency is inside the single image only. Not between images. A movie model needs coherence across ~160,000 images, which is beyond the state of the art today but I don't see why it's impossible or unreasonable in the long run.
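The ~160,000 figure is just frame-count arithmetic, assuming a standard 24 fps feature of roughly 110 minutes:

```python
# Rough frame count for a feature film at 24 fps.
minutes, fps = 110, 24
print(minutes * 60 * fps)  # 158,400 frames that all have to agree with each other
```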
> A general prompt will give you a general result that is usually very far from what you actually have in mind.
Which is only a problem if I have something in mind. Alternatively I can give no guidance, or loose guidance, make half a dozen variations, pick the one I like best. Maybe iterate a couple of times into that variation tree. Just like the image generators do.
Shane Carruth (Primer) released interesting scripts for "A Topiary" and "The Modern Ocean" which now have no hope of being filmed. I hope AI can bring them to life someday. If we get tools like ControlNet for video, maybe Carruth could even "direct" them himself.
This exists already actually. Kling AI 1.5. Saw the demo on twitter two days ago, which shows a photo-to-video transformation on an image of three women standing on a beach, and the video transformation simulates the camera rotating, with the women moving naturally. Just involves a segment-anything style selection of the women, and drawing a basic movement vector.
That's what I describe at the end, albeit quickly in lingo, where the internal coherence is maintained in internal embeddings that are never related to English at all. A top-level AI could orchestrate component AIs through embedding vectors, but you'll never do it with a human trying to type out descriptions.
> Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.
The text encoder may not be able to know complex relationships, but the generative image/video models that are conditioned on said text embeddings absolutely can.
Flux, for example, uses the very old T5 model for text encoding, but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt: https://x.com/minimaxir/status/1820512770351411268
> but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt
Flux certainly does not consistently do so across an arbitrary collection of multi-paragraph prompts, as anyone who's run more than a few long prompts past it would recognize. The tweet is also wrong in the other direction: longer language-model-preprocessed prompts for models that use CLIP (like various SD1.5 and SDXL derivatives) are, in fact, a common and useful technique. (You’d think the fact that the generated prompt here is significantly longer than the 256-token window of T5 would be a clue that the 77-token limit of CLIP might not be as big a constraint as the tweet was selling it as, too.)
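If you want to see where a long prompt actually falls against those windows, counting tokens with the two tokenizers is enough. The checkpoint IDs below are common Hugging Face repos, assumed here as stand-ins for whatever text encoders your pipeline really ships with, and `long_prompt.txt` is a hypothetical input file.

```python
# Count how a multi-paragraph prompt measures up against the CLIP (77-token)
# and T5 (~256-512 token) windows discussed above. Checkpoint names are
# assumptions; substitute the encoders your pipeline actually uses.

from transformers import AutoTokenizer

clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

prompt = open("long_prompt.txt").read()  # your multi-paragraph prompt

n_clip = len(clip_tok(prompt)["input_ids"])
n_t5 = len(t5_tok(prompt)["input_ids"])
print(f"CLIP tokens: {n_clip} (window: 77), T5 tokens: {n_t5} (window: ~256-512)")
# Whatever lands past the window never reaches the text encoder in a
# standard pipeline, no matter how carefully you wrote it.
```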
> You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.
How would you ever tweak or debug it in that case? It doesn't strictly have to be English, but some kind of human-readable representation of the intermediate stages will be vital.
Dreambooth-style training is in no way out of date.
If you just want a face, InstantID/PuLID work, but it’s not going to be very varied. Doing actual training means you can get any perspective, lighting, style, expression, etc., and have the whole body be accurate.
How would that even work? A dog has physical features (legs, nose, eyes, ears, etc.) that it uses to interact with the world around it (ground, trees, grass, sounds, etc.). And each one of those features is built on physical structures that compose the senses (nervous system, optic nerves, etc.). There are layers upon layers of intricate complexity that took eons to develop, and a single photo cannot encapsulate that level of complexity and density of information. Even a 3D scan can't capture that much information. There is an implicit understanding of the physical world that helps us make sense of images. For example, a dog with all four paws standing on grass is within the bounds of possibility; a dog with six paws, two of which are on its head, is outside the bounds of possibility. An image generator doesn't understand that obvious delineation and just approximates likelihood.
A single photo doesn't have to capture all that complexity. It's carried by all those countless dog photos and videos in the training set of the model.
For those not in this space, Sora is essentially dead on arrival.
Sora performs worse than closed source Kling and Hailuo, but more importantly, it's already trumped by open source too.
Tencent is releasing a fully open source Hunyuan model [1] that is better than all of the SOTA closed source models. Lightricks has their open source LTX model and Genmo is pushing Mochi as open source. Black Forest Labs is working on video too.
Sora will fall into the same pit that Dall-E did. SaaS doesn't work for artists, and open source always trumps closed source models.
Artists want to fine tune their models, add them to ComfyUI workflows, and use ControlNets to precision control the outputs.
Images are now almost 100% Flux and Stable Diffusion, and video will soon be 100% Hunyuan and LTX.
Sora doesn't have much market apart from name recognition at this point. It's just another inflexible closed source model like Runway or Pika. Open source has caught up with state of the art and is pushing past it.
Their online version is all in Chinese (or at least some Chinese-looking script I don't understand)... and they recommend an 80GB GPU to run the thing, which costs ~€15-18k. Yikes, guess I won't be doing this at home anytime soon.
Something like a white paper with a mood board, color scheme, and concept art as the input might work. This could be sent into an LLM "expander" that increases the word count and specificity, then multiple review passes to nudge things in the right direction.
I expect this kind of thing is actually how it's going to work longer term, where AI is a copilot to a human artist. The human artist does storyboarding, sketching in backdrops and character poses in keyframes, and then the AI steps in and "paints" the details over top of it, perhaps based on some pre-training about what the characters and settings are so that there's consistency throughout a given work.
The real trick is that the AI needs to be able to participate in iteration cycles, where the human can say "okay this is all mostly good, but I've circled some areas that don't look quite right and described what needs to be different about them." As far as I've played with it, current AIs aren't very good at revisiting their own work— you're basically just tweaking the original inputs and otherwise starting over from scratch each time.
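The closest thing image tools have today to "I circled the areas that don't look right" is masked inpainting, where only the marked region is regenerated and the rest is left alone. Below is a sketch with the diffusers library; the checkpoint name and file paths are assumptions, and a video model would need the same idea extended across time.

```python
# Masked revision on a single frame: regenerate only the circled region,
# keep everything else untouched. Checkpoint and file names are assumptions.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("frame.png").convert("RGB")          # the mostly-good result
mask = Image.open("circled_regions.png").convert("L")   # white = redo this area

fixed = pipe(
    prompt="same scene, but the desk lamp is switched off",
    image=frame,
    mask_image=mask,
).images[0]
fixed.save("frame_fixed.png")
```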
We will shortly have much better tweaking tools which work not only on images and video but concepts like what aspects a character should exhibit. See for example the presentation from Shapeshift Labs.