Code golf task: implement the whole pipeline above in the minimum number of (currently existing) ComfyUI nodes.
Extra challenge: extend that to produce videos (e.g. via "live portrait" nodes/models), implementing a digital version of the magic paintings (and newspaper photos) from Harry Potter.
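For calibration, here's roughly what just the bare txt2img part looks like in ComfyUI's API (JSON) format, built and queued from Python. This is a minimal sketch, assuming a locally running ComfyUI instance on the default port 8188 with its API reachable, and a placeholder checkpoint filename - it's not the full pipeline above:

```python
import json
import urllib.request

# Minimal txt2img graph in ComfyUI's API format: 7 nodes total.
# Node IDs are arbitrary strings; ["4", 0] means "output slot 0 of node 4".
workflow = {
    "4": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "v1-5-pruned-emaonly.ckpt"}},  # placeholder model name
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "6": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["4", 1], "text": "a portrait, oil painting"}},
    "7": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["4", 1], "text": "blurry, low quality"}},
    "3": {"class_type": "KSampler",
          "inputs": {"model": ["4", 0], "positive": ["6", 0], "negative": ["7", 0],
                     "latent_image": ["5", 0], "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "8": {"class_type": "VAEDecode",
          "inputs": {"samples": ["3", 0], "vae": ["4", 2]}},
    "9": {"class_type": "SaveImage",
          "inputs": {"images": ["8", 0], "filename_prefix": "hp_portrait"}},
}

# Queue the graph on the local ComfyUI server (default port assumed).
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```

The animation step would bolt onto the end of that graph via third-party "live portrait" custom nodes; their exact node names depend on which extension you install, so I'm not guessing at them here.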
EDIT:
I'm not joking. This feels like a weekend challenge today; "live portraits" in particular run fast on a half-decent consumer GPU, like my RTX 4070 Ti (the old one, not the Super), and I believe (but haven't tested it yet) that even training a LoRA from a couple dozen images is reasonably doable locally too.
In general, my experience with Stable Diffusion and ComfyUI is that, for a fully local scenario on a normal person's hardware (i.e. not someone's totally normal PC that happens to have eight 30xx GPUs in a cluster), the capabilities and speed are light years ahead of the LLM space.
Just for comparison, yesterday I - like half the techies on the planet - got to run me some local DeepSeek-R1. The 1.58-bit dynamic quant topped out at 0.16 tokens per second. That's about the same time it takes an SD1.5 derivative to generate a decent-looking HD image for me. I could probably get them running in parallel in lock-step (SD on the GPU, compute-bound; DeepSeek on the CPU, RAM-bandwidth bound) and get one image per LLM token.
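Back-of-the-envelope on that lock-step claim (a sketch; the ~6 seconds per SD1.5 image is my assumption for this class of GPU, not a benchmark):

```python
# Rough lock-step arithmetic: is "one image per LLM token" plausible?
llm_tokens_per_s = 0.16                     # DeepSeek-R1 1.58-bit dynamic quant, CPU/RAM-bound
seconds_per_token = 1 / llm_tokens_per_s    # ~6.25 s per token

sd_seconds_per_image = 6.0                  # assumed SD1.5-derivative time per HD image, GPU-bound

images_per_token = seconds_per_token / sd_seconds_per_image
print(f"{seconds_per_token:.2f} s/token vs {sd_seconds_per_image} s/image "
      f"-> {images_per_token:.2f} images per token")
# ~1.04 images per LLM token, since the two workloads barely contend
# (one is GPU compute-bound, the other CPU/RAM-bandwidth bound).
```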
Can you explain more about ComfyUI? I heard it could work for running inference locally, but I couldn't get it running because I don't have an Nvidia GPU. Does it only work if you have one?
I only use it on Windows with an Nvidia GPU, but it should work on both Windows and Linux with CPU only or with Intel GPUs, and with AMD on Linux. Though skimming the README some more, I also see an Apple Silicon section and one called "DirectML (AMD Cards on Windows)", so maybe AMD+Windows works too.
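If you're not sure what your machine can actually use, a quick sanity check of the PyTorch backends ComfyUI sits on top of can help. A sketch, assuming the optional torch-directml package (separate install, Windows only) for the DirectML case:

```python
import torch

# CUDA: Nvidia cards, and also AMD cards on Linux via ROCm builds of PyTorch
# (ROCm piggybacks on the torch.cuda API).
print("CUDA/ROCm available:", torch.cuda.is_available())

# MPS: Apple Silicon GPUs on macOS.
print("MPS available:", getattr(torch.backends, "mps", None) is not None
      and torch.backends.mps.is_available())

# DirectML: AMD/Intel cards on Windows, via the optional torch-directml package.
try:
    import torch_directml  # not part of stock PyTorch; separate pip install
    print("DirectML device:", torch_directml.device())
except ImportError:
    print("DirectML: torch-directml not installed")

# Worst case, ComfyUI can also run purely on CPU (slow, but functional).
```

If none of the GPU backends show up, the CPU-only fallback still works; check the ComfyUI README for the exact install and launch options for your platform.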
As for use: you install ComfyUI from the link above, and then this: