Author here: We were blown away too. This project started with a question: was it even possible for the Stable Diffusion model architecture to output something with enough fidelity for the resulting audio to sound reasonable?
Excellent work! Singing would be amazing - karaoke can finally sound good :p
Have you released a tool for volumetric capture? I'm applying this to LED lighting fixture setup for TV/film/live shows, and 3D positioning is the last step to fully automated configuration.
My goal is real-time sync between the 3D model and the real world.