Why are researchers focusing on single-image reconstruction? It seems like a party trick that isn't very useful, and it's pretty much impossible to reconstruct the original object accurately from one view. It would be much more useful if many images from different angles could be used, somewhat like NeRF, but also predicting missing views with 2D diffusion. Adding more images would get the model closer to ground truth.
They likely focus on the one-image case because that's where it's easiest to show visual progress over the competition. If you have 50+ images, the tech from 10 years ago is already good enough to get Hollywood-quality 3D scans.
From what I understand, they use text keywords detected from the image as guidance, and they also apply a loss between the current diffusion state and the source image. In effect, this is Stable Diffusion for 3D shapes, but with clever conditioning. That means the algorithm should also work just fine if you have 2+ input images.
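Very roughly, my reading of it looks like the sketch below. This is not the paper's actual code: render_view and diffusion_guidance are hypothetical placeholders for a differentiable renderer and a frozen text-conditioned 2D diffusion prior, and the weighting is made up.

    import torch

    PRIOR_WEIGHT = 0.1  # made-up trade-off between the two terms

    def training_step(render_view, diffusion_guidance, scene_params,
                      input_image, text_embedding, optimizer):
        # (a) reconstruction term: the render from the input camera
        #     has to match the single source photo
        ref_render = render_view(scene_params, camera="reference")
        recon_loss = torch.nn.functional.mse_loss(ref_render, input_image)

        # (b) prior term: renders from random novel cameras are scored by the
        #     2D diffusion model, guided by the detected text keywords
        novel_render = render_view(scene_params, camera="random")
        prior_loss = diffusion_guidance(novel_render, text_embedding)

        loss = recon_loss + PRIOR_WEIGHT * prior_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

With more input photos you would just sum the reconstruction term over every (photo, camera) pair, which is why 2+ views should drop in without changing anything structural.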
True, but 50+ images only work if they're good quality.
What would really make a difference is if you could make a decent model from 30 images snapped by a drunk teenager on their $200 budget phone in bad lighting.
If money and resources are no object, then yeah, it's easy, but for most people they are.
Do you have any source for scene reconstruction being Hollywood-quality given 50+ images 10 years ago (assuming the images don't come from a controlled environment)?
I can find for example this from 2016: https://substance3d.adobe.com/magazine/go-scan-the-world-pho...
However, there's still a lot of manual work involved, and you don't easily get near-perfect PBR textures (which is what I'd consider Hollywood-quality).
I'd say the devil is in the details - if you want the best quality, you'd still need to control a lot of the environment.
The more assumptions you can bake into the parameters of the model, the more freedom you get in the actual measurement process (e.g. you need far less actual data).
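As a toy illustration (my own example, nothing to do with the paper): if you assume the object is a sphere, four parameters pin it down from a handful of noisy points, whereas a free-form surface would need dense coverage.

    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(1)
    true_center, true_radius = np.array([1.0, -2.0, 0.5]), 3.0

    # 8 noisy surface measurements of the sphere
    dirs = rng.normal(size=(8, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    points = true_center + true_radius * dirs + rng.normal(scale=0.01, size=(8, 3))

    def residuals(p):
        center, radius = p[:3], p[3]
        return np.linalg.norm(points - center, axis=1) - radius

    fit = least_squares(residuals, x0=np.array([0.0, 0.0, 0.0, 1.0]))
    print(fit.x)  # ~ [1, -2, 0.5, 3]: 4 parameters recovered from only 8 points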
Because there are people working on all sorts of different problems, and a solution in one area can apply better to some problems than others. Not to mention that solution approaches can often cross-pollinate.
Party trick? To me this looks like a game changer for developing 3D art for videogames.
It's not there yet for AAA games, but it will get there in time. In the meantime, as it is right now, it could save days of work for an indie game dev.
It's not impossible, as you can find a good artist who can do it. If a human can do it, then there's at least a chance that ML can do it as well, unless there's something about the problem space that I'm misunderstanding.
As long as the voxels are weighted by some sort of confidence value, generating a high-quality 3D model from a dozen or so goodish-quality models is a trivial fitting task.
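Something like this toy fusion, for example (my own sketch, assuming each candidate reconstruction comes as an occupancy grid plus a per-voxel confidence grid):

    import numpy as np

    def fuse_voxel_grids(occupancies, confidences, threshold=0.5):
        """Confidence-weighted average of several occupancy grids."""
        occ = np.stack(occupancies)    # (n_models, X, Y, Z)
        conf = np.stack(confidences)   # (n_models, X, Y, Z)
        weighted = (occ * conf).sum(axis=0) / np.clip(conf.sum(axis=0), 1e-8, None)
        return weighted > threshold    # fused boolean occupancy grid

    # e.g. fuse a dozen goodish 32^3 reconstructions of the same object
    rng = np.random.default_rng(0)
    occs = [rng.random((32, 32, 32)) for _ in range(12)]
    confs = [rng.random((32, 32, 32)) for _ in range(12)]
    fused = fuse_voxel_grids(occs, confs)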
The problem with multi-perspective approaches is that either you need lasers and extremely stable AND perfectly localized observation points to do it well, which makes it slow, expensive, and fragile for robotic applications, OR the image-to-model component is a grotesque anti-physics artifact hallucination that reinforces entropy as much as it does valid data, leading to shitty models.
Essentially, this single image to 3D step is the key to both forms.
Our minds build internal models from what we are seeing. What you perceive of the world is pulled from this model, which is "running" in the mind and is apparently 3D.
To effectively mirror this process in machines, we need models that can extrapolate a 3D environment and its objects from sensor data. If we can get an approximation of what something looks like from a different perspective, that's valuable for navigation and objective planning.
We could also theorize about the effects it will have on a machine's awareness and attention.
Scale. You have to be able to do it once to be able to do it at all.
This is just for creative/artistic use, because you obviously don't want software to "guess" on something used for engineering. In that case, you would use an appropriate 3D scan.
Also, 3D assets aren't generally created as scenes. A 3D-2D-3D pipeline also has wide application in AR, VR, standalone modeling, animation, etc.
Sometimes you want to construct a 3D model from a single image. E.g. Stable Diffusion is pretty good at generating images of original miniatures for gaming. It will be interesting to see how an SD to Magic123 to 3D print pipeline will work for this.
Yeah, it's totally useless... unless, of course, you have only one image of the thing you're trying to model, and getting a second image would take millions of dollars or another trip around the Earth from orbit.