Why are researchers focusing on single-image reconstruction? It seems like a party trick that isn't very useful, and it's pretty much impossible to reconstruct the original object accurately from one view. It would be much more useful if many images from different angles could be used, somewhat like NeRF, but also predicting missing views with 2D diffusion. Adding more images would get the model closer to ground truth.
They likely focus on the one-image case because that's where it's easiest to show visual progress over the competition. If you have 50+ images, the tech from 10 years ago is already good enough to get Hollywood-quality 3D scans.
From what I understand, they use text keywords detected from the image as guidance, and they also apply a loss between the current diffusion state and the source image. In effect, this is Stable Diffusion for 3D shapes, but with clever conditioning. That means the algorithm should also work just fine if you have 2+ input images.
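Very roughly, my reading of it looks like the sketch below. This is not the paper's actual code: render_view and diffusion_guidance are hypothetical placeholders for a differentiable renderer and a frozen text-conditioned 2D diffusion prior, and the weighting is made up.

    import torch

    PRIOR_WEIGHT = 0.1  # made-up trade-off between the two terms

    def training_step(render_view, diffusion_guidance, scene_params,
                      input_image, text_embedding, optimizer):
        # (a) reconstruction term: the render from the input camera
        #     has to match the single source photo
        ref_render = render_view(scene_params, camera="reference")
        recon_loss = torch.nn.functional.mse_loss(ref_render, input_image)

        # (b) prior term: renders from random novel cameras are scored by the
        #     2D diffusion model, guided by the detected text keywords
        novel_render = render_view(scene_params, camera="random")
        prior_loss = diffusion_guidance(novel_render, text_embedding)

        loss = recon_loss + PRIOR_WEIGHT * prior_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

With more input photos you would just sum the reconstruction term over every (photo, camera) pair, which is why 2+ views should drop in without changing anything structural.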
True, but 50+ images only work if they're good quality.
What would really make a difference is if you could make a decent model from 30 images snapped by a drunk teenager on their $200 budget phone in bad lighting.
If money and resources are no object, then yeah, it's easy, but for most people they are.
Do you have any source for scene reconstruction being Hollywood-quality given 50+ images 10 years ago (assuming the images don't come from a controlled environment)?
I can find for example this from 2016: https://substance3d.adobe.com/magazine/go-scan-the-world-pho...
However, there's still a lot of manual work involved, and you don't easily get near-perfect PBR textures (which is what I'd consider Hollywood-quality).
I'd say the devil is in the details - if you want the best quality, you'd still need to control a lot of the environment.
The more assumptions you can bake into the parameters of the model, the more freedom you get in the actual measurement process (e.g. you need far less actual data).
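As a toy illustration (my own example, nothing to do with the paper): if you assume the object is a sphere, four parameters pin it down from a handful of noisy points, whereas a free-form surface would need dense coverage.

    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(1)
    true_center, true_radius = np.array([1.0, -2.0, 0.5]), 3.0

    # 8 noisy surface measurements of the sphere
    dirs = rng.normal(size=(8, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    points = true_center + true_radius * dirs + rng.normal(scale=0.01, size=(8, 3))

    def residuals(p):
        center, radius = p[:3], p[3]
        return np.linalg.norm(points - center, axis=1) - radius

    fit = least_squares(residuals, x0=np.array([0.0, 0.0, 0.0, 1.0]))
    print(fit.x)  # ~ [1, -2, 0.5, 3]: 4 parameters recovered from only 8 points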
Because there are people working on all sorts of different problems, and a solution in one area can apply better to some problems than others. Not to mention that solution approaches can often cross-pollinate.
Party trick? To me this looks like a game changer for developing 3D art for videogames.
It's not there yet for AAA games, but it will get there in time. In the meantime, as it is right now, it could save days of work for an indie game dev.
It's not impossible, as you can find a good artist who can do it. If a human can do it, then there's at least a chance that ML can do it as well, unless there's something about the problem space that I'm misunderstanding.
As long as the voxels are weighted by some sort of confidence value, generating a high-quality 3D model from a dozen or so goodish-quality models is a trivial fitting task.
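Something like this toy fusion, for example (my own sketch, assuming each candidate reconstruction comes as an occupancy grid plus a per-voxel confidence grid):

    import numpy as np

    def fuse_voxel_grids(occupancies, confidences, threshold=0.5):
        """Confidence-weighted average of several occupancy grids."""
        occ = np.stack(occupancies)    # (n_models, X, Y, Z)
        conf = np.stack(confidences)   # (n_models, X, Y, Z)
        weighted = (occ * conf).sum(axis=0) / np.clip(conf.sum(axis=0), 1e-8, None)
        return weighted > threshold    # fused boolean occupancy grid

    # e.g. fuse a dozen goodish 32^3 reconstructions of the same object
    rng = np.random.default_rng(0)
    occs = [rng.random((32, 32, 32)) for _ in range(12)]
    confs = [rng.random((32, 32, 32)) for _ in range(12)]
    fused = fuse_voxel_grids(occs, confs)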
The problem with multi-perspective approaches is that either you need lasers and extremely stable AND perfectly localized observation points to do it well, which makes it slow, expensive, and fragile for robotic applications, OR the image-to-model component is a grotesque anti-physics artifact hallucination that reinforces entropy as much as it does valid data, leading to shitty models.
Essentially, this single image to 3D step is the key to both forms.
Our minds build internal models from what we are seeing. What you perceive of the world is pulled from this model, which is "running" in the mind and is apparently 3D.
To effectively mirror this process in machines, we need models that can extrapolate a 3D environment and its objects from sensor data. If we can get an approximation of what something looks like from a different perspective, that's valuable for navigation and objective planning.
We could also theorize about the effects it will have on a machine's awareness and attention.
Scale. You have to be able to do it once to be able to do it at all.
This is just for creative/artistic use, because you obviously don't want software to "guess" on something used for engineering. In that case, you would use an appropriate 3D scan.
Also, 3D assets aren't generally created as scenes. A 3D-2D-3D pipeline also has wide application in AR, VR, standalone modeling, animation, etc.
Sometimes you want to construct a 3D model from a single image. E.g. Stable Diffusion is pretty good at generating images of original miniatures for gaming. It will be interesting to see how an SD to Magic123 to 3D print pipeline will work for this.
Yeah, it's totally useless... unless, of course, you have only one image of the thing you're trying to model, and getting a second image would take millions of dollars or another trip around the Earth from orbit.