As they say in ML, representation first -- and this is one of the most natural and elegant ways to represent 3D scenes and subjective viewpoints. Great that it plugs into a rendering pipeline such that it's end-to-end differentiable.
This is the first leap toward true high-quality real-time ML-based rendering. I'm blown away.
Huh, what? It needs almost a million views, and takes 1-2 days to train on a GPU. I’m not sure where the “5 minutes” number comes from.
EDIT: I was referring to the last paragraph of section 5.3 (Implementation details), but maybe I’m misunderstanding how they use rays / sampled coordinates.
Very impressive visual quality. But it seems like they need a LOT of data and computation for each scene. So, it's still plausible that intelligently done photogrammetry will beat this approach in efficiency, but a bunch of important details need to be figured out to make that happen.
Excuse me, I meant 5MB. It takes 12 hours to train.
>All compared single scene methods take at least 12 hours to train per scene
But it seems to only need sparse images.
>Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation
Well, that took some effort just to work out what they actually did. How they actually did it, I have no idea. Impressive, however - a sort of fill-in-the-blanks for the bits that are missing. If our brains don't do this, one would be surprised.
And we are all supposed to become AI developers this decade?!
This blows my mind. This is probably a naive thought, but this technique looks like it could be combined with robotics to help a robot navigate through its environment.
I'd also like to see what it does when you give it multiple views of scenes in a video game: some captured directly from the game and some from pictures of the monitor.
They've only shown it working with static content - they'll need to do it with video (multiple synchronised cameras) and in real time for any robotics application.
It'd be interesting to see what would happen if they encoded an additional time parameter on each 'view' (input image pixel). Surely someone is already trying to extend this technique that way.
Currently, view coordinates relative to the volume are required, so you first have to solve the SLAM problem before you can optimize a network representation of a given volume.
No - optimizing the high-dimensional field takes 12 hours, so however fast you can render the field to an image, it's not going to matter for robotics, where computer vision needs to be done in real time.
If you give it a bunch of photos of a scene from different angles, this machine learning method lets you see angles that did not exist in the original set.
So can we take it to the next level and give it a bunch of ML-generated photos of a scene that doesn't exist (from model B) and let this model A create the 3D view?
Take it one step further and make model B create photos from some text description, similar to the one described in https://news.ycombinator.com/item?id=22640407 (although that one does 3D designs using voxels).
It's a very similar concept to photogrammetry, which recovers a 3D representation of an object given pictures taken from different angles.
In this work they take pictures of a scene from different angles and are able to train a neural network to render the scene from new angles that aren't in any source pictures.
The neural network takes in a location (x, y, z) and a viewing direction, and spits out an RGB colour and a density for that point as seen from that angle.
Using this network and traditional volume rendering techniques, they are able to render the whole scene from new viewpoints.
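Roughly, for each pixel you march a ray through the volume, query the network at sample points, and alpha-composite the results. A toy sketch of that rendering step, where query_field is just a stand-in for the trained network (not the paper's actual model):

    import numpy as np

    def query_field(position, direction):
        # Stand-in for the trained MLP: returns (rgb, sigma) at a 3D point
        # viewed from `direction`. A real NeRF would run a neural network here.
        sigma = np.exp(-np.sum(position**2))             # toy density: a blob at the origin
        rgb = np.clip(0.5 + 0.5 * direction, 0.0, 1.0)   # toy view-dependent colour
        return rgb, sigma

    def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
        # Sample points along the ray and volume-render them front to back.
        t = np.linspace(near, far, n_samples)
        delta = np.diff(t, append=t[-1] + (far - near) / n_samples)
        rgbs, sigmas = zip(*(query_field(origin + ti * direction, direction) for ti in t))
        rgbs, sigmas = np.array(rgbs), np.array(sigmas)
        alpha = 1.0 - np.exp(-sigmas * delta)                           # opacity of each segment
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]   # light surviving to each sample
        weights = trans * alpha
        return (weights[:, None] * rgbs).sum(axis=0)                    # composited pixel colour

    pixel = render_ray(origin=np.array([0.0, 0.0, -3.0]),
                       direction=np.array([0.0, 0.0, 1.0]))
    print(pixel)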
i.e. few source images vs. traditional photogrammetry.
...but basically yes. tl;dr: photogrammetry using neural networks. This one is better than other recent attempts at the same thing, but takes a really long time (2 days for this vs. 10 minutes for a voxel-based approach in one of their comparisons).
Why bother?
Mmm... there's some speculation that you might be able to represent a photorealistic scene / 3D object as a neural model instead of voxels or meshes.
That might be useful for some things. E.g., a voxel representation of semi-transparent fog, or of high-detail objects like hair, is impractically huge, and as a mesh they're very difficult to represent.
A number of things this seems to do well would be pretty much impossible with standard photogrammetry: trees with leaves, fine details like rigging on a ship, reflective surfaces, even refraction (!)
Of course the output is a new view, not a shaded mesh, but given that it appears to generate depth data, I think you should be able to generate a point cloud and mesh it. Getting the materials from the output might even be possible; I'm not very up to date on the state of material capture nowadays.
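For the depth / point-cloud idea: the volume renderer already produces per-sample compositing weights along each ray, and an expected termination depth can be read off as the weighted mean sample distance and then back-projected into a 3D point. A minimal sketch (the weight profile below is made up, just to show the mechanics):

    import numpy as np

    def expected_depth(weights, t):
        # Expected ray-termination distance: `weights` are the per-sample compositing
        # weights from the volume render, `t` the sample distances along the ray.
        return np.sum(weights * t) / max(np.sum(weights), 1e-8)

    def ray_point(origin, direction, depth):
        # Back-project one pixel's depth into a 3D point for a crude point cloud.
        return origin + depth * direction

    # Made-up weights concentrated around t ~= 2.0, purely illustrative.
    t = np.linspace(0.0, 4.0, 64)
    weights = np.exp(-((t - 2.0) ** 2) / 0.02)
    weights /= weights.sum()
    d = expected_depth(weights, t)
    print(d, ray_point(np.zeros(3), np.array([0.0, 0.0, 1.0]), d))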
> Significantly, the input is a sparse dataset.
i.e. few source images vs. traditional photogrammetry.
This uses dozens or hundreds of images, which isn't usually necessary for traditional photogrammetry that maps photos to hard surfaces with textures.
I think what you noted about volumes is the significant part. Complex objects with fine detail and view dependent reflections are the part that shines here over photogrammetry, but it does take a lot of images. I didn't see anything in the paper that dealt with transparency.
They're modeling a scene mathematically as a "radiance field" - a function that takes a view position and direction as inputs and returns the light color that hits that position from the direction it's facing. They use some input images to train a neural network, in order to find an optimal radiance field function which explains the input images. Once they have that function, they can construct images from new angles by evaluating the function over the (position, direction) inputs needed by the pixels in the new image.
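To make that last step concrete, here's a rough sketch of generating the (position, direction) inputs for each pixel of a novel view from a pinhole camera pose. radiance_field is only a placeholder for the optimized function (the real thing is a trained network queried many times along each ray), and the pose, focal length, and image size are made up:

    import numpy as np

    def radiance_field(position, direction):
        # Placeholder for the optimized radiance field; a real NeRF would be a trained MLP.
        return np.clip(0.5 + 0.5 * np.sin(position + direction), 0.0, 1.0)

    def pixel_rays(height, width, focal, cam_to_world):
        # One ray (origin, direction) per pixel of the novel view, pinhole camera model.
        j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
        dirs_cam = np.stack([(i - width / 2) / focal,
                             -(j - height / 2) / focal,
                             -np.ones_like(i, dtype=float)], axis=-1)
        dirs_world = dirs_cam @ cam_to_world[:3, :3].T          # rotate into world space
        origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
        return origins, dirs_world

    # Novel camera pose (identity rotation, pulled back along +z) -- purely illustrative.
    pose = np.eye(4)
    pose[2, 3] = 3.0
    origins, directions = pixel_rays(height=4, width=4, focal=4.0, cam_to_world=pose)

    # For a real render you'd march each ray through the volume; here we just query once per pixel.
    image = np.array([[radiance_field(o, d) for o, d in zip(orow, drow)]
                      for orow, drow in zip(origins, directions)])
    print(image.shape)  # (4, 4, 3)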
Intel already does this with their "True View" setup. They also had a tech demo CES where they synthesized camera positions for movie sets. https://www.youtube.com/watch?v=9qd276AJg-o
Does anyone know how they do the “virtual object insertion” demonstrated in the paper summary video? Can that be somehow done on the network itself, or is that a diagnostic for scene accuracy by performing SFM on network output?
You could do that, but I think it's simpler to just introduce additional objects during the raytracing process that generates the images. That would produce accurate results even with semi-transparent objects, unlike compositing with a depth buffer.
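A rough sketch of that idea, assuming you can query the learned scene for per-point (colour, density): mix in a hypothetical virtual object's density at every sample during the ray march, so occlusion and semi-transparency fall out of the same compositing step. scene_sample and the red sphere below are stand-ins, not anything from the paper:

    import numpy as np

    def scene_sample(point):
        # Placeholder for the learned field's (rgb, sigma) at a point.
        return np.array([0.8, 0.7, 0.6]), np.exp(-np.sum(point**2))

    def inserted_object_sample(point, center=np.array([0.0, 0.0, 1.0]), radius=0.3):
        # Hypothetical virtual object: a solid red sphere with high density inside.
        inside = np.sum((point - center) ** 2) < radius**2
        return np.array([1.0, 0.0, 0.0]), (50.0 if inside else 0.0)

    def composite(origin, direction, near=0.0, far=4.0, n_samples=128):
        # March the ray, mixing the scene's and the inserted object's densities at every
        # sample, so the object is occluded (and occludes) correctly during compositing.
        t = np.linspace(near, far, n_samples)
        delta = (far - near) / n_samples
        color = np.zeros(3)
        transmittance = 1.0
        for ti in t:
            p = origin + ti * direction
            rgb_s, sig_s = scene_sample(p)
            rgb_o, sig_o = inserted_object_sample(p)
            sigma = sig_s + sig_o
            rgb = (sig_s * rgb_s + sig_o * rgb_o) / max(sigma, 1e-8)  # density-weighted mix
            alpha = 1.0 - np.exp(-sigma * delta)
            color += transmittance * alpha * rgb
            transmittance *= 1.0 - alpha
        return color

    print(composite(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0])))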
I would like to see a "neural enhance" pass on an already rendered 3D scene, applying the changes that would make it more realistic, given a depth map and other information as inputs to the neural network.