As they say in ML, representation first -- and this is one of the most natural and elegant ways to represent 3D scenes and subjective viewpoints. Great that it plugs into a rendering pipeline such that it's end-to-end differentiable.
This is the first leap toward true high-quality real-time ML-based rendering. I'm blown away.
Huh, what? It needs almost a million views, and takes 1-2 days to train on a GPU. I’m not sure where the “5 minutes” number comes from.
EDIT: I was referring to the last paragraph of section 5.3 (Implementation details), but maybe I’m misunderstanding how they use rays / sampled coordinates.
Very impressive visual quality. But it seems like they need a LOT of data and computation for each scene. So, it's still plausible that intelligently done photogrammetry will beat this approach in efficiency, but a bunch of important details need to be figured out to make that happen.
Excuse me, I meant 5MB. It takes 12 hours to train.
>All compared single scene methods take at least 12 hours to train per scene
But it seems to only need sparse images.
>Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation
Well, that took some effort just to work out what they actually did. How they actually did it, I have no idea. Impressive, however - a sort of fill-in-the-blanks for the bits that are missing. If our brains don't do this, one would be surprised.
And we are all supposed to become AI developers this decade?!
This blows my mind. This is probably a naive thought, but this technique looks like it could be combined with robotics to help a robot navigate through its environment.
I'd also like to see what it does when you give it multiple views of scenes in a video game: some captured directly from the game and some from pictures of the monitor.
They've only shown it working with static content - they'll need to do it with video (multiple synchronised cameras) and in real time for any robotics application.
It'd be interesting to see what would happen if they encoded an additional time parameter on each 'view' (input image pixel). Surely someone is already trying to extend this technique that way.
Currently, view coordinates relative to the volume are required, so you first have to solve the SLAM problem before you can optimize a network representation of a given volume.
No - optimizing the high-dimensional field takes 12 hours, so however fast you can render the field to an image, it's not going to matter for robotics, where computer vision needs to be done in real time.
If you give it a bunch of photos of a scene from different angles, this machine learning method lets you see angles that did not exist in the original set.
So can we take it to the next level and give it a bunch of ML-generated photos of a scene that doesn't exist (from model B) and let this model A create the 3D view?
Take it one step further and make model B create photos from some text description, similar to the one described in https://news.ycombinator.com/item?id=22640407 (although that one does 3D designs using voxels).
It's a very similar concept to photogrammetry, which recovers a 3D representation of an object given pictures taken from different angles.
In this work they take pictures of a scene from different angles and are able to train a neural network to render the scene from new angles that aren't in any source pictures.
The neural network takes in a location (x, y, z) and a viewing direction, and spits out an RGB colour and a density for that point as seen from that angle.
Using this network and traditional volume rendering techniques, they are able to render the whole scene from new viewpoints.
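Roughly, for each pixel you march a ray through the volume, query the network at sample points, and alpha-composite the results. A toy sketch of that rendering step, where query_field is just a stand-in for the trained network (not the paper's actual model):

    import numpy as np

    def query_field(position, direction):
        # Stand-in for the trained MLP: returns (rgb, sigma) at a 3D point
        # viewed from `direction`. A real NeRF would run a neural network here.
        sigma = np.exp(-np.sum(position**2))             # toy density: a blob at the origin
        rgb = np.clip(0.5 + 0.5 * direction, 0.0, 1.0)   # toy view-dependent colour
        return rgb, sigma

    def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
        # Sample points along the ray and volume-render them front to back.
        t = np.linspace(near, far, n_samples)
        delta = np.diff(t, append=t[-1] + (far - near) / n_samples)
        rgbs, sigmas = zip(*(query_field(origin + ti * direction, direction) for ti in t))
        rgbs, sigmas = np.array(rgbs), np.array(sigmas)
        alpha = 1.0 - np.exp(-sigmas * delta)                           # opacity of each segment
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]   # light surviving to each sample
        weights = trans * alpha
        return (weights[:, None] * rgbs).sum(axis=0)                    # composited pixel colour

    pixel = render_ray(origin=np.array([0.0, 0.0, -3.0]),
                       direction=np.array([0.0, 0.0, 1.0]))
    print(pixel)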
i.e. few source images vs. traditional photogrammetry.
...but basically yes. tl;dr: photogrammetry using neural networks. This one is better than other recent attempts at the same thing, but takes a really long time (2 days for this vs. 10 minutes for a voxel-based approach in one of their comparisons).
Why bother?
Mmm... there's some speculation that you might be able to represent a photorealistic scene / 3D object as a neural model instead of voxels or meshes.
That might be useful for some things. E.g., a voxel representation of semi-transparent fog, or of high-detail objects like hair, is impractically huge, and as a mesh they're very difficult to represent.
A number of things this seems to do well would be pretty much impossible with standard photogrammetry: trees with leaves, fine details like rigging on a ship, reflective surfaces, even refraction (!)
Of course the output is a new view, not a shaded mesh, but given that it appears to generate depth data, I think you should be able to generate a point cloud and mesh it. Getting the materials from the output might even be possible; I'm not very up to date on the state of material capture nowadays.
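For the depth / point-cloud idea: the volume renderer already produces per-sample compositing weights along each ray, and an expected termination depth can be read off as the weighted mean sample distance and then back-projected into a 3D point. A minimal sketch (the weight profile below is made up, just to show the mechanics):

    import numpy as np

    def expected_depth(weights, t):
        # Expected ray-termination distance: `weights` are the per-sample compositing
        # weights from the volume render, `t` the sample distances along the ray.
        return np.sum(weights * t) / max(np.sum(weights), 1e-8)

    def ray_point(origin, direction, depth):
        # Back-project one pixel's depth into a 3D point for a crude point cloud.
        return origin + depth * direction

    # Made-up weights concentrated around t ~= 2.0, purely illustrative.
    t = np.linspace(0.0, 4.0, 64)
    weights = np.exp(-((t - 2.0) ** 2) / 0.02)
    weights /= weights.sum()
    d = expected_depth(weights, t)
    print(d, ray_point(np.zeros(3), np.array([0.0, 0.0, 1.0]), d))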
> Significantly, the input is a sparse dataset.
i.e. few source images vs. traditional photogrammetry.
This uses dozens or hundreds of images, which isn't usually necessary for traditional photogrammetry that maps photos to hard surfaces with textures.
I think what you noted about volumes is the significant part. Complex objects with fine detail and view dependent reflections are the part that shines here over photogrammetry, but it does take a lot of images. I didn't see anything in the paper that dealt with transparency.
They're modeling a scene mathematically as a "radiance field" - a function that takes a view position and direction as inputs and returns the light color that hits that position from the direction it's facing. They use some input images to train a neural network, in order to find an optimal radiance field function which explains the input images. Once they have that function, they can construct images from new angles by evaluating the function over the (position, direction) inputs needed by the pixels in the new image.
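To make that last step concrete, here's a rough sketch of generating the (position, direction) inputs for each pixel of a novel view from a pinhole camera pose. radiance_field is only a placeholder for the optimized function (the real thing is a trained network queried many times along each ray), and the pose, focal length, and image size are made up:

    import numpy as np

    def radiance_field(position, direction):
        # Placeholder for the optimized radiance field; a real NeRF would be a trained MLP.
        return np.clip(0.5 + 0.5 * np.sin(position + direction), 0.0, 1.0)

    def pixel_rays(height, width, focal, cam_to_world):
        # One ray (origin, direction) per pixel of the novel view, pinhole camera model.
        j, i = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
        dirs_cam = np.stack([(i - width / 2) / focal,
                             -(j - height / 2) / focal,
                             -np.ones_like(i, dtype=float)], axis=-1)
        dirs_world = dirs_cam @ cam_to_world[:3, :3].T          # rotate into world space
        origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
        return origins, dirs_world

    # Novel camera pose (identity rotation, pulled back along +z) -- purely illustrative.
    pose = np.eye(4)
    pose[2, 3] = 3.0
    origins, directions = pixel_rays(height=4, width=4, focal=4.0, cam_to_world=pose)

    # For a real render you'd march each ray through the volume; here we just query once per pixel.
    image = np.array([[radiance_field(o, d) for o, d in zip(orow, drow)]
                      for orow, drow in zip(origins, directions)])
    print(image.shape)  # (4, 4, 3)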
Intel already does this with their "True View" setup. They also had a tech demo CES where they synthesized camera positions for movie sets. https://www.youtube.com/watch?v=9qd276AJg-o
Does anyone know how they do the “virtual object insertion” demonstrated in the paper summary video? Can that be somehow done on the network itself, or is that a diagnostic for scene accuracy by performing SFM on network output?
You could do that, but I think it's simpler to just introduce additional objects during the raytracing process that generates the images. That would produce accurate results even with semi-transparent objects, unlike compositing with a depth buffer.
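A rough sketch of that idea, assuming you can query the learned scene for per-point (colour, density): mix in a hypothetical virtual object's density at every sample during the ray march, so occlusion and semi-transparency fall out of the same compositing step. scene_sample and the red sphere below are stand-ins, not anything from the paper:

    import numpy as np

    def scene_sample(point):
        # Placeholder for the learned field's (rgb, sigma) at a point.
        return np.array([0.8, 0.7, 0.6]), np.exp(-np.sum(point**2))

    def inserted_object_sample(point, center=np.array([0.0, 0.0, 1.0]), radius=0.3):
        # Hypothetical virtual object: a solid red sphere with high density inside.
        inside = np.sum((point - center) ** 2) < radius**2
        return np.array([1.0, 0.0, 0.0]), (50.0 if inside else 0.0)

    def composite(origin, direction, near=0.0, far=4.0, n_samples=128):
        # March the ray, mixing the scene's and the inserted object's densities at every
        # sample, so the object is occluded (and occludes) correctly during compositing.
        t = np.linspace(near, far, n_samples)
        delta = (far - near) / n_samples
        color = np.zeros(3)
        transmittance = 1.0
        for ti in t:
            p = origin + ti * direction
            rgb_s, sig_s = scene_sample(p)
            rgb_o, sig_o = inserted_object_sample(p)
            sigma = sig_s + sig_o
            rgb = (sig_s * rgb_s + sig_o * rgb_o) / max(sigma, 1e-8)  # density-weighted mix
            alpha = 1.0 - np.exp(-sigma * delta)
            color += transmittance * alpha * rgb
            transmittance *= 1.0 - alpha
        return color

    print(composite(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0])))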
I would like to see a "neural enhance" pass on an already rendered 3D scene, applying the changes that would make it more realistic, given a depth map and other information as inputs to the neural network.