NeRF: Representing scenes as neural radiance fields for view synthesis (matthewtancik.com)
237 points by dfield on March 20, 2020 | 40 comments


This is absolutely stunning.

As they say in ML, representation first -- and this is one of the most natural and elegant ways to represent 3D scenes and subjective viewpoints. Great that it plugs into a rendering pipeline such that it's end-to-end differentiable.

This is the first leap toward true high-quality real-time ML-based rendering. I'm blown away.


Very cool. Reminds me of when I played with Google's Seurat.

The paper says it's 5 MB, takes 12 hours to train the NN, and then 30 seconds to render novel views of the scene on an Nvidia V100.

Sadly not something you can use in real time but still very cool.

Edit: 12 hours and a 5 MB NN, not 5 minutes.


Huh, what? It needs almost a million views, and takes 1-2 days to train on a GPU. I’m not sure where the “5 minutes” number comes from.

EDIT: I was referring to the last paragraph of section 5.3 (Implementation details), but maybe I’m misunderstanding how they use rays / sampled coordinates.

Very impressive visual quality. But it seems like they need a LOT of data and computation for each scene. So, it's still plausible that intelligently done photogrammetry will beat this approach in efficiency, but a bunch of important details need to be figured out to make that happen.


Excuse me, I meant 5 MB. It takes 12 hours to train.

>All compared single scene methods take at least 12 hours to train per scene

But it seems to only need sparse images.

>Here, we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation


> It needs almost a million views

Not sure what you mean by "views". The comparisons in the paper use at most 100 input images per scene.


A pixel is one view for their model if I understand correctly, so one hundred 100x100 images would be a million views.
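To make that concrete, here's a back-of-the-envelope sketch (Python, using the numbers from the comment above; the framing as rays is how the paper trains, but the exact code is my sketch, not theirs). Every pixel of every input image becomes one training sample, i.e. one ray:

    # Hypothetical numbers: 100 input images, each 100x100 pixels.
    n_images, height, width = 100, 100, 100

    # NeRF is optimized on rays, not whole images: each pixel contributes one
    # (ray origin, ray direction, observed RGB) sample to the training set.
    n_training_rays = n_images * height * width
    print(n_training_rays)  # 1000000 -- the "million views" in this sense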


Well, that took some effort just to work out what they actually did. How they actually did it I have no idea. Impressive, however - a sort of fill-in-the-blanks for the bits that are missing. It would be surprising if our brains don't do something like this.

And we are all supposed to become AI developers this decade?!

Come back Visual Basic all is forgiven :-)


This blows my mind. This is probably a naive thought: this technique looks like it could be combined with robotics to help a robot navigate through its environment.

I'd also like to see what it does when you give it multiple views of scenes in a video game - some taken directly from the game and some from pictures of the monitor.


They've only shown it working with static content - they'll need to do it with video (multiple synchronised cameras) and in real time for any robotics application.


It'd be interesting to see what happened if they encoded an additional time parameter on each 'view' (input image pixel). Surely someone is already trying to extend this technique that way.
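As a rough sketch of that idea (hypothetical, not from the paper): the input to the MLP would just grow by one coordinate, so the field becomes something like F(x, y, z, t, viewing direction) -> (color, density). In PyTorch, roughly:

    import torch
    import torch.nn as nn

    class TimeConditionedField(nn.Module):
        """Hypothetical NeRF-style MLP with an extra time input t. The real
        NeRF also applies positional encoding to the inputs and feeds the
        viewing direction into a later layer; this is a simplification."""
        def __init__(self, hidden=256):
            super().__init__()
            # Input: 3 position coords + 1 time coord + 3 view-direction coords.
            self.net = nn.Sequential(
                nn.Linear(3 + 1 + 3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 4),  # output: RGB color + volume density
            )

        def forward(self, xyz, t, view_dir):
            # xyz: (N, 3), t: (N, 1), view_dir: (N, 3)
            return self.net(torch.cat([xyz, t, view_dir], dim=-1))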


Currently, view coordinates relative to the volume are required, so you first have to solve the SLAM problem before you can optimize a network representation of a given volume.


It takes 12 hours on a high end GPU to make one frame.


No, as appendix A of the paper states, each frame takes about 30 seconds to render.


No, optimizing the high-dimensional field takes 12 hours, and the time to render the field to an image is beside the point for robotics, where computer vision needs to be done in real time.


This is bad-ass, partly because it's so elegant.


Could someone ELI5, please?


If you give it a bunch of photos of a scene from different angles, this machine learning method lets you see angles that did not exist in the original set.

Better results than other methods so far.


Fist bump for actually answering as ELI5 (unlike the other responses).


So can we take it to the next level and give it a bunch of ML-generated photos of a scene that doesn't exist (from model B) and let this model A create the 3D view?

Take it one step further and make model B create photos from some text description, similar to the one described in https://news.ycombinator.com/item?id=22640407 (although that one does 3D designs using voxels).


It's a very similar concept to photogrammetry, which is recovering a 3D representation of an object given pictures taken from different angles.

In this work they take pictures of a scene from different angles and are able to train a neural network to render the scene from new angles that aren't in any source pictures.

The neural network takes in a location (x, y, z) and a viewing direction, and spits out a color and a volume density for that point in space as seen from that angle.

Using this network and traditional volume rendering techniques, they are able to render the whole scene from any new viewpoint.
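For anyone curious what "traditional rendering techniques" means here: it's numerical volume rendering along each camera ray. A heavily simplified sketch in PyTorch (the field(points, dirs) interface and the near/far bounds are my assumptions, not the paper's code, and the real method adds positional encoding and hierarchical sampling):

    import torch

    def render_ray(field, origin, direction, near=2.0, far=6.0, n_samples=64):
        """Estimate one pixel's color by integrating the field along one ray."""
        t = torch.linspace(near, far, n_samples)      # depths along the ray
        points = origin + t[:, None] * direction      # (n_samples, 3) sample positions
        dirs = direction.expand(n_samples, 3)         # same view direction at every sample
        rgb, sigma = field(points, dirs)              # colors (n, 3), densities (n,)
        delta = t[1] - t[0]                           # spacing between samples
        alpha = 1.0 - torch.exp(-sigma * delta)       # opacity of each segment
        # Transmittance: how much light survives to reach each sample unoccluded.
        trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
        trans = torch.cat([torch.ones(1), trans[:-1]])
        weights = alpha * trans
        return (weights[:, None] * rgb).sum(dim=0)    # final RGB for this pixel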


Significantly, the input is a sparse dataset.

i.e. few source images vs. traditional photogrammetry.

...but basically yes; tl;dr: photogrammetry using neural networks. This one is better than other recent attempts at the same thing, but takes a really long time (2 days for this vs. 10 minutes for a voxel-based approach in one of their comparisons).

Why bother?

Mmm... there's some speculation that you might be able to represent a photorealistic scene / 3D object as a neural model instead of voxels or meshes.

That might be useful for some things. E.g., a voxel representation of semi-transparent fog or high-detail objects like hair is impractically huge, and such things are very difficult to represent as a mesh.


A number of things this seems to do well would be pretty much impossible with standard photogrammetry: trees with leaves, fine details like rigging on a ship, reflective surfaces, even refraction (!)

Of course the output is a new view, not a shaded mesh, but given it appears to generate depth data, I think you should be able to generate a point cloud and mesh it. Getting the materials from the output might even be possible; I'm not very up to date on the state of material capture nowadays.


> Significantly, the input is a sparse dataset. ie. Few source images vs. traditional photogrammetry.

This uses dozens or hundreds of images, which isn't usually necessary for traditional photogrammetry that maps photos to hard surfaces with textures.

I think what you noted about volumes is the significant part. Complex objects with fine detail and view-dependent reflections are where this shines over photogrammetry, but it does take a lot of images. I didn't see anything in the paper that dealt with transparency.


> Why bother?

There might be 10x speedups to be gained with a tweaked model.


They're modeling a scene mathematically as a "radiance field" - a function that takes a view position and direction as inputs and returns the light color that hits that position from the direction it's facing. They use some input images to train a neural network, in order to find an optimal radiance field function which explains the input images. Once they have that function, they can construct images from new angles by evaluating the function over the (position, direction) inputs needed by the pixels in the new image.
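In code, "train a neural network to find an optimal radiance field" is just gradient descent on a photometric loss. A minimal sketch (names are mine; render_fn is an assumed differentiable renderer that turns a batch of rays into predicted pixel colors, e.g. something like the volume-rendering snippet earlier in the thread):

    import torch

    def fit_radiance_field(field, render_fn, rays, target_rgb, iters=1000, lr=5e-4):
        """Optimize the field so that rendering the training rays reproduces
        the pixel colors observed in the input images."""
        opt = torch.optim.Adam(field.parameters(), lr=lr)
        for _ in range(iters):
            pred_rgb = render_fn(field, rays)             # differentiable rendering
            loss = ((pred_rgb - target_rgb) ** 2).mean()  # photometric L2 loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        return field

Once the field is fit, a novel view is produced by building one ray per pixel of the virtual camera and rendering each of them the same way.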


>Could someone ELI5, please?

Smart, high-dimensional interpolator.


Wow great


This would be great for instant replays


Intel already does this with their "True View" setup. They also had a tech demo at CES where they synthesized camera positions for movie sets. https://www.youtube.com/watch?v=9qd276AJg-o


The neural networks representing these scenes take up just 5 MB... Less than the input images used to train them. Wow. Mind blowing!


Keep in mind, though, that this representation is a form of lossy compression, while the input images may not be.


If you're only looking for one novel view, can it get away with fewer input views, just the ones close to the novel one?


Does anyone know how they do the “virtual object insertion” demonstrated in the paper summary video? Can that be somehow done on the network itself, or is that a diagnostic for scene accuracy by performing SFM on network output?


I'm pretty sure they're rendering a depth channel and compositing it in.


You could do that, but I think it's simpler to just introduce additional objects during the raytracing process that generates the images. That would produce accurate results even with semitransparent objects, unlike compositing with a depth buffer.
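For contrast, depth-buffer compositing is just a per-pixel nearest-surface test (a sketch with hypothetical arrays), which is exactly why it struggles with semitransparent content: a single depth value per pixel can't capture a density profile along the ray.

    import numpy as np

    def composite_with_depth(scene_rgb, scene_depth, obj_rgb, obj_depth):
        """Naive depth-buffer compositing: per pixel, keep whichever surface
        is closer. scene_* would come from the rendered NeRF output, obj_*
        from the inserted object's renderer."""
        closer = obj_depth < scene_depth                         # (H, W) boolean mask
        return np.where(closer[..., None], obj_rgb, scene_rgb)   # (H, W, 3) composite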


I would like to see a neural network "enhance" an already rendered 3D scene with the changes that would make it more realistic, given a depth map and other information as input to the network.


How would it be made more realistic?


This is REALLY cool, but kinda makes sense as well. Neural networks are very good at interpolation, given the right prior.


This is the kind of shit I come here for. Awesome post! Thanks for sharing!


This is like the Blade Runner in-game tool.



