Software converts 360 video into 3D model for VR (360rumors.com)
110 points by sbarre on Nov 20, 2017 | 32 comments



Nowadays, it is possible to create a 3D model of a space by taking hundreds of photographs of it and analyzing them with photogrammetry and computer vision methods. This is a laborious process, and is especially time-consuming and resource-intensive for large scenes.

I was recently trying to understand the state of the art here, and was surprised to learn that this is indeed the case: most 3D reconstruction fares better with a smaller number of high-resolution still photos than with a larger number of lower-resolution "stills" extracted from video. I'm still somewhat confused about why that is.

My limited understanding is that the main advantage of working with stills is that they more commonly contain location tagging, while frames from video do not. Compared to the baseline difficulty of building an accurate 3D model from lossy 2D sources, figuring out the camera's trajectory doesn't seem too hard. Once one has that, wouldn't the video be just as easy to work with?

Trying to figure out this discrepancy, I got the impression that many researchers may actually be trying to solve the harder problem of computing the camera trajectory in real time, as might be needed for a self-driving vehicle: http://webdiis.unizar.es/~raulmur/orbslam/. That does indeed seem harder, but it still leaves me wondering why a multipass approach wouldn't be feasible.

What am I missing? Why can't one do a first pass to calculate camera trajectory, possibly a second pass to combine temporally adjacent frames for greater resolution, then create a better model from the resulting wealth of data? Alternatively stated, why does the quality of the 3D model seem to depend more on the resolution and quality of the input 2D images rather than on the number of these images?
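
To make the question concrete, here's roughly what I imagine that first pass looking like. This is only a minimal sketch, assuming a video file ("walkthrough.mp4") and rough camera intrinsics K that I've made up for illustration; it just chains relative poses between consecutive frames with OpenCV, with no bundle adjustment, loop closure, or scale recovery:

    # First-pass camera trajectory from consecutive video frames (sketch).
    import cv2
    import numpy as np

    K = np.array([[1000.0, 0.0, 960.0],   # assumed focal length / principal point
                  [0.0, 1000.0, 540.0],
                  [0.0, 0.0, 1.0]])

    orb = cv2.ORB_create(2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    cap = cv2.VideoCapture("walkthrough.mp4")  # hypothetical input video
    poses = [np.eye(4)]                        # camera-to-world, first frame at origin
    prev_kp, prev_des = None, None

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        kp, des = orb.detectAndCompute(gray, None)
        if prev_des is not None and des is not None:
            matches = matcher.match(prev_des, des)
            if len(matches) >= 8:
                p0 = np.float32([prev_kp[m.queryIdx].pt for m in matches])
                p1 = np.float32([kp[m.trainIdx].pt for m in matches])
                E, _ = cv2.findEssentialMat(p0, p1, K, method=cv2.RANSAC)
                if E is not None and E.shape == (3, 3):
                    _, R, t, _ = cv2.recoverPose(E, p0, p1, K)
                    rel = np.eye(4)
                    rel[:3, :3], rel[:3, 3] = R, t.ravel()  # translation only up to scale
                    poses.append(poses[-1] @ np.linalg.inv(rel))
        prev_kp, prev_des = kp, des

    print(f"estimated {len(poses)} camera poses")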


What you are talking about is called monocular SLAM. The process of calculating the trajectory inherently maps the world at the same time, by estimating feature positions from inter-frame distances.

It really isn't a problem to use video frames, but they have different types of error/noise than high-res stills. For example, the video camera is moving and it has a rolling shutter, so every pixel (or pixel row) is from a different orientation. If this isn't accounted for it introduces noise, but accounting for it takes more/better processing. On the other hand, video frames have a consistent inter-frame delay and good coherence, so it is easier to match features frame-by-frame than from a heap of individual photos. If you have good (hot GPS) positions, that makes both situations easier.

Model visual quality is improved by higher-resolution pictures in a fairly straightforward way... But you can get the same data by physically moving a lower-res camera closer to all the surfaces. It is just harder to scan quickly and maintain tracking if you have to do that. Some people even put lenses in front of their depth cameras to zoom in tighter, for capturing finer detail.
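
To illustrate the coherence point: tracking features between adjacent video frames is cheap precisely because the frames are nearly identical, which is not an option for a heap of unordered photos (those need expensive pairwise descriptor matching). A minimal sketch, assuming two consecutive grayscale frames saved under hypothetical file names:

    # Track corners from one video frame into the next with Lucas-Kanade flow.
    import cv2

    prev = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical files
    curr = cv2.imread("frame_0002.png", cv2.IMREAD_GRAYSCALE)

    # Pick corners in the previous frame, then track them a few pixels forward.
    p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)
    p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                               winSize=(21, 21), maxLevel=3)

    print(f"tracked {int(status.sum())} of {len(p0)} features between adjacent frames")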


I've been wondering about this as well. Why are they using so few images? To me that is like reverse blinking: Instead of keeping your eyes open, and blinking for a short amount of time every now and then, you're walking around with your eyes closed, and open them only for a short time.

I believe the main problem is knowing which data to trust and which data to discard. Reflections (water, shiny surfaces), moving objects (leaves in the wind), over-exposure, and lens flares are already pretty hard to deal with. But with low resolution data, you make it even more difficult because there's more data to discard for being too inaccurate.


Actually, that's not how our eyes work at all. In fact, the "reverse blinking" you describe is much closer to the truth. The retina has a very small region of high acuity that darts around your visual field, essentially taking snapshots that your brain stitches together. A lot of what we perceive as movement is actually our brains predicting trajectories.


No, a lot of what we perceive as movement is our eyes seeing movement. Motion is perceived across the whole of our field of vision, not just by the fovea (the "very small region of high acuity" you mentioned). Try not to spread misinformation.


We've had pretty solid location-from-motion for almost 10 years now. PTAM ran in real time on a laptop in 2008: http://www.robots.ox.ac.uk/~gk/PTAM/


> combine temporally adjacent frames for greater resolution

Even just combining multiple still photos to produce a higher resolution photo is itself a hard problem in the field of super-resolution imaging. Google has a consumer product for it (mostly for removing glare): https://www.youtube.com/watch?v=MEyDt0DNjWU&feature=youtu.be....
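
For the curious, here is a crude sketch of combining temporally adjacent frames, assuming a short burst of nearly identical frames saved under hypothetical file names. It only does alignment plus averaging (stacking, i.e. noise reduction), which is far simpler than true super-resolution, but it shows the basic idea; depending on the OpenCV version, findTransformECC may also want an input mask and Gaussian filter size argument.

    # Align a burst of frames to the first one with ECC, then average them.
    import cv2
    import numpy as np

    files = ["burst_00.png", "burst_01.png", "burst_02.png", "burst_03.png"]
    ref = cv2.imread(files[0], cv2.IMREAD_GRAYSCALE).astype(np.float32)
    acc = ref.copy()

    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    for name in files[1:]:
        img = cv2.imread(name, cv2.IMREAD_GRAYSCALE).astype(np.float32)
        warp = np.eye(2, 3, dtype=np.float32)  # start from the identity transform
        _, warp = cv2.findTransformECC(ref, img, warp, cv2.MOTION_EUCLIDEAN, criteria)
        aligned = cv2.warpAffine(img, warp, (ref.shape[1], ref.shape[0]),
                                 flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        acc += aligned

    cv2.imwrite("stacked.png", (acc / len(files)).astype(np.uint8))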


Maybe motion blur has something to do with it? I've been using photogrammetry techniques occasionally, when I needed them, in film/3D work. One thing I haven't tried, which I wanted to and which you've now reminded me of, is going through a space with a RED camera, cranking the shutter to eliminate motion blur, and maybe even using the HDRx option (for textures, later on). I'll see what Photoscan comes up with.


Reminds me of parts of Photosynth [1]; glad to see people are still working on this kind of thing. I was very disappointed that Photosynth didn't really pan out with building 3D models off the images; they went with just panoramas.

[1] https://en.wikipedia.org/wiki/Photosynth


I guess it's not quite the same, but the majority of 3D models in Google Maps' 3D mode are autogenerated.


This is really nice, but PLEASE don't put an image of a YouTube video with a play button at the beginning of the article only to embed the actual video later... it took me way too long to realize I was clicking on an image.


I clicked 15 times before using inspect element and realizing there was no link in it.



This seems to require some serious non-hobbyist hardware to generate the data needed to build the models, but the results are more impressive than this demo for sure!


We've made models using a cell phone camera


Interesting! I'd love to know more about that..


I'll try and post some images tomorrow, my colleague has them


Check out Reality Capture by Capturing Reality (yeah..really..they reversed the name of the company and the software). https://www.capturingreality.com

We have used this in conjunction with screen captures from Google Earth to regenerate environments.


Very nice. It needs two 360 videos.


Things like this give hefty credence to simulation theory.


This would have been news in 2005


Do you have a source for something similar from back then?

Honest question because I keep up with this field and this seemed pretty novel to me (at least the quasi-DIY aspect of it)


The underlying problem is termed "motion matching", "motion tracking" or "camera tracking".

I played around with it around 2009 or so, and at that time there were only scant open-source libs for bits of the problem, nothing to join them together, and nothing that worked particularly well; it was the domain of high-quality, high-price niche products aimed at Hollywood, which you could watch product videos of on YouTube but not actually afford to use.

I don't know how accessible it is now; maybe there are now working quality open-source libs and we can all start putting it into raspberry pi robots? ;)


You're talking about something different. This is a lot more than motion tracking. This is creating 3D models, not just a 3D path.


True! Creating a point cloud is the first step. The next step is to turn it into a mesh with faces.
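
As a rough sketch of that second step, assuming the dense point cloud has already been exported to a hypothetical "cloud.ply" and that Open3D is available (Poisson reconstruction is one common choice here, not the only one):

    # Turn a point cloud into a triangle mesh via Poisson surface reconstruction.
    import open3d as o3d

    pcd = o3d.io.read_point_cloud("cloud.ply")   # hypothetical dense point cloud
    pcd.estimate_normals()                       # Poisson needs (roughly oriented) normals

    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=9)
    o3d.io.write_triangle_mesh("mesh.ply", mesh)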


Snavely released bundler 9 years ago: http://www.cs.cornell.edu/~snavely/bundler/

There were a bunch of other papers around the time.

I was downvoted a bunch here, which is odd, as this really would have been news in 2005, when it would have been considered state of the art.

Right now there is nothing here which wasn't published already a decade ago.

Disclosure: I've been working on structure from motion software for the last decade.


> Right now there is nothing here which wasn't published already a decade ago.

I downvoted your initial comment, but because I thought it was unhelpful rather than because I thought it was untrue. By contrast, I upvoted your more recent comment that mentions your expertise and defends your view with a useful link to decade old software.

Still, I'd guess that for many people outside your field, "would have been news" is not the same as "published already a decade ago". I'd guess the majority are interested in what's currently achievable using off-the-shelf hardware and ready-to-run software, and aren't bothered that it may be weak in theoretical advances. Alternatively phrased, people may consider the performance and availability newsworthy even if the theory isn't cutting edge.

That said, I'm familiar with neither the state of the art nor the state of the theory. Are you saying that you could strap the same consumer camera rig to your head, take an unplanned stroll through a forest or city, and achieve the same model quality by running the resulting video through Bundler? If so, you would have a strong case that the parent article is accepting the hype of the press release a little too easily.


You nailed it. You could use pix4d, photoscan, bundler + pmvs, inpho, areohawk, etc. many years ago to achieve exactly what this is showing, with whatever cameras you happened to have at the time, strapped to whatever you felt like strapping them to.

The author of the article did not do due diligence on the subject.


This is nothing new: Google's "Cardboard Camera" app has been around for a few years and does almost exactly the same thing. https://play.google.com/store/apps/details?id=com.google.vr....


This looks like it is actually creating a 3D model. The app you linked to, I think, just stitches photos together into something closer to Street View. It's not actually 3D; it just looks 3D because you have enough photos to cover the whole field of vision.


They don't do a 3d model, but it is a real stereo image. There's a bit more info in their docs: https://developers.google.com/vr/concepts/cardboard-camera-v...


> it is a real stereo image

Until you look down. Or until you have a different eye distance than average. Or you try to move a bit and there are things close enough to make you dizzy.

Cardboard didn't need any of those fancy things, of course. It was bad enough with abysmal latency.



