The title is pure clickbait: it does not rewrite the rules of 3D vision. It's a marginal improvement on existing models, and it only works on images, not video. However, Apple open-sourced the model weights, which is amazing for research.
I made it about a third of the way down. It doesn't get any better. Gave up when I hit the auto-playing, unrelated video that you can't scroll past. Do people really keep reading an article while a video about something else plays in the top third of their screen? Totally nuts.
Just tried it on a "difficult" image (a relatively low-contrast photo of a small, thin plant in front of a tree trunk, with a distant fence in one corner) and it did a pretty good job, I think: https://imgur.com/a/Sqr6hR8 (including the depth maps).
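If anyone else wants to try their own edge cases, here's a minimal sketch using Hugging Face's generic depth-estimation pipeline; the model id and filename below are assumptions, so swap in whatever checkpoint the released weights are actually published under:

```python
from transformers import pipeline
from PIL import Image

# "apple/DepthPro-hf" is an assumed model id; substitute the actual
# Hub checkpoint for the released weights.
estimator = pipeline("depth-estimation", model="apple/DepthPro-hf")

image = Image.open("plant_in_front_of_tree.jpg")  # hypothetical test photo
result = estimator(image)

# The pipeline returns both the raw prediction tensor and a
# ready-to-view grayscale PIL image of the depth map.
result["depth"].save("depth_map.png")
print(result["predicted_depth"].shape)
```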
Was this trained on iPhone photos, since iPhone cameras already capture a decent amount of depth reference data? It's interesting to see how clearly it understands depth of field. With that in mind, how does it perform at f/16 and above?
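For context on why the aperture matters here: depth of field grows rapidly as you stop down, so defocus blur, one plausible monocular depth cue, has mostly vanished by f/16. A quick back-of-envelope using the standard hyperfocal distance formula H = f²/(N·c) + f, with roughly iPhone-like numbers as assumptions:

```python
# Hyperfocal-distance comparison: H = f**2 / (N * c) + f.
# The focal length and circle of confusion are assumptions,
# loosely based on a phone main camera.
f = 6.9e-3   # focal length in meters (~6.9 mm, assumed)
c = 3e-6     # circle of confusion in meters (assumed)

for N in (1.8, 16):  # wide open vs. stopped down to f/16
    H = f**2 / (N * c) + f
    print(f"f/{N}: hyperfocal distance ~ {H:.1f} m")

# At f/1.8 the hyperfocal distance is ~8.8 m, so nearby subjects sit
# against visibly blurred backgrounds; at f/16 it drops to ~1.0 m, so
# nearly the whole scene is sharp and any blur-based depth cue is gone.
```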
(Exaggerating a bit. Ancient Greek reliefs do have sculpted detail on the underside; the horses' legs, for example, come slightly detached from the surface. So they are not 100% depth maps.)