I have seen a few similar approaches. Basically, you have several views, each with a corresponding depth map, then render all of them and blend. As with MPI, the problem is that the Quest 2 doesn't have enough GPU power to do the blending (in my tests at least). So I will be pleasantly surprised if this method runs on Quest 2, and even more surprised if it shows up on Quest 2 + WebVR. There are also limits on video decode resolution, so more views means lower pixel density in the output; VR video professionals are extremely sensitive to this. So an approach like this works better for photos than for video.
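For context, here is a minimal sketch of the kind of per-pixel blending step being described, assuming each source view has already been reprojected into the target camera along with its depth; the function name and the depth-based weighting are illustrative, not anyone's actual implementation:

```python
import numpy as np

def blend_views(warped_colors, warped_depths, eps=1e-6):
    """Blend N reprojected views into one target image.

    warped_colors: (N, H, W, 3) float array, each view already warped
                   into the target camera.
    warped_depths: (N, H, W) float array, depth of each warped sample
                   (np.inf where a view has no valid sample).
    Returns an (H, W, 3) blended image.

    Hypothetical weighting: samples near the closest surface at a pixel
    get most of the weight, so occluded samples contribute little.
    This per-pixel weighted sum over N views is the kind of work the
    comment says is too heavy for the Quest 2 GPU.
    """
    min_depth = warped_depths.min(axis=0)                      # (H, W)
    # Weight falls off as a sample sits behind the closest surface.
    weights = 1.0 / (eps + np.abs(warped_depths - min_depth))  # (N, H, W)
    weights = np.where(np.isfinite(warped_depths), weights, 0.0)
    norm = weights.sum(axis=0) + eps                           # (H, W)
    blended = (weights[..., None] * warped_colors).sum(axis=0) / norm[..., None]
    return blended
```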
In contrast, we do one view with 2 layers, and we composite them with no blending. This rendering/encoding is optimized to work within the limits of the Quest 2.
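As a rough sketch of what "composite with no blending" could look like for a foreground/background layer pair: a hard per-pixel select rather than an alpha-weighted sum. The layer names and coverage test below are assumptions, not the actual shader:

```python
import numpy as np

def composite_two_layers(fg_color, fg_coverage, bg_color):
    """Hard composite: take the foreground sample wherever it exists,
    otherwise fall back to the background layer. No per-pixel weighted
    blend, so the cost is a single select per pixel.

    fg_color:    (H, W, 3) foreground layer.
    fg_coverage: (H, W) bool/0-1 mask, True where the foreground layer
                 actually has geometry after reprojection.
    bg_color:    (H, W, 3) background layer that fills disocclusions.
    """
    mask = fg_coverage.astype(bool)[..., None]   # (H, W, 1)
    return np.where(mask, fg_color, bg_color)
```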
The real-time Quest 2 version is on its way! But you are right about the decode resolution: RVS can handle any resolution, but the frame drops become significant. The main point is to use a view selection method to reduce the number of views.
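One plausible reading of "view selection" is simply picking the few source cameras closest to the current head position each frame; a toy sketch under that assumption (not RVS's actual selection criterion):

```python
import numpy as np

def select_views(head_pos, cam_positions, k=4):
    """Pick the k source cameras nearest the current viewpoint so only
    those views need to be decoded and rendered this frame.

    head_pos:      (3,) current eye/head position.
    cam_positions: (M, 3) positions of the M source cameras.
    Returns the indices of the k selected cameras.
    """
    dists = np.linalg.norm(cam_positions - head_pos, axis=1)
    return np.argsort(dists)[:k]
```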
In the linked demo, the videos are all compressed together before being decoded in real time; the final video packs 15 multiview video frames into one!
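A minimal sketch of unpacking such a frame-packed video, assuming the 15 views are tiled in a simple grid inside each decoded frame (the 5x3 layout and tile order are guesses for illustration, not the demo's actual packing):

```python
import numpy as np

def unpack_views(packed_frame, cols=5, rows=3):
    """Split one decoded video frame into its packed per-view tiles.

    packed_frame: (H, W, 3) decoded frame containing rows*cols views.
    Returns a list of (H//rows, W//cols, 3) view images in row-major order.
    Note the trade-off raised in the thread: more tiles per frame means
    fewer pixels per view at a fixed decode resolution.
    """
    h, w = packed_frame.shape[:2]
    th, tw = h // rows, w // cols
    views = []
    for r in range(rows):
        for c in range(cols):
            views.append(packed_frame[r * th:(r + 1) * th, c * tw:(c + 1) * tw])
    return views
```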