This paper is not about machine learning. "Training" has nothing to do with the approach. External cameras are used because this paper is about trajectory generation and not about vision.
The paper presents an approach to generating a time-optimal trajectory through waypoints, given the physical limitations of the underactuated system. This is interesting and novel and, as demonstrated, works very well. The authors' group also works a lot with high-speed machine vision, and one of the next research steps will be combining this trajectory generation algorithm with onboard computer vision.
I wonder if we could take hundreds or thousands of such pairs of 3D maps and accelerometer logs and use them to train a model that understands how to approach any new course generically.