This sort of system has long been in use for missile guidance, though obviously using images of much larger features. Look up Terrestrial Guidance Systems if you're interested in learning more. Interesting that the linked paper makes no reference to such systems.
Not sure how practical it would be on a large scale, like city-wide or larger, but in smaller areas such as a storage facility this could probably be used by forklifts and similar, especially if combined with something coarser like Wi-Fi or Bluetooth-based networks.
This reminds me of visual odometry from a backup camera; there were a bunch of papers on that subject about a decade ago. I don't know if it was ever used in practice. I suspect dead reckoning from accelerometers and gyros combined with map data is good enough in tunnels and other short-term GPS-denied environments.
I discovered this when I was looking for a (German?) company that implemented something similar and demonstrated it mounted on real cars, claiming something like "centimeter-accurate" precision. I couldn't find them anymore though :(
This is an interesting idea, but how does the cost compare, both in compute resources and in the setup effort of scanning the entire warehouse floor? Simply having robots with accurate motor encoders (and maybe IR sensors like in mice) and enough floor tags should allow the bots to move around the factory along safely defined paths. You can accomplish this with a basic microcontroller and can have robots quickly moving across the floor while scanning tags to update their positions. This image-based system would allow more dynamic paths to be taken, but now you need a much more expensive computer running image comparison and storing what I imagine to be a massive image dataset.
Motor encoders drift. If the wheels ever slip, your encoders are now telling you the wrong position. In fact, that's how this system works: they read encoders to get a dead reckoning position, then correct it with the visual system to give full accuracy.
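A minimal sketch of that loop (purely illustrative; the weighted blend stands in for whatever filter the authors actually use, and all the names here are made up):

    import math

    class Pose:
        def __init__(self, x=0.0, y=0.0, theta=0.0):
            self.x, self.y, self.theta = x, y, theta

    def integrate_encoders(pose, d_left, d_right, wheel_base):
        # Differential-drive dead reckoning: error grows whenever a wheel slips.
        d_center = (d_left + d_right) / 2.0
        d_theta = (d_right - d_left) / wheel_base
        pose.x += d_center * math.cos(pose.theta + d_theta / 2.0)
        pose.y += d_center * math.sin(pose.theta + d_theta / 2.0)
        pose.theta += d_theta
        return pose

    def apply_visual_fix(pose, fix_x, fix_y, fix_theta, weight=0.8):
        # Pull the drifted estimate toward the absolute visual fix. A real
        # system would use a Kalman filter; a simple blend shows the idea.
        pose.x += weight * (fix_x - pose.x)
        pose.y += weight * (fix_y - pose.y)
        pose.theta += weight * (fix_theta - pose.theta)
        return pose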
The advantage of a photo survey is that you can do it with the bot hardware (as in fact they do here; they dragged it around but I don't see why you couldn't do random-walk to cover an area), so setup cost should be relatively constrained. Computationally you're not exactly breaking the bank either - it's running at 4fps on an Nvidia Jetson TX1, which I suspect is overkill.
The drift: if you are moving a mouse on a desk, it's a very localized, temporary movement difference you have to track. The moment you try to scale this up, you'll amplify inaccuracies a lot.
Also if you pick up a mouse and place it at a different spot on a surface, the mouse won't be able to register what happened.
Whereas this research project actually accomplishes that: you place your camera anywhere on a previously scanned/mapped surface, and the software will instantly tell you where you are.
So the difference: a mouse only tracks movement by relative changes. This project maps a texture onto a globally known map and calculates an absolute location.
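A toy way to see the difference (the noise figures are made up, just to illustrate drift versus a bounded absolute error):

    import random

    true_x, mouse_x = 0.0, 0.0
    for _ in range(100_000):
        delta = 0.001                                # true motion per frame, metres
        true_x += delta
        mouse_x += delta * random.gauss(1.0, 0.05)   # relative tracking: noisy deltas accumulate

    absolute_x = true_x + random.gauss(0.0, 0.0005)  # absolute lookup: bounded error, no drift
    print(f"relative drift: {abs(mouse_x - true_x) * 1000:.1f} mm, "
          f"absolute error: {abs(absolute_x - true_x) * 1000:.1f} mm")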
It has been done successfully in the past. All you need to do is add the appropriate lens to focus further away.
Here they used an optical flow sensor for obstacle avoidance in a canyon. The closer you got to one side, the faster the terrain would pass by the sensor, allowing the UAV to self-correct.
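Something like this, I'd guess (a hedged sketch using dense optical flow; the gain and the simple left/right split are made up, not taken from that work):

    import cv2
    import numpy as np

    def centering_command(prev_gray, curr_gray, gain=0.5):
        # Dense optical flow over the whole frame.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)
        w = mag.shape[1]
        left, right = mag[:, :w // 2].mean(), mag[:, w // 2:].mean()
        # The nearer wall produces faster apparent motion, so steer away
        # from whichever side has the larger flow magnitude.
        return gain * (left - right) / (left + right + 1e-6)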
It's been used on hobby quadcopters for years, using a lens and IR LEDs for lighting, but it's only reliable up to a certain height. DJI uses the same technique with a camera, but GPS is more reliable and has higher resolution once you reach a certain height.
With the default firmware. Most mice with Avago image sensors have a little 8080 CPU that you can load new code into to do whatever you like, including global positioning.
How much image/feature data would the Jetson need to keep in RAM for an Amazon-size (huge) warehouse floor? Or better yet, what is the dataset size per m2 needed for successful positioning?
(I've not fully read the paper properly yet, by the way, but I've worked on 3D global image-based localisation professionally for the last 4 years.)
Compared to doing this for photos of buildings it's simpler, as you know what direction the camera is pointing. This means that instead of having to reconstruct a 3D point cloud, you can just keep a 2D image and be done with it. So at worst you'll need to keep an image of the entire area you want to position against on disk.
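As a rough sketch of what that looks like against a single 2D floor mosaic (the file name, survey resolution, and straight-down camera are all placeholder assumptions, not details from the paper):

    import cv2
    import numpy as np

    PIXELS_PER_METRE = 2000  # assumed survey resolution

    sift = cv2.SIFT_create()
    floor = cv2.imread("floor_mosaic.png", cv2.IMREAD_GRAYSCALE)
    map_kp, map_desc = sift.detectAndCompute(floor, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def localise(frame_gray):
        kp, desc = sift.detectAndCompute(frame_gray, None)
        matches = matcher.knnMatch(desc, map_desc, k=2)
        good = [m for m, n in matches if m.distance < 0.7 * n.distance]
        if len(good) < 4:
            return None
        src = np.float32([kp[m.queryIdx].pt for m in good])
        dst = np.float32([map_kp[m.trainIdx].pt for m in good])
        # Camera points straight down, so a 2D similarity transform is
        # enough; no 3D reconstruction needed.
        M, _ = cv2.estimateAffinePartial2D(src, dst)
        cx, cy = frame_gray.shape[1] / 2.0, frame_gray.shape[0] / 2.0
        px, py = M @ np.array([cx, cy, 1.0])
        return px / PIXELS_PER_METRE, py / PIXELS_PER_METRE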
But you shouldn't need to keep it all in RAM. As you're not flying, you can make the assumption that you're never going to jump from one corner of the place to another. So you can hold your current location and an area big enough to buffer against SSD access.
Update:
They have photographed the floor and pre-computed a map by extracting SIFT descriptors, which from memory are 128 × 32 bits. They throw most of these descriptors away and keep only ~50 per "image". To answer your question directly, an entire warehouse would probably be less than a gigabyte.
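Back-of-the-envelope, using those numbers (the keyframe density and per-keypoint overhead are my assumptions, not figures from the paper):

    BYTES_PER_DESCRIPTOR = 128 * 4                     # 128 floats at 32 bits each = 512 B
    DESCRIPTORS_PER_KEYFRAME = 50
    KEYPOINT_OVERHEAD = DESCRIPTORS_PER_KEYFRAME * 16  # assumed: x, y, scale, angle as floats

    bytes_per_keyframe = DESCRIPTORS_PER_KEYFRAME * BYTES_PER_DESCRIPTOR + KEYPOINT_OVERHEAD
    keyframes_per_m2 = 1.0                             # assumed survey density

    kb_per_m2 = bytes_per_keyframe * keyframes_per_m2 / 1024
    m2_per_gb = 1e9 / (bytes_per_keyframe * keyframes_per_m2)
    print(f"~{kb_per_m2:.0f} KB per m2, so roughly {m2_per_gb:,.0f} m2 per gigabyte")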
You shouldn't need a GPS fix, because you should know your position to much better accuracy using the visual odometry.
For large-scale 3D navigation, this approach is what we do: use a rough GPS location to cut down the search area. This was because we were using a stateless system. It's much more efficient to have a "rough localizer" to find that first fix, then optimise map loading based on the current precise position and likely heading.
On this smaller system, once you have found your initial position (which will at worst require you to go through the entire DB once), you can keep most things on disk and selectively load your active area (the maximum possible travel time + load time + a fudge factor).
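That active-area rule could look something like this (all constants are assumptions for illustration):

    import math

    MAX_SPEED = 2.0   # m/s, assumed top speed of the vehicle
    LOAD_TIME = 0.5   # s, assumed time to pull one map tile off the SSD
    FUDGE = 1.5       # safety multiplier
    TILE_SIZE = 1.0   # m, side length of one map tile

    def active_tiles(x, y, replan_interval):
        # Tiles the vehicle could possibly reach before the next load finishes
        # must already be resident; everything else can stay on disk.
        radius = MAX_SPEED * (replan_interval + LOAD_TIME) * FUDGE
        i0, i1 = math.floor((x - radius) / TILE_SIZE), math.floor((x + radius) / TILE_SIZE)
        j0, j1 = math.floor((y - radius) / TILE_SIZE), math.floor((y + radius) / TILE_SIZE)
        return {(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)}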
Warehouse AGVs have been using this since the '90s. I can imagine that creating the reference points or maps for public use would need an OSM type of crowdsourced content.
https://en.wikipedia.org/wiki/TERCOM#DSMAC