Latency in training literally does not matter; you care about throughput. In serving, where latency does matter, most DL frameworks let you serve the model from a highly optimized C++ binary, no Python needed.
The quote is 'data spend 2500 ms on the wire'. That's not latency, that's throughput. On a nice 10GbE connection (roughly 1.25 GB/s), 2500 ms is optimistically 3 GB or so worth of data. Do you have 3 GB of training data? Then it will spend 2500 ms on the wire being distributed to all of your nodes as part of startup.
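A back-of-the-envelope sketch of that arithmetic (the function name and the idealized assumption of full link utilization are mine, not from any library):

```python
def wire_time_ms(data_gb: float, link_gbps: float = 10.0) -> float:
    """Ideal time in milliseconds to push data_gb gigabytes
    over a link_gbps link, assuming 100% utilization."""
    gigabits = data_gb * 8                # bytes -> bits
    seconds = gigabits / link_gbps        # time at line rate
    return seconds * 1000.0

# ~3.1 GB over an ideal 10GbE link takes about 2500 ms:
print(wire_time_ms(3.125))  # 2500.0
```

In practice you won't hit line rate (TCP overhead, framing, contention), so the real figure for 2500 ms of transfer is somewhat under 3 GB, which is why "optimistically" above.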