> state-of-the-art classification networks have an accuracy in the 90% range.

If you're referring to ImageNet SOTA, the full dataset has more than 20,000 classes, including 120 different dog breeds [1], and even the commonly benchmarked subset has 1,000 classes. This is a vastly different task from reliably detecting pedestrians. Tesla can actively curate a dataset of hard examples from their fleet, whereas ImageNet is fixed, sometimes has low-quality labels, and has as few as a couple of hundred examples for some classes. Tesla can also pick a point on the ROC curve that gives higher recall at the cost of more false positives, which is important for VRUs (vulnerable road users) specifically. Another big factor is that Tesla is using video, not still images, which makes predictions even more robust.
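To make the ROC point concrete, here is a minimal sketch of picking an operating threshold for a binary "pedestrian / not pedestrian" detector so that recall reaches a target, and then reading off the false positive rate you pay for it. The function names and the toy scores are made up for illustration; this is not how Tesla (or anyone in particular) does it, just the general idea.

```python
import math

def threshold_for_recall(scores, labels, target_recall):
    """Return the highest threshold whose recall is >= target_recall.

    A sample counts as a predicted positive when score >= threshold.
    Hypothetical helper for illustration only.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # How many positives must be caught to reach the target recall.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]

def recall_and_fpr(scores, labels, threshold):
    """Recall and false positive rate at a given threshold.

    Assumes the data contains at least one positive and one negative.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Toy data: four pedestrians (label 1) and four non-pedestrians (label 0).
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.50, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for target in (0.75, 1.0):
    t = threshold_for_recall(scores, labels, target)
    recall, fpr = recall_and_fpr(scores, labels, t)
    print(f"target recall {target}: threshold={t}, recall={recall}, fpr={fpr}")
```

On this toy data, moving the target recall from 0.75 to 1.0 lowers the threshold from 0.60 to 0.40 and doubles the false positive rate from 0.25 to 0.5. That is the trade-off in miniature: for a safety-critical class you accept more false alarms in exchange for fewer misses.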

And that's just for pedestrians. Tesla is also using a general ViDAR (visual LiDAR), which is trained to detect obstacles that do not have a specific class. The ViDAR again operates on image sequences, not a single image, and can thus pick out structure from motion.

[1] https://en.wikipedia.org/wiki/ImageNet