
You have a good point, and I probably should have been clearer. When I said same distribution and same parameters, the parameters I had in mind were mean and variance. Though to be fair, mean and variance aren't formal parameters of every distribution.

Can you give an example of successful synthetic data generation that doesn't need to map to the same distribution? I'm surprised by that idea.

Well, in a sensing-for-autonomous-vehicles type of problem, it's actually more important to have simple, easy-to-specify data distributions than ones that map to reality. The real distribution may in any case be so poorly or incompletely understood that it's impossible to write a requirement against it.

So, as a simple example, the illumination in a real dataset might be strongly bimodal, with comparatively few samples at dawn and dusk, but in a synthetic dataset we might want to sample light levels uniformly across a range specified in the requirements document.

Similarly, on the road, the majority of other vehicles are seen either head-on or tail-on, but we might want to sample uniformly over different target orientations to ensure that our performance is uniform, easily understood, and does not contain any gaps in coverage.
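
To make that concrete, here's a minimal sketch of both sampling schemes. The parameter names and ranges (illuminance_lux, target_heading_deg) are invented for illustration, not taken from any real requirements document:

    import random

    def sample_scene_params(rng: random.Random) -> dict:
        # Sample uniformly over the specified ranges instead of matching the
        # bimodal day/night and head-on/tail-on statistics of real logs.
        return {
            "illuminance_lux": rng.uniform(1.0, 100_000.0),  # hypothetical range
            "target_heading_deg": rng.uniform(0.0, 360.0),   # relative to ego vehicle
        }

    rng = random.Random(42)
    params = [sample_scene_params(rng) for _ in range(10_000)]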

Similarly, operational experience might highlight certain scenarios as being particularly high-risk. We might want to over-sample in those areas as part of a safety strategy: use logging to identify near-miss or elevated-risk scenarios, then bolster the dataset in exactly those areas.
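
As a rough sketch of that over-sampling idea (the scenario tags and weights below are invented, standing in for whatever the logging analysis actually flags):

    import random

    # Invented scenario tags; weights > 1 over-sample the flagged cases.
    SCENARIO_WEIGHTS = {
        "nominal_highway": 1.0,
        "nominal_urban": 1.0,
        "cut_in_close": 5.0,         # flagged by near-miss logging
        "pedestrian_occluded": 8.0,  # flagged by elevated-risk review
    }

    def sample_scenario(rng: random.Random) -> str:
        tags = list(SCENARIO_WEIGHTS)
        weights = [SCENARIO_WEIGHTS[t] for t in tags]
        return rng.choices(tags, weights=weights, k=1)[0]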

In general, the synthetic dataset should cover the real distribution, but you may want it to extend beyond it, with extra weight on edge cases which may not occur all that often but which either simplify your requirements specification or provide extra safety assurance.

Also, given that it's impossible to make synthetic data that's exactly photo-realistic, you also want enough variation in enough different directions to ensure that the model generalizes across the synthetic-to-real gap.
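
One common way to get that variation is domain randomization: jitter each rendered image along several nuisance axes so the model can't latch onto renderer-specific artifacts. A sketch, with perturbation axes and magnitudes that are illustrative assumptions rather than a tuned recipe:

    import numpy as np

    def randomize(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        # image: float32 in [0, 1], shape (H, W, 3)
        out = image * rng.uniform(0.7, 1.3)                     # global brightness
        out = np.clip(out, 0.0, 1.0) ** rng.uniform(0.8, 1.25)  # gamma / contrast
        out = out + rng.normal(0.0, 0.02, size=out.shape)       # sensor-like noise
        out = out + rng.uniform(-0.05, 0.05, size=3)            # per-channel color cast
        return np.clip(out, 0.0, 1.0)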

Also, I'm not sure how much sense the concepts of mean and variance make in these very high-dimensional spaces.


In the physical sciences there are plenty of domains where accurate measurements are sparse. In a case close to home for me, it's measurements of water depth off coasts (accurate to centimeters, on a grid of size meters). The places where you have these measurements in the real world can be counted on one hand. But now you want to train an ML algorithm to guess water depth in environments all over the world, so your data needs to be representative of a range of possible cases that lie outside the real data. This differs slightly from the GP, who I think is talking about creating data that isn't represented in the real world at all but that would help an algorithm predict real-world data anyway. But they are fairly related topics.



