Well-designed benchmarks have a public sample set and a private test set. Models are free to train on the public set, but they can't game the benchmark or overfit to those samples that way, because they're only scored on examples they haven't seen.
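To make that concrete, here's a minimal sketch of why the held-out private set resists memorization. The data and function names are hypothetical, not any real benchmark's API:

```python
def evaluate(model, examples):
    """Score a model on a list of (prompt, expected) pairs."""
    correct = sum(1 for prompt, expected in examples if model(prompt) == expected)
    return correct / len(examples)

public_samples = [("2+2?", "4"), ("capital of France?", "Paris")]   # free to train on
private_test_set = [("3+5?", "8"), ("capital of Japan?", "Tokyo")]  # never released

# A model that simply memorized the public set scores perfectly there...
memorizer = dict(public_samples).get
print(evaluate(memorizer, public_samples))    # 1.0
# ...but the reported score comes from the private set only,
# so memorization alone doesn't help.
print(evaluate(memorizer, private_test_set))  # 0.0
```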
> By default, we will not use your inputs or outputs from our commercial products to train our models.
> If you explicitly report feedback or bugs to us (for example via our feedback mechanisms as noted below), or otherwise explicitly opt in to our model training, then we may use the materials provided to train our models.
A company publishing its own policy does not mean it will adhere to it.
We have already seen "rogue" employees at other companies conveniently violate their policies. Some notable examples were in the news within the past month (e.g., xAI).
Don't forget the earlier scandals in which Amazon and Apple both paid millions in settlements over eavesdropping by their voice assistants.
You should not expect privacy from a system that phones an external server, regardless of whatever public policy the company proclaims.
Which is why GP said:
> so effectively you can only guarantee a single use stays private
Not all benchmarks are well-designed.