That's a good question. They run the system on a small scale and validate it there. The assumption is that no new error mechanism magically switches on when the simulation gets large enough, but if it did, there would be no way to know.
Hopefully large-scale, verifiable demonstrations become viable in the near future. But currently they're just too hard to implement.