The inference code and model architecture IS open source[0] and there are many other high quality open source implementations of the model (in many cases contributed by Google engineers[1]). To your point: they do not publish the data used to train the model so you can't re-create it from scratch.
If for some reason you had the training data, is it even possible to create an exact (possibly same hash?) copy of the model? Seems like there are a lot of other pieces missing like the training harness, hardware it was trained on, etc?
to be entirely fair that's quite a high bar even for most "traditional" open source.
And even if you had the same data, there's no guarantee the random perturbations during training are driven by a PRNG and done in a way that is reproducible.
Reproducibility does not make something open source. Reproducibility doesn't even necessarily make something free software (under the GNU interpretation). I mean hell, most docker containers aren't even hash-reproducible.
Yes, this is true. A lot of times labs will hold back necessary infrastructure pieces that allow them to train huge models reliably and on a practical time scale. For example, many have custom alternatives to Nvidia’s NCCL library to do fast distributed matrix math.
Deepseek published a lot of their work in this area earlier this year and as a result the barrier isn’t as high as it used to be.
That is completely different discussion. Otherwise, even Gemini 2.5 Pro would be open-source with this logic since clients are open-source for interacting with the cloud APIs.
The inference code and model architecture IS open source[0] and there are many other high quality open source implementations of the model (in many cases contributed by Google engineers[1]). To your point: they do not publish the data used to train the model so you can't re-create it from scratch.
[0] https://github.com/google-deepmind/gemma [1] https://github.com/vllm-project/vllm/pull/2964