A vanilla U-Net is around 7-8M parameters; this one is 100M(?), so the model itself is an order of magnitude larger. There are larger models too, as pointed out in the other Hacker News thread.
The fine-tuning datasets are much smaller, but that's the point: they don't need to be large, because of the foundation model underneath.
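For intuition on why small fine-tuning sets can work, here's a minimal sketch of frozen-backbone fine-tuning in PyTorch. The `FoundationModel` and `SegmentationHead` classes are hypothetical stand-ins (the toy backbone is nowhere near 100M parameters); the point is just that freezing the pretrained weights leaves only a tiny head to fit, so little data is needed:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for whatever pretrained backbone and
# task-specific head are actually used.
class FoundationModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.encoder(x)

class SegmentationHead(nn.Module):
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.proj = nn.Conv2d(dim, n_classes, 1)

    def forward(self, feats):
        return self.proj(feats)

backbone = FoundationModel()   # pretrained (toy here; ~100M in the real thing)
head = SegmentationHead()      # tiny, task-specific

# Freeze the foundation model: only the head receives gradients, so a
# small fine-tuning dataset only has to constrain a small parameter count.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in head.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable:,}  frozen: {frozen:,}")

optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
```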