2. It has a pretty high-quality fine-tuning dataset. I initially missed this, and it's a very important advantage.
3. (speculatively) it doesn't collapse to extremely short responses (as BLIP2 and other models trained on image-text caption pairs do), possibly because of how small/simple the adapter is; see the sketch below.
I was interested in training a BLIP2-LLaMA model before this, and I might still do it just to test (3).
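To make (3) concrete, here is a minimal sketch of what a "small/simple" adapter can look like: a single linear projection from frozen vision-encoder features into the LLM's token embedding space, as opposed to a full learned transformer module like BLIP2's Q-Former. The dimensions and class name are illustrative assumptions, not the configuration of any particular model.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Hypothetical 'small/simple' adapter: one linear layer mapping
    frozen vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder
        # returns:      (batch, num_patches, llm_dim) "visual tokens" that get
        #               prepended to the text embeddings fed to the LLM
        return self.proj(vision_feats)

# Example: project 257 ViT patch tokens into the LLM embedding space
adapter = LinearAdapter()
visual_tokens = adapter(torch.randn(1, 257, 1024))
print(visual_tokens.shape)  # torch.Size([1, 257, 4096])
```

The intuition in (3) is that an adapter this thin has little capacity to overwrite the LLM's generation behavior, so the model keeps producing long-form answers instead of collapsing to caption-length outputs.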