2. It has a pretty high-quality fine-tuning dataset. I initially missed this, and it's a very important advantage.
3. (speculatively) it doesn't collapse to extremely short responses (as BLIP2 and other models trained on image-text caption pairs do), possibly because of how small/simple the adapter is; see the sketch below.
I was interested in training a BLIP2-LLaMA model before this, and I might still do it just to test (3).
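To make (3) concrete, here is a minimal sketch of what a "small/simple" adapter can look like: a single linear projection from frozen vision-encoder features into the LLM's token embedding space, as opposed to a full learned transformer module like BLIP2's Q-Former. The dimensions and class name are illustrative assumptions, not the configuration of any particular model.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Hypothetical 'small/simple' adapter: one linear layer mapping
    frozen vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder
        # returns:      (batch, num_patches, llm_dim) "visual tokens" that get
        #               prepended to the text embeddings fed to the LLM
        return self.proj(vision_feats)

# Example: project 257 ViT patch tokens into the LLM embedding space
adapter = LinearAdapter()
visual_tokens = adapter(torch.randn(1, 257, 1024))
print(visual_tokens.shape)  # torch.Size([1, 257, 4096])
```

The intuition in (3) is that an adapter this thin has little capacity to overwrite the LLM's generation behavior, so the model keeps producing long-form answers instead of collapsing to caption-length outputs.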