It sounds like a BLIP2 with an extra linear layer for finetuning (or aligning the Q-former with a new LLM?). What makes it more powerful than BLIP2?


It's better because

1. It's using Vicuna as its base LLM.

2. It has a pretty high quality fine-tuning dataset. I initially missed this, and it's a very important advantage.

3. (Speculatively) it doesn't collapse to extremely short responses (which BLIP2 and other models trained on image-text caption pairs tend to do), because of how small/simple the adapter is.

I was interested in training a BLIP2-LLaMA model before this, and I might still do it just to test (3).
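For anyone curious what "an extra linear layer" means concretely, here's a minimal PyTorch sketch of the adapter idea being discussed: a frozen Q-Former emits a fixed set of query tokens, and a single trained linear projection maps them into the frozen LLM's embedding space as a soft visual prompt. The dimensions, class name, and forward pass are illustrative assumptions, not the actual MiniGPT-4 / BLIP2 code.

```python
import torch
import torch.nn as nn


class VisionToLLMAdapter(nn.Module):
    """Projects frozen Q-Former query outputs into a frozen LLM's input embedding space.

    Hypothetical sketch: dimensions and names are placeholders, not taken from
    any released implementation.
    """

    def __init__(self, qformer_dim: int = 768, llm_dim: int = 4096, num_queries: int = 32):
        super().__init__()
        # The only trainable piece: one linear projection applied to each query token.
        self.proj = nn.Linear(qformer_dim, llm_dim)
        self.num_queries = num_queries

    def forward(self, query_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens: (batch, num_queries, qformer_dim), produced by the frozen Q-Former.
        # Returns (batch, num_queries, llm_dim): embeddings shaped like LLM token
        # embeddings, ready to be prepended to the text prompt's embeddings.
        return self.proj(query_tokens)


if __name__ == "__main__":
    adapter = VisionToLLMAdapter()
    fake_queries = torch.randn(2, 32, 768)   # stand-in for Q-Former output
    visual_prompt = adapter(fake_queries)
    print(visual_prompt.shape)               # torch.Size([2, 32, 4096])
```

Because only this projection is trained (the vision encoder, Q-Former, and LLM all stay frozen), the fine-tuning stage is cheap, which is part of why such a small adapter plus a good instruction-style dataset can make a big difference.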



