Do I understand this correctly: they just took BLIP-2 and replaced the LLM with Vicuna, and to bridge the two they added a single linear layer to translate between the frozen vision encoder and the (frozen) Vicuna? Additionally, and importantly, they manually create a high-quality dataset for finetuning their model.
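If my reading is right, the trainable part is tiny. Here's a minimal PyTorch sketch of what that single projection layer could look like (the class name and the dimensions are my assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Single trainable linear layer mapping frozen vision features
    into the (frozen) LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # In this setup these are the only trainable parameters.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_visual_tokens, vision_dim) -> (batch, num_visual_tokens, llm_dim)
        return self.proj(vision_features)


# Toy usage: pretend these are outputs of a frozen vision encoder.
vision_features = torch.randn(2, 32, 768)  # batch of 2, 32 visual tokens
projector = VisionToLLMProjection()

# In a real setup you'd freeze everything except the projector, e.g.:
# for p in vision_encoder.parameters(): p.requires_grad = False
# for p in llm.parameters(): p.requires_grad = False

visual_embeds = projector(vision_features)
# These projected embeddings can then be prepended to the text token
# embeddings and fed through the frozen LLM as "soft prompt" tokens.
print(visual_embeds.shape)  # torch.Size([2, 32, 4096])
```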
If that is the case, then this is really a very, very simple paper. But I guess simple things can lead to great improvements, and indeed their results seem very impressive. Goes to show how much low-hanging fruit there must be in deep learning these days for anyone leveraging the amazing, and amazingly general, capabilities of LLMs.