Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks, which carry a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does: a special scheme for initializing vision transformers. Schemes like initialization that encourage the model toward the right kind of solution without excessively constraining it amount to a soft preference for simpler solutions — the model starts out behaving like the simpler architecture but remains free to depart from it during training.
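To make the idea concrete, here is a minimal sketch of one way a soft locality bias could be injected at initialization — an additive, Gaussian-shaped bias on the attention logits so that, before training, each image token attends mostly to its spatial neighbors (convolution-like behavior). This is an illustrative toy, not the paper's actual initialization scheme; the function names and the Gaussian form are my assumptions.

```python
import numpy as np

def local_attention_bias(grid, sigma=1.0):
    # Hypothetical helper: additive attention-logit bias favoring
    # spatially nearby patches on a grid x grid layout of image tokens.
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)],
                      dtype=float)
    # Squared Euclidean distance between every pair of patch positions.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    # Gaussian falloff: a *soft* locality preference, not a hard mask.
    return -d2 / (2.0 * sigma ** 2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# At initialization the content-based logits are near zero, so the bias
# dominates and attention looks convolution-like; because the bias is
# just an additive term, training can freely override it.
grid = 4
bias = local_attention_bias(grid)
logits = np.zeros((grid * grid, grid * grid))  # stand-in for W_q W_k logits
attn = softmax(logits + bias)
```

The key point is that nothing here restricts the hypothesis space: unlike a convolution's hard weight sharing, the learned logits can grow to swamp the bias, so the locality preference is soft.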