My intuition would be that both text and fingers are high-frequency regions of an image, and most patches across most images don't carry that much high-frequency content. Text and fingers are also relatively rare compared to e.g. clouds and surface textures. Because text patches are both rare and hard, the models just don't dedicate many parameters to them.
This could be entirely wrong, however. It would be interesting to see what would happen on a dataset with nothing but text.
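
One quick way to sanity-check the frequency part of this intuition would be to measure, per patch, what fraction of spectral power sits above some radial frequency cutoff. A minimal sketch, assuming NumPy; the cutoff value and the two synthetic patches are just illustrative stand-ins, not real data:

```python
import numpy as np

def high_freq_fraction(patch, cutoff=0.25):
    # Fraction of the patch's spectral power above a radial
    # frequency cutoff (in cycles/pixel; Nyquist is 0.5).
    spectrum = np.fft.fftshift(np.fft.fft2(patch))
    power = np.abs(spectrum) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(patch.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(patch.shape[1]))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return power[radius > cutoff].sum() / power.sum()

# Sharp alternating pixels (a crude stand-in for text strokes)
# vs. a smooth horizontal gradient (a crude stand-in for sky).
sharp = (np.indices((32, 32)).sum(axis=0) % 2).astype(float)
smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
print(high_freq_fraction(sharp))   # ~0.5: half the power at Nyquist
print(high_freq_fraction(smooth))  # near 0: power concentrated at DC
```

Running something like this over patches sampled from a large image dataset would show how rare high-frequency patches actually are, which is the premise the whole argument rests on.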