My intuition would be that both text and fingers are high-frequency parts of an image, and most patches across most images don't contain that level of high-frequency detail. Text and fingers are relatively rare in images compared to e.g. clouds and textures on surfaces. Because text patches are both rare and hard to model, the models just don't dedicate many parameters to them.

This could be entirely wrong however.
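A quick way to sanity-check the frequency half of that intuition would be to compare the high-frequency energy of a text-like patch against a smoother one using a 2D FFT. A minimal sketch (the 0.25 cutoff and the two toy patches are arbitrary choices of mine, not anything measured on real data):

    import numpy as np

    def high_freq_ratio(patch, cutoff=0.25):
        """Fraction of spectral energy above `cutoff` * Nyquist (radially)."""
        spectrum = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
        h, w = patch.shape
        yy, xx = np.mgrid[-(h // 2):(h + 1) // 2, -(w // 2):(w + 1) // 2]
        # Normalised radial frequency: 1.0 = Nyquist along each axis.
        radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
        total = spectrum.sum()
        return float(spectrum[radius > cutoff].sum() / total) if total > 0 else 0.0

    rng = np.random.default_rng(0)
    smooth = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))    # gentle gradient, cloud-ish
    texty = (rng.random((32, 32)) > 0.5).astype(float)      # crude stand-in for glyph edges
    print(high_freq_ratio(smooth), high_freq_ratio(texty))  # expect texty >> smooth

Running something like this over patches sampled from a real dataset, split into "contains text/fingers" vs. everything else, would show whether the frequency gap is actually there.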

It would be interesting to see what would happen with a model trained on a dataset of nothing but text.


