ksampath02's comments

One interesting part of this model's pretraining process is how they used Qwen2.5-VL and Qwen2.5 to parse public unstructured data and expand the corpus from 18T to 36T tokens. If this can be done consistently, it will push legacy companies to train their own models and strengthen their edge.

You could try Aryn DocParse, which segments your documents before running OCR: https://www.aryn.ai/ (full disclosure: I work there).


I will try that, thanks.


May he rest in peace. His name will be remembered and his impact felt for a long time.

