ksampath02's comments

ksampath02 · 2025-04-28T21:59:41 1745877581

One interesting part of this model's pretraining process is how they used Qwen2.5-VL and Qwen2.5 to parse public unstructured data and expand the corpus from 18T to 36T tokens. The ability to do this consistently will push legacy companies to train their own models and enhance their edge.

ksampath02 · 2024-12-17T20:07:07 1734466027

You could try Aryn DocParse, which segments your documents first before running OCR: https://www.aryn.ai/ (full disclosure: I work there).

bambax · 2024-12-17T20:49:47 1734468587

I will try that, thanks.

ksampath02 · 2024-10-10T06:36:22 1728542182

May he rest in peace. His name will be remembered and his impact felt for a long time.