To elaborate what the parent was talking about when they said "the way it worked...

To elaborate what the parent was talking about when they said "the way it worked" (it might also be what you're talking about). My guess is that the box is useful to straighten, unskew, and scale the image before applying the OCR.

You could /try/ to do it looking at the baseline and cap heights of the text itself, but you wouldn't know the scale in either direction. I imagine those font identifiers would have been a lot better if they could have known those things.