Current state-of-the-art approaches in this field are significantly better than existing commercial solutions at recognizing text in the wild (i.e. scene text).

As an example, see the ICDAR 2015 results [1], where the Google Vision API scores 59.60% (Hmean) while the best entries are over 80%. Note that this test only measures localization, i.e. finding where the text is without recognizing its actual content, though it uses a more challenging dataset.
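
For readers unfamiliar with the metric: Hmean is the harmonic mean of precision and recall (i.e. the F1 score). A minimal sketch of how it is computed follows; the function name `hmean` and the precision/recall values are made up for illustration and are not taken from the leaderboard.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was detected
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, NOT any actual leaderboard entry.
print(round(hmean(0.80, 0.50), 4))  # 0.6154
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a detector cannot reach a high Hmean with good precision alone.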

As for recognition, see the table on page 6 of this paper [2]. The "IIIT5K None" column should be pretty close to what was done in the OP: it uses the same dataset, and recognition accuracies are around 80%, while the Google Vision API is at 322/500 = 64.4%. Note that since this paper is only about recognition, there is no localization step beforehand. Such a step would act as a filter and decrease the accuracy a bit, by failing to localize some text that the recognition step would otherwise be able to recognize.
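
To make the arithmetic and the "filter" effect concrete, here is a rough sketch. The function names and the 0.90 localization recall are hypothetical, and multiplying the two rates assumes localization and recognition errors are independent, which is a simplification.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of word images recognized correctly."""
    return correct / total


def end_to_end_accuracy(localization_recall: float,
                        recognition_accuracy: float) -> float:
    """Rough estimate: a word only counts if it is first localized and
    then recognized. Assumes the two steps fail independently."""
    return localization_recall * recognition_accuracy


# The figure quoted above for the Google Vision API.
print(f"{accuracy(322, 500):.1%}")  # 64.4%

# Hypothetical: an 80%-accurate recognizer behind a localizer that
# finds 90% of the words ends up at roughly 72% end to end.
print(round(end_to_end_accuracy(0.90, 0.80), 2))  # 0.72
```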

[1] http://rrc.cvc.uab.es/?ch=4&com=evaluation&task=1

[2] https://arxiv.org/pdf/1603.03915.pdf