rubbing are much easier to do, while I'm guessing transcriptions are not. So you could provide a lot of training data.
You could then train one shared model on rubbing->text, and each research center could train its own photo->rubbing model for its particular photography setup.
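
To make the split concrete, here is a rough sketch of how the two stages could be wired together. Everything in it is a made-up placeholder, not a real model: `TranscriptionPipeline`, `make_center_model`, and `shared_rubbing_to_text` are hypothetical names, and the "models" are toy functions standing in for trained networks. The only point is that the rubbing->text stage is shared while the photo->rubbing stage is swapped per center.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in type: a real system would use image arrays.
Image = List[List[float]]


@dataclass
class TranscriptionPipeline:
    photo_to_rubbing: Callable[[Image], Image]  # trained per research center
    rubbing_to_text: Callable[[Image], str]     # one model shared by everyone

    def transcribe(self, photo: Image) -> str:
        # Stage 1 normalizes the center-specific photo into a rubbing-like image.
        rubbing = self.photo_to_rubbing(photo)
        # Stage 2 reads the text off the normalized image.
        return self.rubbing_to_text(rubbing)


def shared_rubbing_to_text(rubbing: Image) -> str:
    # Toy placeholder for the shared model: just counts dark pixels.
    dark = sum(1 for row in rubbing for px in row if px < 0.5)
    return f"<{dark} dark pixels>"


def make_center_model(brightness_offset: float) -> Callable[[Image], Image]:
    # Toy placeholder for a per-center model: undoes that center's lighting offset.
    def photo_to_rubbing(photo: Image) -> Image:
        return [[px - brightness_offset for px in row] for row in photo]
    return photo_to_rubbing


# Each center plugs its own first stage into the same second stage.
pipelines: Dict[str, TranscriptionPipeline] = {
    "center_a": TranscriptionPipeline(make_center_model(0.0), shared_rubbing_to_text),
    "center_b": TranscriptionPipeline(make_center_model(0.3), shared_rubbing_to_text),
}

# The same inscription photographed under two different (toy) lighting setups.
photo_a: Image = [[0.2, 0.9], [0.4, 0.8]]
photo_b: Image = [[0.5, 1.2], [0.7, 1.1]]

text_a = pipelines["center_a"].transcribe(photo_a)
text_b = pipelines["center_b"].transcribe(photo_b)
```

With this layout, a new center only has to train the first stage on its own photo/rubbing pairs and can reuse the shared second stage unchanged.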