The operation you are suggesting would not be based on CRDTs. One of the main points is that they are conflict-free, as the name suggests, so rejecting a mutation is not allowed. The merge operation "magically" has to ensure the result will always conform to the schema. That's precisely the hard part: defining such a merge operation.
So what if the server just rebases the edit, and that results in the caption being set a second time? I.e. most recent wins. Don’t see how this would break a schema?
Encoding “adding a caption” as “add an additional child called caption” would be silly, as we know there can only be one caption. So the op would be “set the caption”.
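To make that concrete, here is a rough sketch of what "set the caption" could look like as an op. The `Figure` and `SetCaption` types and the `apply_op` helper are made up for illustration and not taken from any real editor or CRDT library. The point is only that the op overwrites a single slot, so replaying two of them in any order still leaves exactly one caption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Figure:
    """Hypothetical node type: the schema allows at most one caption."""
    image_url: str
    caption: Optional[str] = None


@dataclass(frozen=True)
class SetCaption:
    """Hypothetical op: overwrite the caption slot rather than append a child."""
    text: str


def apply_op(fig: Figure, op: SetCaption) -> Figure:
    # Replaces whatever caption was there, so the result has exactly one.
    return Figure(image_url=fig.image_url, caption=op.text)


fig = Figure("cat.png")
# Two users set the caption; the server applies the ops in arrival order.
for op in (SetCaption("caption from user 1"), SetCaption("caption from user 2")):
    fig = apply_op(fig, op)
```

After both ops the figure still has a single caption, and it is whichever one was applied last.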
1. User 1 adds a caption
2. User 2 adds a caption.
3. Both transmit to the server.
4. Server receives the first edit, applies it to the doc, broadcasts the change.
5. Server receives the second edit, sees it was made against the previous version of the document, rejects the edit.
6. The client that receives the rejection rebases its edit and attempts to apply it again, and it is rejected because it does not conform to the schema.
So the outcome is whichever client hits the server first wins, which is fine. Where's the problem?
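The steps above can be sketched as a toy server with a version check. Everything here (`Server`, `AddCaption`, the result strings) is a hypothetical illustration of the flow described, assuming the op is "add a caption" and the schema allows only one. It is not how any particular system implements it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddCaption:
    """Hypothetical edit: add a caption, made against a given doc version."""
    base_version: int
    text: str


class Server:
    """Toy central server: one document, one optional caption."""

    def __init__(self) -> None:
        self.version = 0
        self.caption: Optional[str] = None

    def submit(self, edit: AddCaption) -> str:
        # Step 5: edits made against an older version are rejected.
        if edit.base_version != self.version:
            return "rejected: stale"
        # Step 6: the schema allows only one caption.
        if self.caption is not None:
            return "rejected: schema"
        # Step 4: apply the edit and bump the version.
        self.caption = edit.text
        self.version += 1
        return "accepted"


server = Server()
edit1 = AddCaption(base_version=0, text="caption from user 1")
edit2 = AddCaption(base_version=0, text="caption from user 2")

r1 = server.submit(edit1)                         # first edit to arrive
r2 = server.submit(edit2)                         # made against version 0
rebased = AddCaption(server.version, edit2.text)  # client rebases its edit
r3 = server.submit(rebased)                       # a caption already exists
```

In this sketch the first edit is accepted, the second is rejected as stale, and the rebased retry is rejected by the schema check, so the document keeps the caption from whichever client reached the server first.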