The "reinvention" is not complete and will never be necessary. The difference is that XPath is necessary to query XML because XML is a botched, horribly overcomplicated, designed-by-committee markup language. Apart from tools like jq, no such language is actually required for JSON, because JSON maps onto data structures that already exist in the host language.
Neither JSON Schema nor XML Schema is particularly popular - and for good reason. Say you want to create a schema that limits the field "country" to ISO 3166-1 country codes - either you:
* Keep that schema file updated by hand every time something like Sudan breaking in two happens (no).
* Write a program that generates the schema (seriously... no)
* Do schema validation in code where it belongs - pulling in relevant validation data from canonical sources, rather than some markup language invented by people who didn't have the imagination to consider a really common use case.
There's a lot of benefit to being able to state what keys may be specified in a certain location, though. Look at DSLs like CloudFormation, for instance. Having schema validation could make static analysis of this kind of code much easier to handle. E.g.: Fn::Sub may be used inside of Fn::Join, but the reverse is not true, regardless of the types "returned" by each. It's certainly possible to validate via the API, but being able to do it in my editor would make finding errors much faster.
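A static check for that kind of nesting rule might look something like this sketch (the rule as stated above and the template shape are simplified assumptions for illustration, not the real CloudFormation grammar):

```python
# Illustrative sketch: walk a parsed template and flag Fn::Join appearing
# anywhere under Fn::Sub, per the (simplified) nesting rule stated above.
def find_bad_nesting(node, inside_sub=False, path="$"):
    """Return the paths of Fn::Join nodes nested inside an Fn::Sub."""
    errors = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}"
            if key == "Fn::Join" and inside_sub:
                errors.append(child_path)
            nested = inside_sub or key == "Fn::Sub"
            errors.extend(find_bad_nesting(value, nested, child_path))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            errors.extend(find_bad_nesting(item, inside_sub, f"{path}[{i}]"))
    return errors

template = {
    "Value": {"Fn::Sub": ["${x}", {"x": {"Fn::Join": [",", ["a", "b"]]}}]}
}
print(find_bad_nesting(template))  # → ['$.Value.Fn::Sub[1].x.Fn::Join']
```

The same walk could be driven from declarative nesting rules, which is exactly the kind of thing an editor plugin would want a schema for.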
To your other point, however, dynamic code generation is becoming much more common. AWS generates a huge amount of its code from JSON definitions across multiple languages to keep its SDKs up to date. I could see schema validation being valuable in this domain as well.
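As a toy illustration of that SDK-generation idea, stubs can be emitted from a JSON service definition (the definition and naming scheme here are invented; real generators are far more involved):

```python
# Toy sketch: generate function stubs from a hypothetical JSON service
# definition, in the spirit of SDKs generated from machine-readable specs.
import json

definition = json.loads("""
{
  "service": "storage",
  "operations": {
    "PutObject": ["Bucket", "Key", "Body"],
    "GetObject": ["Bucket", "Key"]
  }
}
""")

def generate_stubs(defn):
    """Emit one Python function stub per operation in the definition."""
    lines = []
    for op, params in defn["operations"].items():
        args = ", ".join(p.lower() for p in params)
        lines.append(f"def {op.lower()}({args}): ...")
    return "\n".join(lines)

print(generate_stubs(definition))
```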
> * Keep that schema file updated by hand every time something like Sudan breaking in two happens (no).
There is a lot of use for libraries dealing with time and dates. When you want to cover all cases, at some point you get to the situation when you have to allow variable number of seconds in a minute - not always 60, but sometimes 59 or 61, or may be even different numbers. And you don't know in advance - for arbitrary long future - which minutes will have which number of seconds.
So, for your timekeeping system to maintain precision, you have to allow external updates for when a minute will be considered non-60 seconds.
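To make that concrete, here is a minimal sketch of a minute-length lookup driven by an externally updated table rather than a hard-coded 60 (the table is a stub with one real entry - the positive leap second at the end of 2016 - and a real system would refresh it from an external source such as the IERS bulletins):

```python
# Minimal sketch: minute length looked up in an externally updated table
# rather than assumed to be 60. Only one real entry is shown: the positive
# leap second inserted at 2016-12-31 23:59 UTC.
LEAP_SECOND_MINUTES = {
    ("2016-12-31", 23, 59): 61,
}

def seconds_in_minute(date, hour, minute):
    """Return how many seconds the given UTC minute contains."""
    return LEAP_SECOND_MINUTES.get((date, hour, minute), 60)

print(seconds_in_minute("2016-12-31", 23, 59))  # 61
print(seconds_in_minute("2017-01-01", 0, 0))    # 60
```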
And those cases could happen more often than changing a list of valid country codes.
The point isn't to avoid it. Of course it's inevitable - that was my point! The point is to use code to validate instead of some markup so that the programmer can use their judgment about how it should be delegated.
I wrote some example code below that shows how you can validate against a list of countries in such a way that no code changes are required when the list changes.
JSON Schema, at least, can refer to a URI for the definition of something, and that URI can refer to only a specific section of the JSON document to which it points.
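For example, a schema fragment might delegate a property's definition to a specific section of a remote document via a JSON Pointer fragment (the URL here is hypothetical):

```json
{
  "type": "object",
  "properties": {
    "country": {
      "$ref": "https://example.com/common-defs.json#/definitions/country"
    }
  }
}
```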
The point I was making was that you shouldn't use a "special" language for validation at all - you should just use a library in a regular language to do it.
Anyway, code:
yaml_text:

John: Yemen
James: South Sudan

python code:

from strictyaml import load, MapPattern, Str, Enum
import pycountry

yaml_text = """\
John: Yemen
James: South Sudan
"""

# Every value must be a current country name from pycountry's ISO 3166 data.
result = load(
    yaml_text,
    MapPattern(
        Str(),
        Enum([country.name for country in pycountry.countries]),
    ),
)
full disclosure: I wrote the validation library ^^
The idea behind XML schema, DTD, etc. is to pick a simple language to express schemas in, so that implementations in different languages have a decent chance of being compatible with each other.
Python isn’t a good choice there, as it is too flexible. For example, that code could have gotten the list of allowed country names from a file, database, or URL.
⇒ If I have to send such JSON to you, I would almost have to write my program in Python, and even then, it could be hard for me to replicate your setup.
>that code could have gotten the list of allowed country names from a file, database, or URL.
That is exactly the point. You should be able to do that, because the canonical list of data could easily come from any of those sources, and it should be up to the programmer's discretion how to fetch it.
The point of validation is to stop invalid data from slipping through the net at minimum cost, and that's how you do it.
"Suden", "Sudaan" and "South Sudan" were all invalid countries in 2010, so that YAML was invalid. In 2012, "Suden" and "Sudaan" were still invalid but "South Sudan" was not, so that YAML was valid.
In the above example you have to make no code changes in order to account for that - just update pycountry every so often.
With XML schemas and DTDs, either you don't validate country at all (letting Suden and Sudaan through the net), or you rewrite and redistribute the schema by hand every time some dependency like the list of countries changes.
>If I have to send such json to you, I almost would have to write my program in python
Only if I choose to validate that data using a shared schema. Frankly, I've dealt with XML a lot and the number of times I've been handed a shared schema of any kind is very low. People just don't seem to use them. If they define an API in XML for instance they tend to just send examples and give a written explanation (e.g. insert valid country name here).
I don't see much value in making a schema more inherently "shareable" especially not if it means it has to be re-released every month.