There are a couple of reasons why we included schemas in the spec:
- JSON doesn't have a robust set of data types and, in particular, lacks a datetime/timestamp type. With a schema, a Tap can mark which fields in the JSON contain datetimes represented as strings, and a Target can then convert those fields to proper datetimes and handle them accordingly.
- Dealing with unstructured or flexibly structured data is hard. Requiring a schema forces a Tap author to think about the structure of the data up front. By validating each data point against a schema, the Tap author should be able to identify nuances in the data set more quickly (missing fields, nullable fields, mixed-type fields, etc.) and then either clean them out of the data, if appropriate, or provide a schema that informs downstream applications about them. Identifying and handling these problems requires an understanding of the source data set, so it is best done as close to the data source as possible.
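
As a rough illustration of both points, the sketch below checks a record against a schema and then converts the fields marked as datetimes. It is not part of the spec: the stream fields, the `SCHEMA` contents, and the helper names `find_problems` and `convert_datetimes` are hypothetical, and it assumes the schema is JSON Schema with datetime strings marked by `"format": "date-time"`. The validation covers only a small subset of JSON Schema; a real Tap or Target would more likely use a full validator library.

```python
from datetime import datetime

# Hypothetical schema for an illustrative stream (not taken from the spec).
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": ["string", "null"]},  # nullable field
        "created_at": {"type": "string", "format": "date-time"},
    },
    "required": ["id", "created_at"],
}


def matches(value, json_type):
    """Check a value against one JSON Schema type name (small subset only)."""
    if json_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if json_type == "string":
        return isinstance(value, str)
    if json_type == "null":
        return value is None
    return False


def find_problems(record, schema):
    """Tap side: list missing required fields and type mismatches."""
    problems = []
    for field in schema.get("required", []):
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, spec in schema["properties"].items():
        if field not in record:
            continue
        allowed = spec["type"]
        if isinstance(allowed, str):
            allowed = [allowed]
        value = record[field]
        if not any(matches(value, t) for t in allowed):
            expected = " or ".join(allowed)
            problems.append(
                f"{field}: expected {expected}, got {type(value).__name__}"
            )
    return problems


def convert_datetimes(record, schema):
    """Target side: parse string fields that the schema marks as date-time."""
    converted = dict(record)
    for field, spec in schema["properties"].items():
        value = converted.get(field)
        if spec.get("format") == "date-time" and isinstance(value, str):
            # Older Python versions do not accept a trailing "Z" in
            # fromisoformat, so rewrite it as an explicit UTC offset.
            converted[field] = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return converted


record = {"id": 1, "name": None, "created_at": "2017-01-01T12:30:00Z"}
if not find_problems(record, SCHEMA):
    print(convert_datetimes(record, SCHEMA))
```

The check runs on the Tap side, where the author knows the source data, while the conversion runs on the Target side, which only needs the schema to know which strings are datetimes.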