Sorry, I don't buy any of this.

Automating the process doesn't need to imply that there's a single service with direct access to all of the data. Just from a basic software engineering perspective, it makes a ton of sense for each product's data export to be a separate service owned by the product team, so no disagreement there. But by talking about how hard it is to figure out what data you have stored and export it correctly, you were implying that you had no such per-product service either, and that each export is an artisanal custom job.

The question of safeguards is interesting. I don't really see how having a human in the loop adds any real security: a computer is going to be far better at deciding whether the request is valid. As an operator who has been assigned a ticket to export account 123456, what are you going to do other than do that export? A computer, on the other hand, can verify whether the request is actually authorized. That can be done in a way where a compromise of your central data export service account can't be used to fake the authorization.

(A quick design sketch for one option: each account has a public-key keypair, managed by the identity system. When the central data export service requests an email verification, it does so by asking the identity system to sign a ticket. The identity system triggers a flow that asks the user to validate the request, and as part of the flow informs them of exactly what operation they are validating. User approval of the request signs the ticket with their private key. This ticket is sent to each data export service, which checks that the user id they're exporting has signed the ticket, and that the ticket contents match the request: i.e. same user id, the operation is a data export, and the data export covers this service. You will need to trust your identity system to not be compromised, but if it is, you're completely screwed anyway.)
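To make the check on the export-service side a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `Ticket` fields, the `authorize_export` function, the `"data_export"` operation name and the service names are all made up. The signature scheme is a toy textbook RSA built from the standard library so the example runs on its own. It is not secure, and a real system would use a vetted library and a scheme such as Ed25519.

```python
import hashlib
import json
from dataclasses import dataclass

# TOY textbook-RSA parameters, for illustration ONLY (tiny modulus, no padding).
# Assumption: in a real system the identity system holds the private key and
# export services only ever see the public half.
TOY_P = 2**61 - 1  # Mersenne prime
TOY_Q = 2**89 - 1  # Mersenne prime
TOY_N = TOY_P * TOY_Q
TOY_E = 65537
TOY_D = pow(TOY_E, -1, (TOY_P - 1) * (TOY_Q - 1))  # private exponent


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket contents that the user approves and signs."""
    user_id: str
    operation: str
    services: tuple  # names of the services the export covers

    def payload(self) -> bytes:
        # Canonical encoding so signer and verifier hash identical bytes.
        body = {
            "user_id": self.user_id,
            "operation": self.operation,
            "services": sorted(self.services),
        }
        return json.dumps(body, sort_keys=True).encode()


def _digest(payload: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(payload).digest(), "big") % n


def toy_sign(payload: bytes, n: int, d: int) -> int:
    """Done inside the identity system after the user approves the request."""
    return pow(_digest(payload, n), d, n)


def toy_verify(payload: bytes, signature: int, n: int, e: int) -> bool:
    return pow(signature, e, n) == _digest(payload, n)


def authorize_export(ticket, signature, requested_user_id, this_service, public_keys):
    """Run by each per-product export service before it exports anything."""
    key = public_keys.get(requested_user_id)
    if key is None:
        return False
    n, e = key
    return (
        ticket.user_id == requested_user_id       # same user id
        and ticket.operation == "data_export"     # the operation is a data export
        and this_service in ticket.services       # the export covers this service
        and toy_verify(ticket.payload(), signature, n, e)  # signed by that user's key
    )


# Usage: the identity system signs; each export service verifies.
ticket = Ticket("123456", "data_export", ("mail", "photos"))
signature = toy_sign(ticket.payload(), TOY_N, TOY_D)
public_keys = {"123456": (TOY_N, TOY_E)}
mail_ok = authorize_export(ticket, signature, "123456", "mail", public_keys)
calendar_ok = authorize_export(ticket, signature, "123456", "calendar", public_keys)
```

The point is that the central export service only ever passes the ticket along. It never holds the private key, so compromising it does not let an attacker mint a valid ticket for some other account or widen the scope of an existing one.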

> And assuming you could securely create this automated workflow, you'd still need a person manually verifying the end result to ensure that all the data scraped is in fact owned by the person who made the request. Within the past couple of years, there was a news story where someone got a different person's Alexa data after asking Amazon for their own data. That can't happen again.

The odds of a human doing a good job of this kind of validation are basically zero. Either they are following a checklist that a computer could execute more reliably, or they are just randomly poking at some 1 GB data dump trying to find the needle in the haystack.
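As a rough illustration of the "checklist a computer could execute" case, here is a minimal Python sketch of an ownership check over an export dump. It assumes, purely for the sake of the example, that every exported record carries an `owner_id` field. The field name, the record shape and both function names are hypothetical.

```python
def foreign_records(records, requested_user_id):
    """Return every record whose owner is missing or is not the requesting user."""
    return [r for r in records if r.get("owner_id") != requested_user_id]


def safe_to_release(records, requested_user_id):
    """Release the dump only if no record belongs to someone else."""
    return not foreign_records(records, requested_user_id)


# Usage: one good record, one owned by another account, one with no owner at all.
dump = [
    {"owner_id": "123456", "kind": "voice_clip"},
    {"owner_id": "654321", "kind": "voice_clip"},
    {"kind": "note"},
]
suspicious = foreign_records(dump, "123456")
```

A check like this looks at every record in the dump, which is not something a person skimming a 1 GB export can do.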