- What benefit does this give to the researchers who are publishing the data?
- Who is paying for storing the data - frequently in the TB?
When the answers are "approximately none" and "the researchers", you're not going to get many takers.
And that's assuming it's easy; in reality it often won't be. If there's a lot of data, it's literally just a pain to upload it (network bandwidth). If there's sensitive data (PII), it's a pain to redact it and make sure you aren't leaking any. Data is frequently in strange formats, and it's a pain to translate it to a standard one. Etc.
---
I've worked with three university labs as a contract programmer. In all of them I worked with data that had one of the issues mentioned above: health information in one, TBs of photonics data in another (which was being parsed by extremely janky code, too), and images with four 16-bit channels in the last (hundreds of GB of them, too). Admittedly, for the last it would have been easy to upload the data, as long as you didn't need people to be able to actually view it. (On the other hand, I wrote some software for the lab that let people false-color the images live in a browser, so the researchers could view them.)
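For anyone curious what "false coloring" a 16-bit channel for a browser involves, here's a rough sketch (this is not the lab's actual code; the percentile windowing and the blue-to-red ramp are just illustrative choices). You window the raw 16-bit values into [0, 1], then map that onto 8-bit RGB, which any browser can display:

```python
import numpy as np

def false_color(channel: np.ndarray, lo: float = None, hi: float = None) -> np.ndarray:
    """Map one 16-bit channel to an 8-bit RGB false-color image.

    Windows the data between the 1st and 99th percentiles (browsers can't
    show 16-bit depth, so you pick a window), then applies a simple
    blue-to-red gradient. Illustrative sketch only.
    """
    lo = np.percentile(channel, 1) if lo is None else lo
    hi = np.percentile(channel, 99) if hi is None else hi
    # Normalize into [0, 1]; guard against a degenerate window.
    t = np.clip((channel.astype(np.float32) - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    rgb = np.empty(channel.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = (t * 255).astype(np.uint8)           # red ramps up with intensity
    rgb[..., 1] = 0
    rgb[..., 2] = ((1.0 - t) * 255).astype(np.uint8)   # blue ramps down
    return rgb

# A synthetic 16-bit gradient image standing in for real lab data.
img16 = np.linspace(0, 65535, 256 * 256).reshape(256, 256).astype(np.uint16)
out = false_color(img16)
```

The real version served these over HTTP with adjustable windowing, since fixed percentiles wash out detail in some images; the principle is the same.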
This already exists in some fields though. Gene expression sequencing data is almost universally made public through the Gene Expression Omnibus website, and that’s quite storage intensive. It’s used since regulators and journals require it to be used.
> It’s used since regulators and journals require it to be used.
Which answers the "what benefit does it provide to the researchers publishing the data" question. A quick search answers the funding question as well: it's funded by the NIH, not the individual labs using it.
I think this example supports my point. The NIH came up with a way to give different answers to the two questions I asked, and it gets used. I'm glad the NIH has been making this a thing, it's a great use of public funds.
I'd still caution anyone against a "make a data platform and researchers will use it" approach to the problem unless they can answer those questions.