Many researchers were not compliant with their published data sharing statement (jclinepi.com)
205 points by miohtama on June 7, 2022 | 118 comments



Not a researcher, but I built a data governance system for sharing medical research data. When interviewing people, one thing that surprised me initially (but which makes total sense in retrospect) is that most researchers do not want to share their data. They want to keep it proprietary until they have published as much as possible on it. In many cases, their funders require the data to be shared, so they go through the motions, but if you give them a chance they will almost always try to deny access to anybody else. They're very afraid of someone coming along and publishing on their data before they do.

It's very much not an open source mentality. Everybody wants other scientists to share their data, but nobody wants to share their own.


It's about the incentives, not the mentality. Collecting the data is often the bulk of the work, but the rewards mostly come from publishing papers that make discoveries from the data. If you want to base science on experiments that are properly planned and executed, the scientists doing that work must get the recognition they need to advance their careers.


An arrangement that would make sense would be to separate data collection from publication and base the academic success of data collectors on how many publications use their data. That would incentivize data collectors to make their data as widely useful and available as possible, which is very counter to the current situation. I imagine data collection groups talking to a number of data analysis groups and figuring out what to collect that would be useful for the largest number of interesting experiments. This might also make it less likely that data is collected, intentionally or accidentally, in ways that favor a specific outcome.

Unfortunately, it's absurdly hard to change what counts as credit towards a successful academic career, so this seems like a pipe dream. Academia is still waging a very uphill battle to get Research Engineer to be a long-term viable career path, despite the absolutely critical need for software engineering in all areas of modern science.


Cannot agree more. My experience: 50 to 60% of the time goes to collecting data, 20% to the actual experiment (data processing), and 20% to writing the paper (and going through peer review).


Now think about what "collecting the data" means in the medical field:

- Preparing protocols and getting them through ethical review committees

- Finding suitable patients and getting them on board

- 1-on-1 patient consultations, oftentimes multiple for each patient, spread over a few years.

I'd say the balance can easily turn to 90% data collection, 10% everything else in many cases...


I don't see how there's any way 90% of the recognition and reward of published research can go to the data collectors over everybody else combined, even in the cases where data collection is really that onerous.

Which I guess is why all efforts to try to split up the work in a more efficient manner have failed and the current model persists. This means that open academic research effectively remains a cottage industry compared to industrial, military, and other research types that can be kept secret.


This is a great suggestion. Data collection is itself a (sometimes big) contribution to science and humanity's state of knowledge, and we should reward it accordingly.


Data collection is more expensive, more time consuming, and usually more tedious than data analysis and paper writing. The people good at collection are not necessarily good at analysis and vice versa. More people enjoy analysis than collection. The results of the analysis are what is useful to the rest of society. Also, the people doing the analysis need to have very detailed input into how the data is collected.

I was hoping stating it all together like that would make a solution more obvious, but I don't think it did. Having researchers commission data collection and every user having to pay would create barriers to entry we don't want.


You can't exactly decide what to reward. A lot of the important aspects of "reward" are implicit, non-legible, unstructured, informal. It's about how the community perceives you. About how many people walk up to you to chat with you at a conference. About how many requests you get to visit another lab. It's about intangible "status" and respect within the community. And it so happens that this informal respect is mostly given to people who make exciting discoveries. You can't top-down force the criteria for this. Data collection doesn't make you an exciting personality to invite to give a keynote speech or something. And those opportunities lead to more visibility, more collaborations, more papers, more grants, more recognition, more fame etc.


As a former (PhD) researcher, this is spot on. Here are the factors I've observed:

- fear of getting scooped: there's a general fear that sharing data makes it easier for competing labs to publish on the same topics, leading to the publishing lab "missing out" on publications

- fear of being wrong: sharing the data leads to more scrutiny, leading to a higher chance of a fatal flaw being found

- extra work: publishing data often involves jumping through extra hoops with little to no personal gain


Not a researcher, but have worked closely with them on data sharing.

I think there's a variant of getting scooped worth pointing out, which is that in some fields people are scared of being scooped by a low quality lab that fakes the results.

I had a very senior materials science researcher describe how what he thought was a promising line of study got shut down because a low quality lab in China scooped him with results good enough for publication but not good enough that people would want to replicate them. He was 100% confident their data was faked, but getting grant money once that experiment was "done" was like pulling teeth, so he abandoned the route for other ideas.


This is interesting. Not something that I ever came across but can very much believe this happens.

Most problems in academia come down to bad incentives. Researchers are incentivised to publish for prestige and citations, which are poor proxies for improving our understanding of the world around us.


All problems, and really all actions, can be explained by incentives. Once you understand that people will never under any circumstances do something that both takes effort and is opposite to their incentives, you understand why nearly every problem occurs and why every action is taken.


This statement is either so broad that it is obviously false, or has so many implied additional conditions that its application is incredibly narrow.

* The time horizon used to judge incentives varies from person to person.

* Willing martyrs exist for various causes. Therefore, it is not just the incentives that matter, but the belief about those incentives.

* Self-sacrifice for others exists (e.g. a mother rushing into a burning building for the chance of saving her children). Therefore, it is not just the incentives that matter, but also the subjective weighting of conflicting incentives.

* A person may spend a great deal of time researching betting patterns for playing craps, none of which can overcome the house's advantage. Therefore, it is not just the incentives that matter, but the knowledge of the outcomes.

* Somebody may work in an oppressive environment for their entire life, rather than changing careers. Therefore, it is not just the effort required that matters, but also the relative amount of effort on different time horizons.

While I agree that incentives are the best way to move population-level behavior, and are usually a good starting point for understanding individual behavior, your categorical statement doesn't have those caveats.


The billion-dollar question is what a better proxy would be.


Possibly a reformed patent system, e.g. one in which time limits are variable or in which you don't get a full monopoly (forced licensing terms). At one point I was researching an alternative patent system in which you don't get a monopoly on the idea; instead, every new licensee pays in a share of the total value of the patent: the first licensee pays half, which goes directly to the original filer; the next pays a third, split between the two existing rights-holders; the next pays a quarter; and so on. So the patent gets cheaper to license over time, but you lose the first mover advantages.

The problem with that system is that the "value" would have to be determined in some independent manner, e.g. an audited cost+ basis, which wouldn't be reflective of the true value of the idea. Granting time-limited monopoly rights allows the idea to be priced via normal market mechanisms, at the cost of high transaction costs.
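
To make the arithmetic concrete, here's a minimal Python sketch of that fee schedule (function names are mine; it assumes the k-th licensee pays value/(k+1), split equally among the k existing rights-holders):

    def license_fee(patent_value: float, k: int) -> float:
        """Fee paid by the k-th licensee (k = 1 for the first)."""
        return patent_value / (k + 1)

    def payout_per_holder(patent_value: float, k: int) -> float:
        """Share each of the k existing rights-holders receives from licensee k."""
        return license_fee(patent_value, k) / k

    # With a patent "valued" at 1,000,000: licensee 1 pays 500,000, all of it
    # to the filer; licensee 2 pays ~333,333, split between the filer and
    # licensee 1; licensee 3 pays 250,000, split three ways; and so on.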

At any rate, you get the idea. The justification for academic research is that the private sector, supposedly, won't do long term basic science and that grant recipients will. That's a very dubious set of assumptions. We see basic science being done by private labs all over the place, most obviously in AI but also in many other areas of computer science, we see it in the biotech sector, we see it in semiconductor physics etc. The areas where there's no private investment tend also to be the most controversial areas of academia where many people have concluded that the results are systematically useless e.g. social sciences. And meanwhile what we see amongst grant recipients is systematically unscientific behaviour like knowingly publishing claims that aren't actually true, refusing to share data even after promising to do so, accepting unvalidated modelling as 'science' and a million other things.


This doesn't make sense. Being scooped means other people published the same idea before you.

If he truly believes the released data were fake, he can for sure publish his results. It’s not unusual for scientists to have different views on certain subjects. Some journals even have specific discussion sections for contradictory results. Finding the truth and correcting scientific records is also an achievement.

And at least in the field of biology/medicine, there are journals accepting “second first” papers, for example https://elifesciences.org/articles/30076 .

And quite a lot of publications are not groundbreaking ideas. Incremental improvements can also be published.

I don’t understand your friend’s rationale for abandoning the whole project he had already started rather than finding the truth and proving his theory (maybe it’s a different-field thing).


> If he truly believes the released data were fake, he can for sure publish his results.

You need the funding to get to that point though, and since it wasn't "new" they weren't able to get funded to research it. That's the meaning I got from the comment.


> I don’t understand your friend’s rationale for abandoning the whole project he had already started rather than finding the truth and proving his theory (maybe it’s a different-field thing).

He has lots of ideas he wants to try, and he ranks them. The incentives for following this particular idea were reduced by the actions of the other lab, so it dropped in the rankings and he pursued one of the more highly ranked ones.


Am a researcher, and also work in data governance, data sharing, etc. for medical research data.

It's not about an "open source mentality" - almost all of my analytical code is online, and when I work with simulated data sets (the bulk of my work) it's freely available.

It's about incentives, and effort. A few examples:

1) Primary data collection is hugely time and resource intensive. It's hard to put in all of that work, get a single paper out, and then immediately be expected to make the data freely open and available.

The argument that it's necessary for reproducibility is a compelling one, but reproduction is also a much rarer request than what most people are actually looking for, which is to do new and novel research using your primary data.

2) Research has a potentially long time from data collection to publication. It's entirely possible the reason isn't "I may want to work on that someday" but "A graduate student is working on that right now". I have been in the position of potentially having a graduate student be scooped on something, and it is terrible (it ended up being a false alarm). They're potentially also the person who put in a lot of the labor on publishing the paper in the first place.

3) Sharing comes with an unknown expectation of support, documentation, etc. wherein there's absolutely no incentive structure in place to support the work that takes. I write in time and budget to do so in my grants, but that also means my grants have less science in them than other people's, and I have to rely on that mattering to someone enough to make up the difference.

4) Genuine data ownership issues. Health data is especially complex in this space. Do you want a simulated data set that will give you the same broad patterns? That's easy enough. But if you want the real data - that's potentially a very complex issue wherein even if I say "Sure, let's get that process started..." you're potentially looking at a year or more before you get the data.
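
To illustrate what "the same broad patterns" can mean in practice, here's a minimal sketch of one crude approach - fit means and covariances to the real (numeric) data and sample from those - with the caveat that real health-data synthesis is far more involved:

    import numpy as np

    def simulate_like(real: np.ndarray, n_rows: int, seed: int = 0) -> np.ndarray:
        """Draw synthetic rows from a multivariate normal fitted to the
        real data's column means and covariance matrix."""
        rng = np.random.default_rng(seed)
        mean = real.mean(axis=0)
        cov = np.cov(real, rowvar=False)
        return rng.multivariate_normal(mean, cov, size=n_rows)

    # Usage (hypothetical): fake = simulate_like(real_matrix, n_rows=1_000)
    # The fake rows preserve broad first/second-moment structure without
    # exposing any real patient's record.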


This x100.

I'd also add that primary data collection is risky. You can spend months to years working to get data, only to end up with nothing---not even a null result---if the experiment itself goes south. I lost over a year of my PhD when an animal I had spent ages training fell while playing. A classmate had a virus wipe out her colony of genetically-engineered mice. Using someone else's data obviates these sorts of risks--while doing a better job preparing you for non-academic careers too!

As a result, I think it's not totally nuts for the original data collectors to maybe get some kind of "risk premium." I can imagine a lot of ways to do it, but early/exclusive access to data has been one of them.


A colleague of mine during graduate school, by consensus, did everything right, and then just... no one in her cohort had the exposure she was interested in. To this day, no one is sure why.

The answer for a lot of this is to put in the hard work of building relationships with people who collect a lot of primary data, and make sure your work amplifies their work in a way that benefits you both. But that takes a lot of effort, is hard, and is uncertain in its own ways.


I don't have any direct research experience but am tangentially aware of some of this. I guess my question comes down to: why does getting scooped matter?

I definitely understand the worry when it comes to high profile discoveries, but surely the work a graduate student is doing isn't usually that high profile?

And to be clear, I'm not downplaying the importance of anyone's contributions, but rather asking why the focus isn't shifted onto where data is sourced from. If someone used your data, shouldn't it be immediately obvious if they also have data publishing requirements? And if that's true, why isn't it a bigger faux pas to use another lab's data before that lab has even published their paper on it?


Graduate students have like four good years of research time. If your work is scooped, that almost guarantees low citation counts, and very likely it means that the work cannot be published at all. A year down the drain.

If you want a faculty position after graduation you've got one shot to create a strong research profile. Getting scooped once can close career doors forever.

People also often don't understand what grad students do. They do all of the research. 100% of it. PIs are funding the lab and providing mentorship. The first authors on major groundbreaking papers are usually grad students.


In some ways the graduate student is working on the most important work of their career. A PhD requires novel publications, and if you get scooped on the main result of your thesis, so that you don’t end up with the published paper on that topic, very few organizations will grant the PhD. So this can result in losing out on getting a degree you put five years of your life into.


Being first is almost everything -- the product in academic research is new knowledge.

How much press was given to the second time the LHC discovered the Higgs, with higher precision?

Robert Dicke, in the running for "greatest experimental physicist in history", was working rapidly toward discovering the cosmic microwave background, with keen physical insight driving the research. His team was beaten by a team at Bell Labs, who, when they discovered an irreducible isotropic source of noise emanating from the Universe, literally called Dicke [1] to help them understand the signal.

Who got the Nobel Prize? The team from Bell Labs. Never mind that Dicke literally wrote the companion paper explaining why the Bell Labs measurement was so important. Why? Because they were first.

Regarding graduate students: It is extremely common in physics to have the lead author (and coordinator) on important/cutting-edge research be a graduate student. It is only the largest collaborations where this isn't always the case.

Graduate students tend to follow a single thread of research from beginning to end, learning the ropes on the way, while faculty generally manage and mentor multiple threads (each with a student) simultaneously. When it comes time to publish, frequently it is the student who knows all of the details of the measurement the best.

For your final question, the answer goes back to "because being first is almost everything", combined with "we have to give the credit to somebody, right?". If someone notices that the eighth word on 50% of Guinness World Record citations is "person", should the credit for the discovery go to the person who noticed that interesting fact, or should it primarily go to Guinness World Records, who assembled and produced the dataset over decades of work?

Or, more-concretely, when someone applies for and receives fifteen minutes of time on the Hubble Telescope and, using a clever time-domain analysis, happens to find a repeating optical signal that blinks 13 times, then 53 times, then 13 times, then 53 times (as good a SETI candidate as you'll find), who gets the credit? Is it the researcher with the keen insight and good luck, or is it the huge amalgamation of humanity, spanning a half-century of science and engineering, that made such a measurement possible?

[1] https://theconversation.com/the-cmb-how-an-accidental-discov....


So, the caveat to all of this is I'm actually in a field (epidemiology) where getting scooped matters less in many cases than it does for some others (math).

Getting scooped doesn't just matter for high profile discoveries. The example I was thinking of is the graduate student's research asking "Does this thing the field is doing make sense?"

That's a Yes or No answer. There's not a lot of interest once the question has been answered. Even if it can be published, it'll be a harder road, potentially finding itself in a less prestigious journal, etc. At worst, it's a write-off. Most graduate students are coming out with 2 or 3 paper dissertations, so the loss of one, or taking a hit on another one, can set someone behind, dramatically change the tenor of their job search, etc.

"And to be clear, I'm not downplaying the importance of anyones contributions, but rather why the focus isn't shifted onto where data is sourced from? If someone used your data, shouldn't it be immediately obvious if they also have data publishing requirements? And if thats true, why isn't it a bigger faux pas to use another labs data before that lab has even published their paper on it?"

I mean, it might not be the only thing that's been published on that data - that's the point. Once it's publicly available, it's publicly available, even if people are still working on it.

Indeed, one of the "Data is available upon reasonable request" things is "We have a student working on this, could you not?" (another being "You're a crank, and we're not sending you this to misinterpret/misuse it"), but a lot of open data folks actively dislike that for obvious reasons.


I can't speak for all research datasets but I can speak for sequencing based datasets such as whole genome sequencing, RNA sequencing etc. These projects can cost hundreds of thousands to millions of dollars. Academic and industrial research projects tend to do everything they can to avoid sharing this data for as long as possible.

The academic researchers that generate these datasets, and their funding sources such as granting agencies and donors, want to see the money being used to generate something important. In this case it is certainly in the researchers' interest to avoid sharing the data for as long as possible, to get as many publications as possible. This makes it easier to get more funding.

There is a push by journals now for researchers to include the data in publicly accessible repositories such as the NCBI SRA or the European Genome-phenome Archive. These sites archive the data and make it available for researchers. However, in most cases they require a comprehensive data sharing agreement between your institution and theirs. Most of these agreements have extremely demanding requirements that make it very difficult, if not impossible, for a requesting institute's legal team to agree to them. For example, the institute that owns the data may make demands such as "the right to approve any publications or research work created with this data", which entails sending any publication to them prior to submission and giving them veto power. I understand the reasoning for the agreements; their intent was to protect private information from being shared publicly while allowing researchers to use the data. But I think the system is being abused now.

On the other hand, it can be very difficult to get past institutional barriers to make data public. For example, sequencing data contains personally identifiable information, and getting approval to share the raw data can be difficult. For prospective studies, human patients need to consent to their data being shared, and may not consent to it being shared publicly.

I can't comment on industry as much. However, I have colleagues who work at companies and publish numerous papers on the same proprietary datasets but refuse to share them with anyone. This is particularly challenging in fields such as cancer research, where they may publish some fancy/superior new model for disease risk stratification on their own data, but without sharing it's impossible for other researchers to independently validate it.


I'm an RSE building such a tool right now for a clinical trial, and came in knowing this is the norm.

We're experimenting with building a data collection system that uses adversarial/multiplayer game mechanics to incentivize researchers to share (gain co-authorship) and to exclude them from the benefits (e.g. co-authorship) if the minimum submitted data threshold isn't met.

Still early days, but we figured this is the only way to get people to share data, even with other people within the same project/lab. Similarly, I'm getting a lot of ideas from HN on event-driven data collection and immutable data storage to help make the data more tamper-proof.
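
Purely as a sketch of what such a gate could look like (the thresholds and record structure below are invented for illustration, not our actual system):

    from dataclasses import dataclass

    @dataclass
    class Contributor:
        name: str
        records_submitted: int
        records_passing_qc: int

    MIN_RECORDS = 50        # hypothetical minimum-submission threshold
    MIN_QC_FRACTION = 0.9   # hypothetical data-quality bar

    def coauthorship_eligible(c: Contributor) -> bool:
        """Co-authorship is earned only by submitting enough records,
        enough of which pass quality control."""
        if c.records_submitted < MIN_RECORDS:
            return False
        return c.records_passing_qc / c.records_submitted >= MIN_QC_FRACTION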


I would love to talk to you about this. I am currently doing something similar, but augmenting the dataset by adding info on publications etc., thereby making them more "legit" (for lack of a better word).


This is really interesting! I'm interested in knowing more and sharing with colleagues in academic library/publishing tech, if you want to drop a link! (and are interested in sharing your work :) )


Couldn't this be fixed by having the creator of the data as a co-author on any paper that uses the data?

The basic conflict is that it's hard/expensive/time-consuming to make data, and easier to write papers on existing data sets. So just split the work, and make it a legitimate occupation to create good data sets. As of now, generating the data by itself has no benefits for scientists, other than the papers they can subsequently write using it. Feels a bit like having to make a screwdriver every time you want to work on your car.


A better solution is to properly credit researchers for releasing data that then gets used, by actually caring about citations to data.

That needs to be tracked better. Things are improving, but until there are metrics and tracking as a common part of evaluation, I don't think it will change.


As a sibling says, it is probably better to improve citation and culture around the importance of datasets.

As an experimentalist, I really want people to make use of my measurements, but I really don't want to be a co-author on a paper that makes wild and incorrect conclusions using my data.


I've heard one even worse.

I know people who were told by their PI to tell other researchers at conferences that they were working on things the lab had already tried and found didn't work. This would encourage competing labs to try out these techniques and waste their time.

Being scooped is a huge blow to a career, for faculty but more intensely for graduate students. With arXiv, you can be scooped at literally any moment rather than just on the known conference schedule. Being scooped was personally traumatic for me and I still have nightmares about the experience.

This leads to a massively toxic ecosystem of over-competition where individual labs win at the detriment of scientific progress.


Interestingly, one of the best ways to rack up citations in ML is to publish a benchmark dataset.


Perhaps the incentives need to be re-aligned to separate the collection of data and the publishing of studies based on that data.


It's not just publishing it before them, it's opening themselves up to criticism. If someone wants to look for mistakes or flaws in your data, it's easier with access.

Of course, arguing about the right answer is supposed to be what science is about. That doesn't make it any more pleasant.


Sounds like it's time to split the industry. One side collects the data and puts it into common repositories. Another group does research using the primary data.


That sounds very inefficient though.


Oh I would love to talk to you about this. I am working on building a similar system and encountered this so much and other issues.

Is there a way I could contact you?


Most researchers are not super smart, and so they cannot get out of the data more than is relatively obvious. If they work hard to obtain the data, but then share it freely, someone else will just steal their lunch.


I work as a bioinformatician (programmer in genomics); a typical paper may involve a dozen sequencing samples (~25 GB each) and a few thousand lines of scripts to produce CSVs/graphs.

When I started, my lab head got a request for an old paper (before my time), and I had to hunt down the data, put it on the web, find the code, send an email etc. - it was a lot of work, almost as much as just sharing it in the first place would have been.

So now all the code goes up on GitHub, all the sequencing reads to public archives, and we fill the supps full of any intermediate CSVs etc we think people might find useful.
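
For a flavor of the mechanical part of that submission work: read archives generally want per-file checksums alongside the metadata, so step one is usually building a manifest along these lines (directory layout and output name are made up; the exact fields required differ by archive):

    import csv
    import hashlib
    from pathlib import Path

    def md5sum(path: Path, chunk: int = 1 << 20) -> str:
        """Stream the file in chunks so large FASTQs don't blow up memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    with open("manifest.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "bytes", "md5"])
        for fq in sorted(Path("reads").glob("*.fastq.gz")):
            writer.writerow([fq.name, fq.stat().st_size, md5sum(fq)])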

I saw it as "people bothering me so I would do admin", but some lab heads see being requested as "strategic networking and being informed about other research in the field", where they can see what others are doing and gain collaborations / be put onto papers etc.

I guess it's just expectations: I've met surgeons who have the expectation that because they kept a piece of tumor from a surgery and gave it to a lab, they will be on any paper that ever uses data from that tissue until the end of time (even on labs on the other side of the world re-using the same data).

When I'm collecting public data, I usually don't bother to ask due to the hassle, and move straight on to the actually public stuff.

My bet is that papers with public code/data are used/cited way more often, but the tradeoff is probably that their authors produce fewer total papers overall.

Do you want a larger number of people slightly happy with you, or a smaller number of "close friends" in your area who may look kindly on you for grants or paper reviews?


I think paper citations are an archaic method of tracking contributions to research. We have all this technology, and Git is a great example, that allows us to more granularly give credit to contributors.

In fact, even the very idea of a published paper feels out of date. Why have one model that is set in stone that is published as one snapshot in time?

In the DevOps and Cybersecurity world our alerting mechanisms are essentially simple data models (sometimes not so simple) that are tapped into the raw data feeds, and can be updated and refined on an ongoing basis.

While instrumenting the physical world can have more challenges, I still feel that the sciences have a lot to be gained from some of the more "pedestrian" operational processes that we use in the technology industry.


I'll echo that it's a tremendous amount of work to appropriately format data and metadata for upload to a public repository. It's not as simple as just mass uploading raw data files. It's no surprise that people skirt around this requirement when possible, especially as the size of files and number of samples per paper continues to balloon.

That's not to say that it's not important, but the labor required is vastly underappreciated. I say this as someone who has completed the task for many papers.


Yeah, this is the main reason, as anyone actually doing research knows. I don’t think civilians realise what an incredible amount of work it is to publish a paper.


yeah. holy hell is it a lot of work. hate doing it even when I know I'll get a paper out of it.


Hell, even things like journals requiring formatting only after acceptance get taken as a godsend.


Is there some way to work this into the system? Like by having a different person do this work? At a very high level, all of science would benefit from more data being properly formatted and published, so maybe this is something that could be budgeted for in the planning phase? Or the work could be done by undergraduates studying for the field?


The amount for a non-modular (i.e. the default, you don't have to have a super detailed budget) NIH R01, which is the basic unit of biomedical science funding, has been stagnant since 1999.

Budgeting for a separate person to do this would eat up a tremendous amount of said budget.

Which is fine, except if we think about this from a productivity standpoint, the person who budgeted for that person is probably short a graduate student compared to the person who didn't, on the outside chance that someone cares about their stuff. For example, I looked at the stats for my lab's publicly available repositories linked to papers on GitHub:

Watches: 15 (probably half of these are people involved in the project). Forks: 3. Stars: 4. Visitors in the past month: 4.

That's it. For all of them. While I keep doing it based on principle, if I stopped tomorrow, it would impact me not at all.

And undergraduates, candidly, are not usually time savers.


I only have a bachelor’s degree, but I’ve worked in research labs and read a lot of papers. I’ve always thought it was silly that all data, design documents, and code is not published as a standard. Is it just protocol from a time when it was more difficult to share data and designs? In the modern era, why shouldn’t research colleges and journals maintain a digital database of data, design documents, and code?


Because, from my understanding: it's hard to preserve things and the code is usually pretty crappy and the data is often unintelligible without a lot of help from whoever was processing it. Obviously preservation of research outputs is a noble cause, but there's very little incentive for anyone to actually document and preserve these things unless it's a requirement of the field (e.g. high-profile AI stuff). The incentive to not do that is that you save time and can jump to the next project, and also no one is pestering you about your crappy code and/or data.

Think about all of the fancy dependencies a piece of code might need to run. Think of all of the processing steps some data needed to go through. Think of the work required to set up some computational cluster without which the work just doesn't happen. Doing it once is hard enough, documenting it so anyone else can do it is maybe an order of magnitude harder.


> it's hard to preserve things and the code is usually pretty crappy and the data is often unintelligible without a lot of help from whoever was processing it

Which is a huge part of why the replication crisis is such a thing (besides the outright fraud and the publication bias). The very fact that the datasets and codebases are so disgusting is precisely why the results coming from that data can't be trusted.


> Which is a huge part of why the replication crisis is such a thing

Is it though? From what I’ve seen it is mostly caused by p-hacking/small sample sizes/poor experimental design.


Those are the causes that were picked up on initially, because they're the problems you can detect just by reading a paper without getting into the details. They're not the only causes though, just some of the most visible.

An incomplete list of other causes might contain:

• Wrong data or maths. A remarkably large number of papers (e.g. in psychology) contain statistical aggregates that are mathematically impossible given the study design, like means which can't be calculated from any allowable combination of inputs. Some contain figures that are mathematically possible but in reality totally implausible [1]

• Fraud. I've seen estimates that maybe 20% of all published clinical trials haven't actually been done at all. Anyone who tried to figure out the truth about COVID+ivermectin got a taste of this because a staggering quantity of studies turned out to exhibit disturbing signs of trial fraud. Researchers will happily include obvious Photoshops in their papers and journals will do their best to ignore reports about it [2]

• Bugs. Code doesn't get peer reviewed, and sometimes isn't released either. The famous Report 9 Imperial College London COVID model had been in development since 2004 but was riddled with severe bugs like buffer overflows, race conditions, and even a typo in the constants for their hand-rolled PRNG [3]. As a consequence their model produced very different numbers every time you ran it, despite being passed fixed PRNG seeds on the command line (a toy sketch of how that can happen follows below this list). The authors didn't care about this because they'd convinced themselves it wasn't actually a problem (and if you're about to argue with me on this, please don't; if I have to listen to one more academic explaining that scientists don't have to write deterministic "codes" I'll probably puke).

• Pretend data sharing. Twitter bot papers have perfected the art of releasing non-replicable data analysis, because they select a bunch of tweets on topics Twitter is likely to ban, label them "misinformation" and then in their publicly shared data include only the tweet ID, not the content. Anyone attempting to double check their data will discover that almost all the tweets are no longer available, so their classifications can't be disputed. They get to claim their analysis is replicable because they shared their data even though it's not.

• Methods that are described too vaguely to ever replicate.

And so on and so forth. There are an unlimited number of ways to make a paper that looks superficially scientific, but doesn't actually tell us something concrete that can survive being double checked.
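
As a toy sketch of that seed-vs-determinism failure mode (an illustration, not the model's actual code): fix the PRNG seed so every run draws identical numbers, then accumulate them across threads. The additions happen in a scheduler-dependent order, and floating-point addition is not associative, so the "same" computation can print different totals on different runs:

    import random
    import threading
    from queue import Queue, Empty

    def run_once(seed: int = 42, workers: int = 4, n: int = 100_000) -> float:
        rng = random.Random(seed)      # fixed seed: identical draws every run
        q: Queue = Queue()
        for _ in range(n):
            # mixed magnitudes amplify order-dependent rounding
            q.put(rng.uniform(0.0, 1.0) * 10 ** rng.randint(-8, 8))
        total = 0.0
        lock = threading.Lock()

        def worker():
            nonlocal total
            while True:
                try:
                    x = q.get_nowait()
                except Empty:
                    return
                with lock:
                    total += x         # addition order depends on scheduling

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads: t.start()
        for t in threads: t.join()
        return total

    print(run_once(), run_once())  # same seed, totals can differ in low digits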

[1] e.g. https://hackernoon.com/introducing-sprite-and-the-case-of-th...

[2] https://blog.plan99.net/fake-science-part-i-7e9764571422

[3] https://dailysceptic.org/archive/second-analysis-of-ferguson...


> [...] computational cluster [...]

I would wager that the vast majority of analyses / code, in most fields of research, can run on a basic laptop. If I had to put a number on it, I'd say >99.999% of papers.


Maybe run in a matter of days or weeks or months, but people often want the results faster than that, especially if there will be a 2nd set of analyses or a conference is coming up.


There’s no way it is that high. A remarkably high percentage of work across the sciences is “computational” these days, which involves potentially very intensive simulation.


I think you’d be surprised how many are just a halfass R script, or some disorganized SPSS/Excel or Graphpad / other proprietary software work. :P

Not in terms of proportion of data of course, but in terms of proportion of papers.


Honestly, the stuff I run on a cluster is an order of magnitude easier to share than the stuff that can run on a basic laptop.


Lol, there was one time I churned out a series of magic numbers by repeatedly running several pieces of code in an IPython notebook, measuring (with significant contribution from eyeballing a graph) and tuning the input parameters by hand each time. The magic numbers worked and were put into production at LHC (yes, the Large Hadron Collider), but good luck turning the process into reproducible code without writing a lot of new annealing code which would at least double the amount of effort.

And I was the most organized programmer among my peers by a long shot. Judging from the messiness of shared code, I can’t imagine how bad other people’s unshared work is.


Some of the HPC code that I've used is definitely "Good luck, and god speed..." if I gave you the code.

Some of it is just really clean Python code that generates a bunch of simulated data to train some machine learning code on with human readable configuration files.

But neither one of them involves the words "$COUNTRY Ministry of Health approval...", which kicks things to a whole new level.


That would be a nice problem to have. In order to fix a problem it first needs to be an actual problem rather than a theoretical one.

Maybe the goal should merely be for the author to be able to replicate the results from the torrent archive (for a fee of course). That way anyone who needs the conclusion to be correct can buy the validation.


To illustrate a few problems with it being standard (importantly, note this paper was in a clinical epidemiology journal, so we are invariably talking about human health data):

1) Are you sure the data is deidentified? Can your analysis be done on the deidentified data?

2) Do you actually own the data? This is more complicated than "Well you published a paper on it...". The data is likely governed by data use agreements. Those are often extremely complex - especially at the international level. Or if you're working with particular populations - they're not wrong to be wary of who their data is given to, given the history of how science generally has treated them.

3) Is the data in a form that's genuinely suited to being shared? Is it in a flat file, with clear and easy to understand variable names, a good data dictionary, etc.? Is missingness, and the reasons for it, clear and evident (the difference between a variable with a lot of missingness and a variable that's a bit garbage is often subtle)? A sketch of what auditing this can look like follows after this list.

4) Where do you put it - and importantly, who pays for that, and for how long?
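
To make item 3 concrete, here's a sketch of a minimal pre-sharing audit: building a small data dictionary that surfaces both genuine missingness and sentinel "coded missing" values (the column handling and sentinel list are hypothetical):

    import pandas as pd

    SENTINELS = {-9, -99, 999}  # hypothetical "coded missing" values

    def data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
        rows = []
        for col in df.columns:
            s = df[col]
            # only scan numeric columns for numeric sentinel codes
            coded = int(s.isin(SENTINELS).sum()) if s.dtype.kind in "if" else 0
            rows.append({
                "variable": col,
                "dtype": str(s.dtype),
                "n_missing": int(s.isna().sum()),
                "n_coded_missing": coded,   # easy to overlook when sharing
                "n_unique": int(s.nunique()),
            })
        return pd.DataFrame(rows)

    # Usage: data_dictionary(pd.read_csv("study.csv")).to_csv("dictionary.csv", index=False)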


From my perspective (postdoc in astrophysics) things are quite complicated.

I agree generally with data and code sharing policies (all of my recent papers include zenodo deposits of data, and I work on a large open source simulation code). However, there are some serious issues here:

1. Code licensing is not always clear. Sometimes you are working with a code that some PhD student wrote, who left the field, whose advisor shared it with you. Can you share this code? Not necessarily... but this problem will go away with time.

2. Some codes include proprietary components, and chopping them out can make the code inoperable (or useless). One notable instance of this is a very famous hydro code that was used to produce predictions for nukes, that was then repurposed for running simulations of planetary collisions.

3. Sharing data is prohibitively expensive, and if you don't have the data, having the code might not really help. Our projects can produce PBs of data, costing tens to hundreds of millions of CPU hours, and as such making the data open can be really tricky. This then means you can only share a selection of runs. On the other hand, even if you had the code, you can't reproduce a lot of the results, because of the cost of running the simulations... This actually will get _worse_ over time as we run bigger and bigger simulations.


>1. Code licensing is not always clear. Sometimes you are working with a code that some PhD student wrote, who left the field, whose advisor shared it with you. Can you share this code? Not necessarily... but this problem will go away with time.

Isn't it the university who owns everything? I don't know about astrophysics, but it's a big problem in engineering. Universities have been known to patent and resell research made by MSc/PhD students and profit from it without paying any compensation.


This depends on the region/university/timescale/etc.

Consider this: student at institution A (UK) writes code, postdoc at institution B (USA) modifies it, shared with researcher at institution C (China) who runs it for a paper first-authored by student at institution D (Germany).

Who has the rights to share the code and data? Which jurisdiction would this fall under?


> In the modern era, why shouldn’t research colleges and journals maintain a digital database of data, design documents, and code?

Because the incentive in academia is to take a single idea that kinda works and parcel it out into as many, generally low quality, published papers as you can.

The moment you release the data somebody else can start parceling out those papers instead of you.


Makes me curious whether researchers would use software that integrates all these features for them automatically, making it easy to publish and share data sets in a consistent format online.


I think you've got to be asking:

- What benefit does this give to the researchers who are publishing the data?

- Who is paying for storing the data - frequently in the TB?

When the answers are "approximately none", and "the researchers" you're not going to get many takers.

And that's assuming it's easy; in reality it often won't be. If there's a lot of data, it's literally just a pain to upload it (network bandwidth). If there's sensitive data (PII), it's a pain to redact it and make sure you aren't leaking any. Data is frequently in strange formats, and it's a pain to translate it to a standard one. Etc.

---

I've worked with 3 university labs as a contract programmer. In all of them I worked with data with one of the issues mentioned above: health information in one, TBs of photonics data in another (which was being parsed by extremely janky code too), and hundreds of GB of four-channel 16-bit images in the last. Admittedly for the last it would have been easy to upload them, as long as you didn't want people to be able to actually view them (on the other hand, I wrote some software for the lab that let people false color them live in a browser, so that the researchers could view them).


This already exists in some fields though. Gene expression sequencing data is almost universally made public through the Gene Expression Omnibus website, and that’s quite storage intensive. It’s used since regulators and journals require it to be used.


> It’s used since regulators and journals require it to be used.

Which answers the "what benefit does it provide to the researchers publishing the data" question. A quick search answers the funding question as well, it's funded by the NIH, not the individual labs using it.

I think this example supports my point. The NIH came up with a way to give different answers to the two questions I asked, and it gets used. I'm glad the NIH has been making this a thing, it's a great use of public funds.

I'd still caution anyone from trying a "make a data platform and researchers will use it" approach to the problem unless they can answer those questions.


This is a niche that datasette (by simonw on here) could fill very nicely. If you can wrangle it into a CSV, then datasette can publish it as an API with a SQL interface.

I'm not sure how this would work with things like photos etc. (though there are plugins for some of this).
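
For instance, a sketch using the sqlite-utils companion library (by the same author; file and table names here are made up):

    import csv
    import sqlite_utils

    # load a CSV into a SQLite database that datasette can serve
    db = sqlite_utils.Database("lab_data.db")
    with open("measurements.csv", newline="") as f:
        db["measurements"].insert_all(csv.DictReader(f))

    # Then, from a shell:  datasette lab_data.db
    # serves the table with a web UI, a JSON API, and a SQL interface.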


The NIH has mandated that starting in 2023 all grants that it funds must include a plan for data sharing. A serious lingering question is whether there is actually going to be any enforcement, given that it is already a monumental challenge for investigators to even account for the data that has been submitted/shared/archived by their lab, forget some 3rd party trying to do it.

The tools for doing this across a whole diverse ecosystem are, shall we say, not user friendly (if they exist at all), and that is being charitable (and I say this as someone who has written some of those less-than-user-friendly tools). Most of the systems, archives, etc. are not interoperable; they don't share common APIs, much less some common interface that would allow someone to track all their published data, let alone integrate it.

Program officers are going to be desperate for ways to account for this stuff, but I worry that at least initially they are not going to have the leverage needed to get researchers to engage in even the most basic data management practices which would allow them to start solving the problem down the line.

How many files did you upload to archive X? Which project were they for? How many datasets are you expecting to produce for this grant? You have published 10 papers associated with this grant number, where are the datasets that correspond to those papers?

Some investigators can answer these immediately and easily, others will throw a screaming fit that they were even asked for such a thing.


The DAT protocol originally came out of this need. It's a very neat data sharing system that's designed to work with huge files, and it also came with a user friendly browser called Beaker Browser.

Unfortunately the project seems to have split or evolved into a different project or something. They went through some rebrands, a rewrite, Beaker got abandoned and I lost track of what they're doing now. I think their grant funding ran out. But it was a pretty good attempt at making this stuff easy or easier.


When I was a university researcher, after publishing my first paper in the role I carefully read the policy requiring that data be made available, as I wanted to be correct by the book, but also to have my data (which I took pride in) publicly available for critical review. The process as described was simple, though I soon found there was no mechanism implemented to actually achieve it: the URLs and systems no longer existed, and no alternatives were available. I requested to publish via an external provider and was informed I would be fired if I did so. Disappointing, to say the least.


The only way I'd expect raw datasets to be published online immediately in a useful, curated manner by a research group is if that was stipulated in their research grant, period, i.e. that's what the goal of the project was, at least in part.

Now this is not entirely uncommon in many research areas; you can get all the current weather and climate datasets with little difficulty as these are shared globally (as is necessary for effective short-term weather forecasting). Anyone publishing DNA/RNA/protein work generally uploads those sequences to GenBank, and that gets updated every two months, so is pretty current relative to the published literature:

https://www.ncbi.nlm.nih.gov/genbank/

However this article seems to focus on "Biomed" and that whole world is where pharmaceutical/corporate influence in academics is off the charts. I'd guess well over half of the researchers (if not over 90%) have some dream of getting their work patented, setting up a biomed startup, and having Pfizer or J&J swoop in and buy them out (they then get a percentage and become wealthy), and their university patent office is likely pushing for that as well (they get another percentage). No way are those people going to share their data, which might very well be slanted or cooked in ways that will make their little entrepreneurial effort more attractive to a big buyer. It's way more like pitching to a VC than doing actual rigorous scientific research in the tradition of open sharing of results and ideas - at least not until the patent is granted. Then you're into whether or not the clinical trial data will be released to the public, and corporations don't like to do that either (as someone might find that their claimed success story is riddled with statistical errors, etc.).

Business and science have orthogonal goals, and mixing the two is almost always a recipe for the rise of fraudulent con artists pushing bogus BS in the name of turning a quick profit.


If I were a billionaire, I'd fund a research institute whose only purpose would be to detect and call out fraud and shoddy work in academia. I'd hire the very best people I could in each field and put them onto this task. I imagine it would be easy to find excellent but disgruntled scientists in every niche.

You'd need to hire the best people with niche expertise. I know from my years in academia, nobody can understand and scrutinize your bullshit unless they work in the same sub-sub-sub-field.


Just create a Glassdoor-like site, but for ex-students to tell their experiences with research groups and advisors.


That's not going to work very well: you'll get a mixed cohort of people reporting actual fraud and deceptive practices (the desired group), people who were just pissed off because they didn't get the support they needed (a legitimate complaint, but not as relevant to the issue of fraud), and people who had negative personal interactions (not such a legitimate complaint, as they might have been the problematic person).


I think the same issues affect Glassdoor, but that didn't stop the site from being a hit. Just let people speak, and let the audience judge for themselves and filter out the outlier reports from problematic people.


But who would listen? There are already people who do this work for free, no billionaires needed, but we're all told not to listen to anything they say on pain of being classed as an anti-science "denier" / conspiracy theorist / etc.

Fundamentally, 97% of scientists (in this study) refused to share their data because the people paying them don't care. The people paying them also wouldn't care about whatever your billionaire foundation would say, and in fact many would class it as a right wing misinformation operation. Then you'd be blocked from Twitter, downranked on Google, blocked from YouTube and you'd be unable to hire the "best" people because they'd be socially ostracized for joining.

The problems here are really very deep, and they're social in nature. You can't fix it by simply pointing out the problems because everyone who looks already knows what the problems are. They're obvious and numerous. Fixing them requires fixing the incentives but politicians, the media and large chunks of the electorate refuse to acknowledge that research fraud can even exist at all (the "believe science" brigade).


that just means that you need to hire half disgruntled scientists and half disgruntled PR folks, journos, attack dogs, muckrakers, etc, doesn't it, then?


Valid points. It's an uphill battle, but still winnable over the long run. A big part of it is maintaining an extremely high level of scientific rigor and professionalism.

Take the NSF for example. (Pardon a US-centric example.) If you have a record showing that a particular PI has published 10 provably false/fraudulent papers, is the NSF really going to keep funding that person? At some point, the desire to avoid embarrassment and save their own skin may drive the NSF to de-risk by not funding the flagrant frauds.

Imagine if this were a world-class research institute, supported morally and with funding by people like Bill Gates and Elon Musk. Imagine if those people recognize that we can advance science and global progress much farther and faster if we tackle the epidemic of garbage research in academia (which I'd say is over 60%). Imagine running TV ads that say: "No, we are not right-wing nutjobs. There really IS a plague of garbage research."

(In fact, right-wing sponsors should be totally ruled out. As well as clearly left-wing sponsors. We want basically nonpolitical people who are basically trusted, like, again, Bill and/or Melinda Gates, Elon Musk, and so on. So, no Peter Thiel, which is unfortunate, because it would be right up his alley.)

But then you're publishing papers that try to replicate studies and specifically calling out specific frauds and getting their funding taken away, and more broadly arguing for funding agencies and academic departments to stop tolerating the culture of fraud and shoddy work. And you're calling out the worst offenders (in terms of academic departments and funding agencies) and calling on their political sponsors (like Congressmen) to cut their funding if they don't clean it up.

There may be areas of science where the foundation doesn't go, like climate change (maybe race?), because they are too heated. That's OK. It doesn't have to cover everything.

Part of the reason academia is really broken is, people aren't rewarded for proving that results can't be replicated. We need to find a way to incentivize that. The scientific process does not work without it. Also, academia doesn't allow people to just say, "This idea is wrong."

Academia is composed of a million little old-boys-club silos. You don't get into any of these clubs if you are not going to play ball. De facto, you cannot criticize your peers, if you are in academia.

For instance, the NSF has grant-award panels that are composed of researchers in the niche. So if you criticize your peers you literally cannot get NSF funding. Because your funding is literally gated by your "peers" as in "peer review." But if you were gonna be that way, you wouldn't have gotten a tenure-track job in the first place, because tenured professors only want people in the club to become their "peers" and sit on those committees.

I'm concerned I may be kind of overstating my case, but it's hard for me to know exactly how widespread the worst-case version of the problem is. Certainly what I'm describing can and does happen.


"I'm concerned I may be kind of overstating my case, but it's hard for me to know exactly how widespread the worst-case version of the problem is. Certainly what I'm describing can and does happen."

You're absolutely on the ball, except that everything is ~100x worse and more hopeless than you're describing :( Far from overstating it, you're understating it.

"If you have a record showing that a particular PI has published 10 provably false/fraudulent papers, is the NSF really going to keep funding that person?"

Absolutely. The problem is: showing fraud according to whom? I can point to dozens of papers by the same authors that I personally consider provably false/fraudulent, funded by the NSF. That isn't going to achieve anything. People can't even reliably get papers retracted, let alone cut bad researchers' funding.

From https://fantasticanachronism.com/2020/09/11/whats-wrong-with...

"If you look at the NSF's 2019 Performance Highlights, you'll find items such as "Foster a culture of inclusion through change management efforts" (Status: "Achieved") and "Inform applicants whether their proposals have been declined or recommended for funding in a timely manner" (Status: "Not Achieved") ... We're talking about an organization with an 8 billion dollar budget that is responsible for a huge part of social science funding, and they can't manage to inform people that their grant was declined! These are the people we must depend on to fix everything."

"with funding by people like Bill Gates"

Bill Gates is one of the primary funding sources for epidemiology, which is more or less dominated by pseudo-science and fake research. The article we're discussing was published in a journal of epidemiology so presumably the researchers polled for data were mostly epidemiologists, and only a few percent of researchers were honest about basic things like whether they'd share their data or not. You can imagine how bad the rest of it gets.

Gates doesn't care. Like all the other sources of funding for public sector research, he wants to be a philanthropist cheerleader for Science™. The act of spending money is the end, not the means. Whether that money is spent well and yields accurate conclusions is neither here nor there to him; he has too much money for that to be relevant.


> You're absolutely on the ball, except that everything is ~100x worse and more hopeless than you're describing :( Far from overstating it, you're understating it.

I mean, you can't get 100x worse than "60% of research is invalid."

> Absolutely

If you get enough pressure on Congress, you can always make changes. I wish you wouldn't be so belligerent.

Society does change, but slowly. You would probably have said slaves will never be free, or women will never vote, or racial integration of schools will never happen.

> Gates doesn't care. Like all the other sources of funding for public sector research, he wants to be a philanthropist cheerleader for Science™.

You don't know that. Unless you've read his secret diary, or something. I suspect he does care. And I think the evidence weighs more heavily on my side of the argument. But you aren't debating the evidence; you're just claiming to know, which you obviously don't.


I'm not saying fixes will never happen - that's a strawman - just pointing out the immense scale of the challenge if you try to do it via moral suasion. It originates (IMO) in the corruption created when the funders of work don't actually care about the results, just about being seen to fund it, i.e. governments and foundations. As such there are very few pressure points that can work, because the system is so well insulated. Congress did make a small effort on this in the past (more than most countries did), back in the 70s. That's the origin of the OSI. However, the effort flamed out shortly after, and the OSI has been largely defunct since then, though it still exists and consumes budget.

To make progress here you probably need political parties that take up academic reform (or defunding) as a voter wedge issue and campaign on it for many years. Given the failure of the initiatives that have tried to solve these issues so far, I'm very skeptical of any effort that isn't based on some sort of populism. Realistically this means any attempt to dramatically improve standards is going to get tangled up in the culture war, because, after all, you can't divorce an attempt to raise scientific standards from specific fields and claims (if you do, you're admitting they should be ignored). To get political momentum you need to be able to point to examples of claims that are wrong, and they have to be claims people actually care about.

Re: 60%, that's your personal estimate, right? My impression is that it varies a lot by field. In some fields you don't really get much invalid research at all, it's a curiosity. In other fields 100% of papers are useless because the underlying premises of the field are themselves wrong.

"You don't know that. Unless you've read his secret diary, or something."

I haven't read his secret diary, no. But I have talked to someone who worked in epidemiology on malaria research, who described to me how the field is totally distorted by Gates Foundation funding. I've also read many of his statements throughout COVID, and a skeptical review of his non-secret book (if you can get past the invective at the start, the review is pretty decent):

https://www.eugyppius.com/p/we-must-find-a-way-to-prevent-bi...

... which reinforced the overall impression: Gates is a cheerleader. His approach is to find the hierarchically most important people, ask them what they think, and then repeat it uncritically, whilst distributing grants to more or less anyone who says they'll make his personal goals come true.

It's also the case that if Gates were reading the outputs of the researchers he funds, and is as smart as usually claimed, he would long ago have noticed the problems. Yet his book boils down to: what we need next time is way more of all that. He is aware lots of people think Ferguson is a fraud, but thinks that's only because they were misinformed by the press. As far as Gates is concerned, Ferguson is great, and he repeats Ferguson's defences of his own work verbatim, even though they aren't accurate. As the reviewer points out, it's impossible to believe, given what's written in his book, that he ever actually read Ferguson's research (I have read it, and his model code, very carefully). Gates's one concession is that the vaccines weren't the silver bullets they were promised to be, but as for everything else - well, he acts as if he's completely unaware that any problems exist.

Now, as you observe, this might be an act. Nobody wants to spend decades lavishly funding people to pursue a noble mission and then one day admit: actually, they were mostly scamming me, and we didn't get much out of it. The loss of face would be impossible to handle. Even if Gates did know, we might expect him to act as if he didn't. Still, we'd hope he'd find subtle ways to improve things without outright admitting to the problem in plain language. I've never seen any evidence of this.

At least with Gates there's the theoretical possibility that he could have a Damascene conversion and start enforcing rigorous standards. With governments it really does need to become a political issue before anything can happen: every time people try to improve standards via government, or purely internally inside academia, the new rules seem to be immediately subverted and everyone carries on as before.


I appreciate your thoughts.

I agree that pushing for cultural/political change is a hard way to go given the populism in America. I'm an American. On a personal level, I think the country is so broken that it's not worth fixing, and would like to expatriate.

> Re: 60%, that's your personal estimate, right? My impression is that it varies a lot by field. In some fields you don't really get much invalid research at all, it's a curiosity. In other fields 100% of papers are useless because the underlying premises of the field are themselves wrong.

Yes, it was my estimate. But it's just a number I threw out there. I don't disagree with what you're saying here, but I'm more focused on the structural problems that exist to some (varying) degree in every field. For example, in every field, funding is corrupted. So I wouldn't say there are some fields that don't get much invalid research at all. Outright fraud is not the only kind of "invalid" research. It's still "invalid" if it's honest research but exploring the wrong path because there's some old fart at the funding agency who has a lot of friends who are exploring that particular path, which actually has been played out for 10 years already.

> Blog post about Bill Gates

I appreciate your sharing this. I think it's easy to construct this sort of narrative. That doesn't mean it's true. It might be. I don't know enough to tell.

I really think someone like Bill Gates could see the kind of research validation institute I'm proposing as a "win." I'm not saying all the research he's funding is worthless; I don't believe that, personally. I'm just saying we can boost the quality and effectiveness of research if we have this kind of institute. Maybe our research process as a civilization gets 2x better. To use military terminology (ugh), it's a "force multiplier." It's not saying all our existing research is garbage. It's picking off the low-hanging fruit, the worst offenders, and thereby improving the signal-to-noise ratio of research overall. And once you've got all the worst offenders, you can move on to the next-lowest-hanging fruit.

I bet Gates would be happy to admit a lot of the research he funds is "low quality." That isn't a personal indictment against him. He'd probably argue that a lot is "medium quality" and a lot is "high quality." I wouldn't disagree. I'm sure there is some proportion in all three buckets.


You're welcome, thank you for the discussion.

As someone who lives outside of America, I really wouldn't be so down on it as a country. I'm lucky to live in a very nice part of the world (Switzerland!) but even so, America is a country and culture I hugely admire. I've spent most of my career working for American firms because that's where the action is, that's where the bravest people tackle the hardest problems. Yes, America is a land of extremes and the lows can be low, but the compensation is that the highs are really high.

"For example, in every field, funding is corrupted. So I wouldn't say there are some fields that don't get much invalid research at all. Outright fraud is not the only kind of "invalid" research."

Ah, I see. Well, I'd say that the funding mechanism makes invalid research possible/easy but doesn't necessarily directly create it. The lights are out but someone still has to misbehave. In some fields there just isn't much incentive to do that, e.g. consider computer graphics or the papers that explore more efficient K/V stores or compilers. You could try and cheat in those papers but why bother? You'd just be undermining any possible future jobs in industry, so I find these papers to be pretty reasonable.

The corruption really seems to kick off in fields that can be twisted into forms of social control in some way, or where there's not much chance of ever getting a good job in the private sector. The social sciences are a great example, but public health has the same problem. People see an opportunity to change the world by misrepresenting their science, they see that nobody will stop them, and so they take it. Power is the goal and the apathetic funders are the enablers. You can't change the world or control anyone via compiler research, though, so it stays closer to the original ideals.

Now, I agree that if you widen the problem scope to include irrelevant research nobody cares about, then indeed every field has big problems with that. It's probably too much to ask people to care about both invalid and irrelevant research at once, though.

"I bet Gates would be happy to admit a lot of the research he funds is "low quality." That isn't a personal indictment against him. He'd probably argue that a lot is "medium quality" and a lot is "high quality." I wouldn't disagree. I'm sure there is some proportion in all three buckets."

Well if you or anyone else can get him to admit that, it'd be a great start.


> You could try and cheat in those papers but why bother? You'd just be undermining any possible future jobs in industry, so I find these papers to be pretty reasonable.

I was a grad student in CS for 8 years, but left without finishing the PhD. I wasn't in compilers, but that's a reasonable example because I was in some other "niche of a niche." Small community, fairly obscure.

Most of the grad students in my niche really just wanted to be professors. And the way to become a professor was to (a) publish a very high quantity of papers; and (b) make friends with all the senior people in our little niche.

The goal people had wasn't to amass power. It was just to get a tenure-track job. Somebody else in this comment section made a joke about Chinese students. The problem is not at all limited to them. But a tenure-track job at an American university really seems like heaven to someone who's made their way up from the bottom in China, for example. And also to some people from other parts of the world, including America. If publishing tons of low-quality papers is the path to that, and being buddy-buddy with other people, they go for it. It's a "I'll scratch your back, you scratch mine" environment, including regarding "peer" review.

There is a tradeoff between quality and quantity. I was doing empirical research and couldn't compete with the people doing more purely mathematical/algorithmic stuff. They could just spit out papers with some new obscure algorithm (that will never be used anywhere) and a proof of some of its properties. I would have been lucky to have 5 published papers at the end of my grad school career, but a good tenure-track candidate would have like 30.

The senior people, like my thesis adviser, more than enable this kind of behavior. They get grant money basically based on the quantity of papers published. I never understood why my adviser cared so much about grant money. Like, what drove him to put out a super-high quantity of crappy papers, to have a ton of students, and to get a lot of grant money? What's the point? I never understood it. I mean, he already had tenure. And there were lots of tenured profs I knew who just didn't care about grant money and publication count, and didn't do that. Which is great. But you know who all the grant money goes to, and who ends up with a huge "lab" full of students: the ones like my professor, who really care about that sort of thing and optimize for it.

I think compilers as a field produces much higher-quality work than my niche-of-a-niche (which I don't want to name, by the way). But I don't think an area like that is immune to the pressures I'm talking about. It's probably a bigger community (which helps), where the research just has a different dynamic. We could speculate about why that might be. But I would nonetheless argue that every area of modern science suffers from the problem I'm describing, to an extent that varies between countries, fields, and sub-sub-fields.

By the way, maybe people go into compilers as PhD students wanting to go to industry... but the truth is, in computer science, getting a PhD usually does not advance your career more than putting those same years directly into industry. If your planned route is PhD => industry, it only makes sense if you want an industry research position, and if you care about that more than how much money you make.


Perhaps you'd still have to rotate your own people, because of regulatory capture.


You'd probably suffer a fatal accident with Chinese characteristics very soon.


That made me laugh.

Much of the fraudulent work isn't coming from Chinese researchers, but much of it is.


I left academia for industry a few years ago and still get hassled every week or so for help with open source code I published in conjunction with one of my papers.

My other papers, where I provided no artifacts, generate an average of zero support emails a year but are cited just as much.

The emails are often from first-year PhD students and are vaguely accusatory, along the lines of: "We're having issues replicating experiment X and wanted to give you a chance to correct/review before we publish on it." The cause of the issues is, 100% of the time to date, people not reading the documentation or having only a superficial understanding of the topic. However, I feel compelled to respond so that my old lab doesn't get dragged on Twitter or at some conference over something that isn't true.

I barely got paid to do this sort of work when I was in academia and certainly don't now. If I could go back in time, I would have kept my research closed source. The personal benefits are essentially zero and the world would get along just fine if I had never written any papers, so the altruistic reasons aren't terribly compelling either.


If it's really "every week or so", for years, I'd be tempted to compile data on the support e-mails and their senders. At some point, you've got a meta-paper on the people who contacted you for support: breakdowns of where they are in their careers, what topics they don't seem to have mastered yet, which parts of the docs they failed to read, etc.

The fun could start when you let drafts of that meta-paper start circulating.


The cost-benefit ratio of open-sourcing data has been known for long enough now; it's time to stop excusing this practice and start treating studies with unavailable datasets as hearsay.

When scientists don't publish datasets, it contributes to a culture that allows a small (or not so small, depending on who you talk to) portion of scientists to p-hack their way to a publication and ruin the reputation of research for everyone.


Tangential to the topic, but I have been reading a Newton biography (Never at Rest) recently, and I found it interesting how reliant Newton was on the observations of other scientists. He corresponded with scientists like Halley and Flamsteed to get their observations of comets and of the moon, to help him work out orbits.

At one point, Newton was unhappy with Flamsteed, because Flamsteed wanted to work on his star map instead of Newton's moon observations, but Flamsteed said he needed an accurate star map to produce accurate moon observations.

Data sharing has been an issue in science for hundreds of years.


Some of the research is fake as well. I tried to email an author who claimed some extraordinary results for their sorting algorithm and got crickets.

I suspect they don't really care beyond having it "published".


Why do researchers add these statements in the first place if they are pretty broadly not complying with them? Does it somehow help them get published just to have the statement there?


Journals are increasingly requesting that authors provide mandatory data availability statements. That doesn't necessarily mean the data has to be open, just that the authors have to declare whether it is or not. The trouble is that it's very difficult to verify a statement like "We intend to release our code and data as open source at <repository>", unless you hard-enforce that things are provided at publication time (and then you're assuming the reviewers will actually check, compile, etc.)
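
For illustration, "hard enforcement" could be as simple as the submission system refusing to proceed until the URL named in the data availability statement actually resolves. A toy sketch in Python; the manuscript record and its field names are hypothetical:

    # Toy sketch: block submission if the repository named in the
    # data availability statement (DAS) doesn't resolve.
    # The manuscript dict and its field names are made up.
    import urllib.request

    def das_url_resolves(url, timeout=10):
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except Exception:
            return False

    manuscript = {"das_repository": "https://example.org/my-dataset"}
    if not das_url_resolves(manuscript["das_repository"]):
        raise SystemExit("DAS repository unreachable; fix before submitting.")

Of course, a resolving URL doesn't prove the data behind it is complete or usable, so this would only catch the laziest non-compliance.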


Many researchers are just cranking out papers to meet some metric, and a Data Availability Statement (DAS) is just a bit of boilerplate to make your paper seem more plausible.

I think this paper shows that it is just more research theater, meant to distract.


Some journals make these statements mandatory; for others, not promising to share your data can be a significant obstacle to publication (e.g. reviewers may ask why the data is not being made available).


"Virtue-signalling" is an overused term, but in this case it could quite literally apply?


The data availability statement in this article undermines its entire premise.

> We did not publish our raw data along with the manuscript because it could be understood that we are publicly shaming authors who did not want to share their data. As for the raw data that were received during this study, we informed our study participants that those raw data will be deleted after being examined and that all data and communication will be treated with strict confidence.


There's a big difference between explicitly declining to share data for justified ethical reasons and saying you will share the data and then failing to follow through.

There is absolutely no contradiction here.


The concern of “shaming” is not a justified ethical reason.


I get that you don't want to share the data. Collecting the data is most of the work, people can steal it, God forbid people might find out you're wrong, etc. etc.

But maybe, just maybe, you shouldn't say that you are going to share the data if you are not. How trustworthy is the rest of the article if your Data Availability Statement (DAS) is a lie?


Some do share it, but even then they manage to use their own proprietary software to do the analysis. It boggles my mind how such people can receive grants and publish in prestigious journals. Corruption, plain and simple.


This is simple economics.

If the benefits of non-compliance exceed the consequences, don't comply; otherwise, comply.

It's that simple, and it applies to tech companies as well.

At the end of the day, who's policing this stuff anyway? People only find out when there's a leak.
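
To spell out the implied calculation: the "consequences" side is really the chance of getting caught times the penalty, and with almost nobody policing, that chance is near zero. A toy sketch in Python, with all numbers made up for illustration:

    # Toy model of the compliance decision described above:
    # comply only when the expected cost of non-compliance
    # (chance of getting caught times penalty) exceeds its benefit.
    def should_comply(benefit_of_noncompliance, p_caught, penalty):
        return p_caught * penalty > benefit_of_noncompliance

    # Almost nobody polices this, so p_caught is tiny:
    print(should_comply(benefit_of_noncompliance=1.0,
                        p_caught=0.01, penalty=10.0))  # False: don't comply

Which is exactly why tightening the DAS rules without anyone checking them changes nothing.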


An author threatened to go to the editor, and then my PI became compliant.


Does this paper come with its own Data Availability Statement?


Is there irony in asking for other researchers' data (presumably free of charge) and then publishing the results of that effort behind a paywall?


[flagged]


They've explicitly justified not sharing it, for good ethical reasons.


The reasons aren't good. If the DAS posted elsewhere in this thread is accurate, then they don't want to be seen as "shaming" people, but the behaviour they detected is simple promise-breaking. The researchers stated they'd share their data, benefited from the aura of respectability that claim created, and later refused. And this behaviour was absolutely widespread: more than 90% of the papers selected didn't provide data as promised.

That's the sort of behaviour that actually should be shamed, right? Why shouldn't people know whose word can be trusted and whose can't? It's a very silly form of ethics that defends the honor of those who deliberately lie for profit. It also totally devalues the efforts and integrity of the tiny minority who did do things by the book.

The actual reason they refused to share their own data almost certainly has nothing to do with ethics. It's because they don't want to make 3500+ enemies within their field.


>The actual reason they refused to share their own data almost certainly has nothing to do with ethics. It's because they don't want to make 3500+ enemies within their field.

This is exactly why peer review is nonsense.


Maybe someone should look them up in popular search engines using simple OSINT techniques and see what else they were reckless about.

For many years, scientific computing facilities were "off limits" to honorable hackers, but I've noticed a persistent air of entitlement from the biological sciences, paired with a deep disrespect for the field of information science.

I have a bachelor's in information science with concentrations in information security and related areas in psychology and English literature from one of the highest ranked schools of information science in the world[0], followed by a PhD I left "ABD" after numerous peer-reviewed publications and scores of conference talks and other informal lectures and volunteer work.

I once saw an entire zoo aquarium's exhibits fog up in my hometown as it switched to brownouts, as a conference's attendees went on to obstruct policy initiatives that would prevent the entire state they claim to want to be permanent residents of from one day losing power in a similar manner to what happened in Texas.[1]

It's why I started one of my pet phrases when I'd do volunteer work:

Voting is a right, whereas electricity, like driving, is a privilege.

Anyways, sorry for the stern response, but I hate being the guy who just goes "Oh, those results are unsurprising", but some nights I feel like I really need to mansplain what can happen if folks ignore my "crazy" left-libertarian advice.

[1] After a decade of warnings, a weak mandate to "winterize" https://www.npr.org/2021/06/02/1002277720/texas-lawmakers-pa...


I don't understand the connection between the zoo's aquarium "fogging up" in your home town because of brownouts and the unwillingness of scientists to share data.

I don't understand the connection between scientific computing facilities being unwilling to host "honorable hackers" in "information science" to use their computational power and data being closed.

I don't understand what your "stern response" is to.

I don't understand why you used the word "mansplain" to describe what you're doing when it has nothing to do with identity politics.

I don't know what your "highest ranked school" is. You added it as a footnote but it doesn't seem to appear later in the comment.


It's creative nonfiction.

I'm not required to explain my art to you.



