
Because, from my understanding: it's hard to preserve things, the code is usually pretty crappy, and the data is often unintelligible without a lot of help from whoever processed it. Obviously preservation of research outputs is a noble cause, but there's very little incentive for anyone to actually document and preserve these things unless it's a requirement of the field (e.g. high-profile AI stuff). The incentive to skip it is that you save time and can jump to the next project, and no one is pestering you about your crappy code and/or data.

Think about all of the fancy dependencies a piece of code might need to run. Think of all of the processing steps some data needed to go through. Think of the work required to set up some computational cluster without which the work just doesn't happen. Doing it once is hard enough; documenting it so anyone else can do it is maybe an order of magnitude harder.
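Even partial documentation helps, though. As a rough sketch (Python, with a hypothetical output file name), you can at least dump the interpreter version, installed packages, and git commit next to each result, so the environment is recorded even if nothing else is:

    # Rough provenance sketch: record what the code actually ran against.
    import json
    import platform
    import subprocess
    import sys
    from importlib import metadata

    def capture_environment(output_path="provenance.json"):  # hypothetical file name
        info = {
            "python": sys.version,
            "platform": platform.platform(),
            # Every installed package and its version.
            "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
        }
        try:
            # Git commit of the analysis code, if it lives in a repo.
            info["git_commit"] = subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True
            ).strip()
        except (subprocess.CalledProcessError, FileNotFoundError):
            info["git_commit"] = None
        with open(output_path, "w") as f:
            json.dump(info, f, indent=2)
        return info

That doesn't make the pipeline rerunnable, but it at least answers the "which versions did this even run with?" question that usually comes first.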




> it's hard to preserve things and the code is usually pretty crappy and the data is often unintelligible without a lot of help from whoever was processing it

Which is a huge part of why the replication crisis is such a thing (besides the outright fraud and the publication bias). The very fact that the datasets and codebases are so disgusting is precisely why the results coming from that data can't be trusted.


> Which is a huge part of why the replication crisis is such a thing

Is it though? From what I’ve seen it is mostly caused by p-hacking/small sample sizes/poor experimental design.


Those are the causes that were picked up on initially, because they're the problems you can detect just by reading a paper without getting into the details. They're not the only causes though, just some of the most visible.

An incomplete list of other causes might include:

• Wrong data or maths. A remarkably large number of papers (e.g. in psychology) contain statistical aggregates that are mathematically impossible given the study design, like means that can't be produced by any allowable combination of inputs. Some contain figures that are mathematically possible but in reality totally implausible [1]. (A minimal sketch of this kind of consistency check appears after this list.)

• Fraud. I've seen estimates that maybe 20% of all published clinical trials haven't actually been done at all. Anyone who tried to figure out the truth about COVID+ivermectin got a taste of this, because a staggering quantity of studies turned out to exhibit disturbing signs of trial fraud. Researchers will happily include obvious Photoshops in their papers, and journals will do their best to ignore reports about it [2].

• Bugs. Code doesn't get peer reviewed, and sometimes it isn't even released. The famous Report 9 Imperial College London COVID model had been in development since 2004 but was riddled with severe bugs like buffer overflows, race conditions, and even a typo in the constants for their hand-rolled PRNG [3]. As a consequence the model produced very different numbers every time you ran it, despite fixed PRNG seeds being passed on the command line. The authors didn't care about this because they'd convinced themselves it wasn't actually a problem (and if you're about to argue with me on this, please don't; if I have to listen to one more academic explaining that scientists don't have to write deterministic "codes" I'll probably puke).

• Pretend data sharing. Twitter bot papers have perfected the art of releasing non-replicable data analysis: they select a bunch of tweets on topics Twitter is likely to ban, label them "misinformation", and then include only the tweet ID, not the content, in their publicly shared data. Anyone attempting to double check their data will discover that almost all the tweets are no longer available, so their classifications can't be disputed. They get to claim the analysis is replicable because they shared their data, even though it isn't.

• Methods that are described too vaguely to ever replicate.

And so on and so forth. There are an unlimited number of ways to make a paper that looks superficially scientific, but doesn't actually tell us something concrete that can survive being double checked.
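To make the "impossible means" point concrete, here is a minimal sketch (Python) of the kind of consistency check behind GRIM/SPRITE-style analyses [1]. This is my simplified illustration, not the published procedure: given a sample size and a mean reported for integer-valued responses (e.g. a Likert item), ask whether any integer total could actually round to that mean.

    # Minimal GRIM-style sketch: can a reported mean of integer data with n
    # observations be produced by any integer total? (Simplified illustration.)
    def grim_consistent(reported_mean, n, decimals=2):
        nearest_total = round(reported_mean * n)
        # Check totals around the nearest candidate, since rounding the
        # reported mean can land one off.
        for total in (nearest_total - 1, nearest_total, nearest_total + 1):
            if total >= 0 and round(total / n, decimals) == round(reported_mean, decimals):
                return True
        return False

    # A mean of 5.19 from 28 integer responses is impossible: no integer
    # total divided by 28 rounds to 5.19 (145/28 gives 5.18, 146/28 gives 5.21).
    print(grim_consistent(5.19, 28))  # False
    print(grim_consistent(5.18, 28))  # True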

[1] e.g. https://hackernoon.com/introducing-sprite-and-the-case-of-th...

[2] https://blog.plan99.net/fake-science-part-i-7e9764571422

[3] https://dailysceptic.org/archive/second-analysis-of-ferguson...


> [...] computational cluster [...]

I would wager that the vast majority of analyses / code, in most fields of research, can run on a basic laptop. If I had to put a number on it, I'd say >99.999% of papers.


Maybe they run in a matter of days, weeks, or months, but people often want the results faster than that, especially if there's a 2nd set of analyses to do or a conference coming up.


There’s no way it is that high. A remarkably high percentage of work across the sciences is “computational” these days, which involves potentially very intensive simulation.


I think you’d be surprised how many are just a half-assed R script, or some disorganized SPSS/Excel or GraphPad/other proprietary software work. :P

Not in terms of proportion of data of course, but in terms of proportion of papers.


Honestly, the stuff I run on a cluster is an order of magnitude easier to share than the stuff that can run on a basic laptop.


Lol, there was one time I churned out a series of magic numbers by repeatedly running several pieces of code in an IPython notebook, measuring (with a significant contribution from eyeballing a graph) and tuning the input parameters by hand each time. The magic numbers worked and were put into production at the LHC (yes, the Large Hadron Collider), but good luck turning that process into reproducible code without writing a lot of new annealing code, which would have at least doubled the amount of effort.

And I was the most organized programmer among my peers by a long shot. Judging from the messiness of shared code, I can’t imagine how bad other people’s unshared work is.
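For what it's worth, the "new annealing code" mentioned above would have looked roughly like this (a sketch in Python with a made-up objective function; the real objective was a human eyeballing a graph after each notebook run, which is exactly what made automating it expensive):

    import math
    import random

    def score(params):
        # Stand-in for the real objective; in practice this was "run the
        # notebook cells, look at the graph, decide if it got better".
        x, y = params
        return (x - 1.3) ** 2 + (y + 0.7) ** 2

    def anneal(initial, steps=10_000, t0=1.0, seed=0):
        # Plain simulated annealing: always accept downhill moves, accept
        # uphill moves with probability exp(-delta / temperature).
        rng = random.Random(seed)  # fixed seed, so the run is reproducible
        current, best = list(initial), list(initial)
        current_cost = best_cost = score(current)
        for step in range(steps):
            temperature = t0 * (1 - step / steps) + 1e-9
            candidate = [p + rng.gauss(0, 0.1) for p in current]
            cost = score(candidate)
            if cost < current_cost or rng.random() < math.exp(-(cost - current_cost) / temperature):
                current, current_cost = candidate, cost
                if cost < best_cost:
                    best, best_cost = list(candidate), cost
        return best, best_cost

    print(anneal([0.0, 0.0]))  # same answer every run, thanks to the fixed seed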


Some of the HPC code that I've used would definitely be "Good luck, and godspeed..." if I gave you the code.

Some of it is just really clean Python code that generates a bunch of simulated data to train some machine learning code on, with human-readable configuration files.

But neither one of them involves the words "$COUNTRY Ministry of Health approval...", which kicks things to a whole new level.


That would be a nice problem to have. In order to fix a problem it first needs to be an actual problem rather than a theoretical one.

Maybe the goal should merely be for the author to be able to replicate the results from the torrent archive (for a fee, of course). That way anyone who needs the conclusion to be correct can buy the validation.



