One point: most bio-type people have not had multivariable calculus, and many have not had calculus at all. So it's not really that they can't process it fast enough; it's that their techniques for processing it are rote. They figured out how to do that t-test (or something else) once, and they stick with it, because they really don't know the math behind it.
Also, though there is a TON of 'data' coming in, most of it is not useful. For example, I have a 500 GB stack of .tiff images per fish that I have imaged in a confocal microscope. I have a GFP filter on the scope, and therefore only the green channel of the .tiff files carries signal; the red and blue are just background noise. Also, most of each image is the dish I have the fish in. I tickle the fish, they flick their tails, and I see this all at 120 fps. Then I measure how much of an angle the fish flicked their tails, all in 3-D, because that's what the scope records in. I have half a TB per fish to comb through, and I have ~20 fish, so ~10 TB.

At the end, I get a single graph comparing the fish with some gene to those without it, and I have 10 TB of 'data' left over. Yeah, someone could comb through it all and find something else to look at. But I forgot to record the precise temperatures, the orientation of the fish, which fish I knew later died, etc. I had all that in my head. And, hey, what do you know, the p-value is ~.45, so there is no 'real' difference between the fish and we can't include this in a paper. Now all that 'data' is kept on a drive on some computer somewhere and counted against the lab's budget for shared storage. It's not really 'data' anymore, in the sense of being useful for advancing anyone's knowledge (it counts as practice, I guess), but it still clogs up space.
It seems that you may have been able to extract specific information about things like "how much of an angle the fish made their tails flick" as opposed to storing the raw files. Such a technique would've definitely eliminated extra irrelevant information like "the dish [you] have the fishes in" as well as cut back on the size of the data.
True! That info is what the paper you are trying to write is all about. You take these TBs of data, and effectively compress it into a single graph of maybe 5kB, the thing you really care about.
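To make that concrete, the reduction being described might look roughly like the toy sketch below. This is not the actual analysis pipeline; it assumes the tifffile and numpy libraries, a 3-D stack of green-channel frames, and the threshold and crude "angle" estimate are placeholders for whatever the real metric is.

    import numpy as np
    import tifffile

    def tail_flick_angle(stack_path, threshold=0.2):
        """Reduce a whole image stack to one scalar: the spread of a crude
        per-frame estimate of the tail's orientation."""
        angles = []
        stack = tifffile.imread(stack_path)          # assumed shape: (frames, y, x)
        for frame in stack:
            green = frame.astype(float)
            if green.max() == 0:
                continue
            green /= green.max()
            ys, xs = np.nonzero(green > threshold)   # bright (GFP) pixels
            if len(xs) < 2:
                continue
            cov = np.cov(np.vstack([xs, ys]))        # 2x2 covariance of pixel coords
            evals, evecs = np.linalg.eigh(cov)
            major = evecs[:, np.argmax(evals)]       # principal axis ~ tail direction
            angles.append(np.degrees(np.arctan2(major[1], major[0])))
        return (max(angles) - min(angles)) if angles else 0.0

Once a number like this is computed and validated, it is a few bytes per fish instead of half a terabyte.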
However, you still have to keep all the data around for possible re-analysis, depending on the terms of the journal you submitted to, for up to 10 years or more in some cases, along with all the equipment and maybe some frozen fish samples in LN2 down in the basement. Now, not all labs are so well funded, sometimes accidents happen, and grad students are not always informed about these rules, but you should be keeping it all ready for re-testing for some number of years. You simply cannot get rid of it all, for the sake of science.
If you really want to help out bio-peepz, then helping them program is a great way to do so. Spaghetti does not even come close to describing their 'code'. If anyone really wanted to reproduce the experiment, they would have to wade into the fetid swamp of 'code' that produced it. Most bio people cannot even begin to tell you how to code or what it even means. They can PCR and Western Blot better than Jesus himself, but code? No way. It is a real hindrance to the sciences, actually. The helping hand that code was meant to be has become a ball-and-chain that drags people's minds away from the problem as they 'let the computer do it' for them.
Wow, I failed to consider the need for validation... but even still, that adds only another layer of testing to ensure that your techniques for converting the raw data to structured data are sound enough to be reproduced.
I'd guess the model still wouldn't be considered "pure" enough by the scientific community unless it can be proven to be free from biased samples of the raw information.
> but even still, that adds only another layer of testing to ensure that your techniques for converting the raw data to structured data are sound enough to be reproduced.
That's actually not correct.
What if your validation method is broken? What if that unit test you wrote has a bug in the test itself?
In science you have to keep the raw data available.
This is not only for reproduction of your particular structured metric (i.e. tail flick angle), but for other groups to design novel metrics that may add to or trivialize the published result.
>In science you have to keep the raw data available.
There is nothing I'm aware of saying that you can't losslessly convert tiff to png or even tiff to tiff.gz. Nobody needs to collect pcap files for their sensors (the raw data). They just collect the data and stick it in an appropriate file.
Most bio-peeps and medical personnel have an understanding that there is a difference between .png, .jpg, and .tif. But the difference between .tif and .tiff? No idea, let alone the pros and cons of .png versus .jpg. It's all just a picture to them.
So, trying to tell them that you can losslessly convert them, though maybe true, is not a good idea. .tiff can handle a stack of images and you only have to call it once in a MatLab script; .jpg cannot, and will result in a shitfit as they try to load in an entire folder of images. Then you get into bit depth, and God help you if the conversation ever has you trying to pronounce 'int', 'float', 'str', or 'double'. The class of a variable? Dude, this person couldn't care less and will get out a protractor and measure tail flicks off a print-out before any of that will ever sink in.
I think you underestimate people. If you're an admin doing user support for such a system and someone says they need more storage because of blah blah blah then you offer ways to alter the workflow so they don't need 500G of tiff files. It might even speed up their workflow since it's not streaming 500G off disk.
No need to get into bit depth and int, float, str, and double (though all of the biomedical researchers I've worked with knew these very well). Just: 'you can convert to png and back again without losing data. Just like zipping a file and unzipping it again. Let me show you how...'
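For instance, a minimal sketch of that demo, assuming Pillow and numpy are available and the frames are 8- or 16-bit images Pillow can represent (the file names are made up):

    from PIL import Image, ImageSequence
    import numpy as np

    def tiff_stack_to_pngs(tiff_path, out_prefix):
        """Save each frame of a multi-page TIFF as a PNG and verify the round trip."""
        with Image.open(tiff_path) as stack:
            for i, frame in enumerate(ImageSequence.Iterator(stack)):
                original = np.array(frame)
                out_path = f"{out_prefix}_{i:04d}.png"
                frame.save(out_path)                       # PNG compression is lossless
                restored = np.array(Image.open(out_path))
                assert np.array_equal(original, restored), f"frame {i} changed"

    # hypothetical usage on one stack:
    # tiff_stack_to_pngs("fish_01.tiff", "fish_01_frame")

If every assert passes, the original stack and the converted files carry exactly the same pixels, which is the whole pitch.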
Totally on point, regarding multivariate relationships! The problem is not that they do T-test (and other Stat-101-jargon blackbox stuff), but that they stop at it. To many of them, even the existence of multivariate effects is beyond their imagination.
So, inference to many biomedical folks is just 1-dimensional. Big data has a long way to go to penetrate fields where people cannot think in more than 1 dimension!
The best way to get published is to use a variation on a method that was used many times before. Since the novelty you're focused on is something biological, you stick with the same statistical methods that have gotten published for the last decade.
Unfortunately, there just doesn't seem to be much tenure-juice from innovating in statistical methods for most life-science fields. Not all, of course. Science moves slowly.
Eh, look at the journals for the reasoning there. Many editors and reviewers don't have the time to untangle multi-variate effects and make certain the math was right. Maybe that was because they can't do it themselves, but usually if it is that complicated and interconnected and you come out with a 'clear' picture, odds are that it was an error. I'm NOT saying that is in fact true at all. However, most editors get so very many submissions every day that they take the easy to understand papers with clear results and pathways over the harder to understand ones that are more likely to have errors in them as they are more complicated. This feeds backwards into the grad schools and then only simple pathways and mechanisms are encouraged and nurtured, while complex ones are left alone in the publish-or-perish environment. And yes, this is BS, but that is how grants get funded.
I touch on this in another comment, but mostly because bio people do not understand math and therefore programming. Machine learning is light-years beyond them as a result. I'll give an example:
I was at an anesthesiology conference. Dr. Emery Brown is at Harvard and Mass. Gen. and has been elected to all three National Academies (Engineering, Medicine, and Sciences; only 19 other people hold that record). Suffice it to say, the guy is Smart with a capital S. He was talking about his new auto-anesthesia machine that records EEG on the head and then modulates the dosage of the anesthesia drugs to maintain or change the depth of anesthesia. It keeps you 'knocked out' better than any human can and will probably bring you 'back up' a lot better too (more testing is needed). Very basically, when you are knocked out for surgery, your entire brain rhythmically fires at ~11 Hz. As you wake up, that rhythm decreases and goes away; the deeper you are, the more your brain increases that rhythm. You measure that with the EEG and filter out all the rest. So, to keep you knocked out, you increase the dosage when you see the rhythm slowing and decrease it when you see the rhythm speeding up.
Anyone that has taken differential equations and the barest EE knows that the way to control that is with an Op-Amp circuit (https://upload.wikimedia.org/wikipedia/commons/f/fb/Voltage_...). It is incredibly straightforward if you were even barely awake in those classes. You don't need something as complicated as Machine Learning to do it, you just use feedback.
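In code terms, the whole loop is just negative feedback. A toy sketch (the target frequency and gain are illustrative numbers from the description above, nothing clinical):

    TARGET_HZ = 11.0      # rough "knocked out" rhythm from the talk
    GAIN = 0.05           # arbitrary proportional gain for the sketch

    def update_infusion(current_rate, measured_hz):
        """Raise the dose when the rhythm slows (patient lightening),
        lower it when the rhythm speeds up, i.e. plain proportional feedback."""
        error = TARGET_HZ - measured_hz
        return max(current_rate + GAIN * error, 0.0)   # infusion can't go negative

That is the same thing the op-amp does in analog form: amplify the error between a setpoint and a measurement and feed it back.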
Now, when the Q&A session started with Dr. Brown, it was a mad-house. The anesthesiologists and neuroscientists were just dumbfounded that a simple circuit like that could control the machine with total clarity and reliability. No explaining by a guy with that level of gravitas and a credential list longer than your arm could convince them that it would work. The phrase they kept coming up with was 'I'm not a math person'.
Ok, got it? These folks are brilliant in bio and neuro and surgery. Just flabbergastingly good. They can feel how much drug is needed for a baby that just got their arm ripped off in a car accident. But they will never understand the math and they do not trust it at all as a consequence.
So trying to say that Machine Learning is the panacea for this flood of data is like telling Eskimos that to heat their houses they simply need to invent a nuclear reactor. It's never going to happen.
>I touch on this in another comment, but mostly because bio people do not understand math and therefore programming.
This is extremely condescending.
I know many brilliant biologists who are also great mathematicians, statisticians, and programmers.
Anesthesiology isn't really a bio field at all. It's a medical field. A field that doesn't do a lot of cutting edge research. What you describe is engineering, and is likely headed by competent engineers, and funded by said Anesthesiologist.
>But they will never understand the math and they do not trust it at all as a consequence.
Again. Extremely condescending.
The people who researched, designed, and built the machine you describe are likely extremely competent in math. Yet, you are describing the end-user, and from that inferring the designer of the machine to be of the same expertise as the user.
Balgair was responding to me. I'm a pathologist with an undergrad in physics. Balgair's assumptions are well founded. It's like trying to convince horsemen to buy this new carriage with an Otto-cycle engine. Maybe one of those horsemen will have an engineering degree. But it will be decades before all the horses are off the road.
In the ML world, yes. But the biologists know how to do the experiments, most of which is raising fish and looking at them under a confocal microscope. Give them an effectively infinite supply of confocal microscopists and they could raise a lot of fish.
I'm a biologist by training. Eventually my research hit a data wall (my simulations produced too much data for my storage and processing system). I had read a paper on GFS and Mapreduce and Bigtable from Google, and decided to go work there. I got hired onto an SRE (production ops) team and spent my 20% time learning how production data processing works at scale.
After a few years I understood stuff better and moved my pipelines to MapReduce. And I built a bigger simulator (Exacycle). It was easy to process 100+ TB datasets in an hour or so. It wasn't a lot of work, really. We converted external data formats to protobufs and stored them in various container files. Then we ported the code that computed various parameters from the simulation to MapReduce.
I took this knowledge, looked at the market, and heard "storing genomic data is hard". After some research, I found that storing genomic data isn't hard at all. People spend all their time complaining about storage and performance, but when you look, they're using tin-can telephones and wind-up toy cars. This is because most scientists specialize in their science, not in data processing. So, based on this, I built a product called 'Google Cloud Genomics' which stores petabytes of data (some public, some private for customers). Our customers love it: they do all their processing in Google Cloud, with fast access to petabytes of data. We've turned something that required them to hire expensive sysadmins and data scientists into something their regular scientists can just use (for example, from BigQuery or Python).
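To give a flavor of what "just use it from Python" looks like, here is a minimal sketch assuming the google-cloud-bigquery client library; the dataset and table name below are placeholders, not an actual Cloud Genomics table:

    from google.cloud import bigquery

    client = bigquery.Client()      # uses the project's default credentials
    query = """
        SELECT reference_name, COUNT(*) AS n_variants
        FROM `bigquery-public-data.some_genomics_dataset.variants`  -- placeholder
        GROUP BY reference_name
        ORDER BY n_variants DESC
    """
    for row in client.query(query).result():
        print(row.reference_name, row.n_variants)

No cluster to stand up, no sysadmin to hire: the scan over however many terabytes sit behind that table happens on the service side.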
One of the things that really irked me about genomic data is that for several years people were predicting exponential growth of sequencing and similar rates of storage needs. They made ludicrous projections and complained that not enough hard drives were being made to store their forthcoming volumes. Oh, and the storage cost too much, too. Well, the reality is that genomic data doesn't have enough value to archive it for long periods (sorry, folks, for those that believe it: your BAM files don't have enough value for you to pay the incredibly low rates storage providers charge!). Also, we can just order more hard drives; Seagate produces drives to meet demand, so if there is a real demand signal and money behind it, the drives will be made. Actual genomic data is tiny compared to cat videos.
The real issue is that most researchers don't have the tools or incentives to properly collect, store, and use big data. Until that is fixed, the field will continue in a crisis.
Question from ignorance: how do you get "petabytes of data" into the Google Cloud in a reasonable time? I find copying a mere few TB can take days and that's on a local network not over the internet.
I don't work in this specific field, but did previously, during the first decade of this century, in broadcast video distribution.
At the time, UDP based tools such as Aspera[1], Signiant[2] and FileCatalyst[3] were all the rage for punting large amount of data over the public Internet.
Aspera is the current winner in bioinformatics. The European Bioinformatics Institute and the US NCBI are both big users of it, mainly for INSDC (GenBank/ENA/DDBJ) and SRA (Short Read Archive) uploads.
For UniProt, a smaller dataset, we just use it to clone servers and data from Switzerland to the UK and US at 1 GB/s over the wide-area internet.
Jim Kent wrote a small program, parafetch, basically an FTP client that parallelized downloads. It worked reasonably well, speeding things up maybe 10x. You can get it somewhere on the UCSC web site in his software repository, though it involves compiling the C code.
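The same trick is easy to sketch in Python (this is not Kent's code; it assumes the server honors HTTP Range requests and uses the requests library):

    import concurrent.futures
    import requests

    def fetch_range(url, start, end):
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
        r.raise_for_status()
        return start, r.content

    def parallel_fetch(url, out_path, n_chunks=8):
        # split the file into byte ranges and download them concurrently
        total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
        bounds = [(i * total // n_chunks, (i + 1) * total // n_chunks - 1)
                  for i in range(n_chunks)]
        with open(out_path, "wb") as f, \
             concurrent.futures.ThreadPoolExecutor(max_workers=n_chunks) as pool:
            futures = [pool.submit(fetch_range, url, s, e) for s, e in bounds]
            for fut in concurrent.futures.as_completed(futures):
                start, data = fut.result()
                f.seek(start)      # each chunk is written at its own offset
                f.write(data)

Parallel streams mostly help when a single TCP connection can't fill the pipe over a long, lossy path, which is exactly the long-haul transfer case being discussed.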
Tanenbaum always forgot to include the time spent writing and reading the tapes. Typical 10 TB hard drives (which most people use for data interchange instead of tapes) only have ~100 MB/sec bandwidth (about the same as a 1 Gbit NIC), so just filling or reading one drive takes roughly 10 TB / 100 MB/s, about 28 hours.
I have worked with biologists in the past and tried to show them how to improve their data processing, for example by sticking things in a database rather than clogging the network file system with millions of small files. The majority don't seem to take any interest.
You should have told them they'd publish twice as many papers in high-profile journals if they could improve their data processing. Then show them their competitor's paper where they did just that.
Can you say what the breakdown is between happy "institutional" (read: universities, research institutes, etc) and "industry" (read: private companies) customers? This seems great, except that federal grant funding used to make it really hard to use stuff like that.
It seems like there's good opportunity for skilled data scientists and engineers to make a real difference here. I do think that laypersons (to both medicine and engineering) think that practitioners in medicine and biology have mastered such mundane things like data pipelines, because you have to be so smart to be in medicine/biology, but my limited experience has been more along the lines of what Neil Saunders describes as the inspiration for his coding+bioinformatics blog:
> You may be wondering about the title of this blog.
> Early in my bioinformatics career, I gave a talk to my department. It was fairly basic stuff – how to identify genes in a genome sequence, standalone BLAST, annotation, data munging with Perl and so on. Come question time, a member of the audience raised her hand and said:
> “It strikes me that what you’re doing is rather desperate. Wouldn’t you be better off doing some experiments?”
> It was one of the few times in my life when my jaw literally dropped and swung uselessly on its hinges. Ultimately though, her question did make a great blog title.
edit: To add an anecdote that I believe I read on HN; regarding the topic of the huge datamine of DNA and other health data provided by the U.S. government, a commenter said that the reason it was all on FTP was because professors couldn't download large datasets via their web browser, or some such technical hiccup.
I won't say that putting data on the Web makes it automatically more accessible, but data discovery through FTP requires a bit of scripting skill that I imagine the average biomedical scientist does not have.
I agree--the impact of a great engineer working in healthcare is very high, particularly if you partner with medical experts.
We're a small startup that has partnered with UCSF Cardiology to detect abnormal heart rhythms, and other conditions, using deep learning on Apple Watch heart rate data:
We have about 10B sensor data points so far. If you're a machine learning engineer and interested in working on this type of problem, feel free to email me: brandon@cardiogr.am.
In our case, we're applying deep learning to sensor data, so much of the day-to-day work of a machine learning engineer is experimenting with new neural architectures rather than feature engineering by hand. For example, we're using or interested in techniques like:
* semi-supervised sequence learning (we have a paper in a NIPS workshop next week on applying sequence autoencoders to health data, for example)
* deep generative models
* variational RNNs
From a day-to-day perspective, we use tools like Tensorflow and Keras, similar to most AI research labs. In general, we try to act as a software startup that happens to work in healthcare, rather than as what you might think of as a traditional biotech or medical device startup.
>It seems like there's good opportunity for skilled data scientists and engineers to make a real difference here.
The logistics of dealing with big data is a solved problem in biology. We don't need data fast, and a laptop can run most analyses now, so it's just storage really.
The hard part is having familiarity with all the esoteric metrics used to compare genomes and all the unfamiliar ways your results can be biased.
An off-the-shelf corporate data scientist, while a critical thinker, lacks the domain specific knowledge to be able to ask good questions.
> An off-the-shelf corporate data scientist, while a critical thinker, lacks the domain specific knowledge to be able to ask good questions.
Yeah, I think that's the other side of the coin here (for biomedicine and most other fields). Even if better engineering is what is most greatly needed, it's not just about just being purely good at code and logistics. As a corollary, if you do have the domain knowledge and are in a position to act, you don't have to be an ace programmer to make a huge difference.
>edit: To add an anecdote that I believe I read on HN; regarding the topic of the huge datamine of DNA and other health data provided by the U.S. government, a commenter said that the reason it was all on FTP was because professors couldn't download large datasets via their web browser, or some such technical hiccup.
All data on the sequence read archive (SRA), the main worldwide database for sequencing reads from NCBI, is accessible via FTP and via ascp. FTP is slow, ascp is blazing fast.
Very specifically, with actual knowledge: NCBI insisted on FTP (and still does) because a single prominent professor downloaded a file from web and IE truncated it at 2GB (known bug). Said professor threw a shitfit. This IE bug hasn't been a problem for ~10 years.
Not just that. FTP sites have a tendency not to fail. HTTP downloads are really hard to keep going at 2 GB+ over someone's wifi. Dropped connections are quite normal, and doing the resume work correctly is possible but not trivial in practice.
So for files FTP is much better. Especially if the other side is just a script.
This is just false. What evidence do you have? HTTP was architected for availability (load balancing across many servers). Also, FTP uses two TCP connections per copy, doubling the chance of abrupt termination.
I've moved petabytes over the internet using FTP and HTTP. HTTP wins hands-down.
Practical experience in such an institute providing such data. FTP servers for files just run; HTTP servers tend to do more things than just serving files. This could be separated out, of course, but in _practice_ it isn't. Setting up the HTTP server right and then not touching it is the hard part, not the actual HTTP protocol.
I.e., I do accept that HTTP is a much better protocol than FTP, but social and organisational reasons lead to FTP servers being more stable and dependable in the field for large file downloads than HTTP servers.
So your argument is "HTTP is more flexible, so people misconfigure their servers and that affects availability".
That's a server config issue, nothing to do with either of the systems. If you're setting up a CDN (which is what these genomic servers are), you just configure the servers to serve files, nothing else.
My hundreds of aborted recursive FTP fetches, compared to my almost-never-aborted recursive HTTP ones, show that anything you're seeing about FTP being more stable is just a PEBKAC issue.
Yeah, that happens when you prioritize buying sequencers and building genome centers while pretending that analysts grow on trees.
The post-docs and graduate students who do the heavy lifting on all of these projects don't make a living wage.
They can't raise a family, buy a house, or save for the future. The people in charge made them indentured servants and now those leaders are going to reap the whirlwind.
Yep. Money is strangely spent. Incentives and historical attitudes of the field regarding money are hard to change.
$500k for a new microscope, no problem! It's a fancy, four-channel live microscope. So it takes four ~1 MB images each frame, and you're running a 10-minute experiment taking an image 5 times per second. That's about 12 GB per image set. And you take 10-15 replicates per experiment under 3-5 different conditions. The data for the experiment that undergirds a 6-7 figure grant is now stored on a $100 3 TB USB disk from Best Buy. Oh, and trying to process that 12 GB image set over USB 2.0 using MatLab on a student's personal MacBook Air is horribly inefficient, but there is no other option for the student, really.
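Back-of-the-envelope arithmetic for those numbers (all values taken from the comment above, rounded):

    per_frame_mb = 4 * 1          # four channels, ~1 MB each
    frames       = 5 * 10 * 60    # 5 frames/s for a 10-minute run
    run_gb       = per_frame_mb * frames / 1000
    print(run_gb)                 # ~12 GB per image set
    print(run_gb * 15 * 5)        # ~900 GB per experiment at 15 replicates x 5 conditions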
The students collecting the data and storing it on their local HD or laptop hard drive have no place to archive their data even if they wanted to. There are no repositories capable of generically storing that kind of huge data that needs to be frequently accessed at the price the students/labs are willing to pay (nothing).
And this speaks nothing about the code every student reinvents in MatLab to do basic scientific analysis. Or worse, does not reinvent and instead reuses 20-year old code written by some long-forgotten student who wanted to try their hand at a 'new' programming language like IDL.
The 'students' and postdocs are paid nearly minimum wage to do the high-tech biomedical research. There are no computer-scientists to be seen because they would be fools to give up making 5-8x money across the street at Twitter.
On the other hand, the scientists know their work really well, and it will take a truly integrated team to solve these issues. A computer scientist can't just come over for a day and write up an app to help out. The code will have to make scientific assumptions and must be custom for many/most projects. But it's very hard to build a capable team when the market salary for certain kinds of team members is multiples of another on the same team.
Pretty much what you said. I majored in Biology and took most of the CS track at my school, with the intention of going into bioinformatics. I gave up on that plan, because the money was terrible. To make a reasonable amount of money in bioinformatics you need to have a PhD and be the person in charge of the grant. At that point, you'll still probably be making less money than anybody who is doing CRUD applications for any medium sized organization.
Huge shame, because I love biology. I'd love to work on these projects, but I'm not going to get a PhD just so I can make what I'm making now.
As a tech worker in research, I think my salary was at least 50% higher than many of the "staff scientists" and postdocs. This was a huge problem, as you can imagine.
80% of postdocs I ran into came from either China or Russia. Almost all of them were "disposable" from our point of view. Long hours, low pay, little reward. Despite reaching out to them to build friendships, it was incredibly rare for them to do anything but socialize in their small postdoc circles consisting of people from the same province in China or whatever.
The best part is, from their point of view, they were here to take advantage of our "advanced" research infrastructure until they had enough experience to duplicate it back home. Fair trade, I think.
I was offered a co-op position at the Singapore Genomics Institute this summer, but I turned it down because the pay was truly terrible. I would've been paid 1500 Singapore dollars a month IF I was able to successfully apply for the A-star scholarship; otherwise it would've been only 1000. I was seriously considering taking the offer, because I believe the experience would've been very rewarding, especially since I don't know a lick of biology and was hired for my quantitative toolkit and modelling experience, but I declined when I found out they don't even pay for the plane ticket.
I've heard about this a lot from people I know who work in healthcare. It seems that one could make a successful business simply by hiring a bunch of data scientists to offer analytics and data processing services for healthcare, but what's preventing that? Is there a lack of expertise, funding, too much regulation, or something else?
- Real, bespoke biomedical analysis is not trivial in effort, cost, or time. There are biomedical analysis systems-in-a-box (look at https://galaxyproject.org), but that's just canned analysis. To make real breakthroughs, you need rigorous analysis that requires years of experience to be able to perform.
- It's easier to get the money to collect the data than it is to effectively steward the data you collect. In a past life, I ran a biomedical research computing facility, and everyone got plenty of money for new sequencers, mass specs, and other fancy instruments. They got plenty of money for collecting all kinds of data. No one would ever add money to their grants to actually STORE the data. They would literally put the data on USB hard drives bought from Best Buy, and left them in file cabinets and on desks. There was absolutely nothing I could do about this, and so I quit.
- Research is balkanized to hell. Even though I ran the scientific computing for 20 research labs, each research lab was its own fiefdom. They could decide to obey or disobey my policies at will, since they controlled their own funding. You can imagine what happened when I proposed turning on quotas (~100TB per lab, to start!). Rather than work with my team to determine how to share resources, people would just jump off my high speed facility, buy a shitty cheap JBOD from Dell for their analysis, and store their archives on shitty cheap USB hard drives from Best Buy. The funniest part was that if the hard drive failed, and the data couldn't be restored, in theory the primary investigators could get into real legal trouble. No one seemed to worry.
There are a few biomedical research institutes that "get" scientific data stewardship - Broad, Scripps, but for the most part, biomedical research computing is a total clusterfuck and I couldn't have gotten out of there fast enough for the way saner land of tech companies.
This describes the earth sciences about 5 yrs ago. People (around here at least) are seeing the light. A big driver of the change is the emphasis on data management and archiving that is coming from the NSF. Many research programs have to make data available post-publication or risk cutting off funding. Not sure if this is yet the case in the health sciences.
I agree. A lot of labs do not want to hire a good developer or data scientist. Or they do not have the money to hire one, even though they spend thousands on data collection.
I believe money is definitely a big part of it. If you have the skills needed to help manage and analyze "big" data (big as in too big to realistically handle in Excel, which is the limit of most biologists), you can easily earn much more somewhere else.
Partially. I worked with bioinformatics labs until recently. Career progression is limited as they treat a software engineer as a technician, and nothing more. They don't appreciate the value you bring unless you are publishing papers (certainly in the last two institutes I worked in).
It sort of does already. Not-for-profit organizations like the San Diego Supercomputing Center act as Biomedical-Research-IT-As-A-Service providers, but there's so much competition for grant funding, that if you can get away with doing things cheaper, you will.
The scariest part was that before I left, I built a highly scalable, long-term archive for scientific data built on LTO tapes that would allow ridiculously cheap (basically the cost of LTO tapes) on-line and near-line storage. When I left, no one wanted to bother with paying for the upkeep of the hardware, people got bored with swapping tapes, and it eventually died. Oh well. Your tax dollars at work.
This is such an interesting (and kind of worrying) problem. It's one of those problems where the obvious solution only solves the obvious problem. As you have pointed out, there is a deeper problem hiding underneath, which you only notice if you know the field well enough.
I need to think about this some more to understand why that is and why an incentive structure can't be created.
You gave me a new perspective on things today, thanks for that.
I built in an incentive structure. Each grant includes a certain amount of "overhead" in addition to what the researcher requests, which keeps the lights on, pays for common infrastructure. I negotiated to fund a big portion of ongoing costs for data storage out of the overhead. (Capital costs were mostly paid out of funds from a large settlement, private grants, etc.) We found that when there were no limits to usage, researchers would just duplicate their data over and over with tiny changes, which was incredibly costly.
To solve this problem, I tried giving "monopoly money" to the professors, allowing them to trade data storage and cluster time for favors, analysis, and so on. For the researchers who didn't need as much storage, they could give their excess up. For those who gobbled up storage, they could "buy" the excess. It ended up failing because I didn't have backup from leadership to say "no" when I was asked to do things that were irresponsible:
Yes, you can buy a 2TB HDD for $100. No, that isn't the same as 2TB of storage on an enterprise-level storage array, clustered, with local mirroring, offsite tape backup, etc. No, I won't plug your 2TB USB HDD into my compute cluster.
Thanks again. I was more wondering about the incentive structure for others (i.e., not the professors). Even assuming it's costly, could someone benefit enough that they would have no problem paying that money? Could it be built into the grants structure, etc.?
There are businesses like that already but most (not all) act as consulting services. Beyond regulations, there are three main problems.
1. Each organization has their own data silo and treats it as if it were gold. Therefore it is difficult and expensive to aggregate.
2. Because the data is sensitive, there are often many contractual restrictions beyond HIPAA on how the data can be used. Anything from research purposes only to you can build a commercial product but you can only sell it to the data owner.
3. Even if an organization wanted to share data with reasonable restrictions and pricing, it is often hard to do so because their focus isn't software. So it is difficult or impossible for them to share it.
Source: I made a serious attempt at a healthcare data analytics startup. It didn't work out.
The simple answer is that there aren't many incentives to spend money analyzing the data: On a per-patient basis, the benefits of this sort of analysis are very hit-and-miss and hard to quantify...
In the modern world we promote a "one size fits all" health model that is encouraged by the many third parties involved in the doctor-patient relationship (third parties such as insurance companies, employers, and government regulators). It's going to be very difficult to adapt this model to the patient-tailored healthcare that will be required to fully leverage the recent advances in machine learning.
Unfortunately it's not a simple endeavor. (a) For healthcare data specifically, the data is sensitive. You need to follow HIPAA and usually 21 CFR 11 regulations, and you always face potentially high liability in case of breach. (b) In part for that reason, it becomes expensive. Even for non-healthcare biomedical data (research-only data), most academic labs will not or cannot pay outside firms to do the work, regardless of whether it would be done better or faster.
The HIPAA and other regulations aren't any more annoying than any other modern programming practices these days. As for liability in the case of a breach, that's what E&O insurance is for. It does raise the barrier of entry a little, but it's not by any means prohibitive.
I think the bigger issue is what format this data is in. Most medical and medical records data is in a variety of proprietary, non-open and difficult to integrate technologies. One look at HL7 is usually enough to send a programmer back into the loving embrace of anything else.
Working with this data is time-intensive and expensive, and I'm guessing most healthcare companies don't see it as worth the cost.
> regulations aren't any more annoying than any other modern programming practices these days
As someone in the field, I beg to differ :). And regardless of who pays in the event of a breach, the effects may be severe enough to shut the company out of future projects.
Can you expand on that? Everyone I've spoken with about HIPAA say that most companies don't even bother to comply and that nobody is enforcing it. Then again, these people work for companies that are small enough they've never had a breach.
That's rather scary to hear, and I can't imagine that they manage to secure access to any of the major datasets e.g. as contractors for hospitals or insurance companies.
You can basically self-certify, but most serious companies will bring in an outside contractor on an ongoing basis to certify compliance. Staff needs to be trained, computers need to be managed, software changes have to be very thoroughly reviewed, updates become slow. It makes it pretty unattractive to enter into for a lot of devs.
> Working with this data is time-intensive and expensive, and I'm guessing most healthcare companies don't see it as worth the cost.
It's not worth the cost 99% of the time. The only reason we're still making real breakthroughs is because of the research institutes doing basic research on public funding. Even then, they're swinging for the low hanging fruit.
Researchers are aggregating data. There are all kinds of ways you as a layperson can mash up genomic data, and it's easier than ever. The data that's most accessible are well-defined genomic sequences, not cutting-edge stuff. Much of what 23andMe is doing stems from these publicly funded datasets.
In many cases scientists are required to submit data to journals when they publish. However, 'data' is a funny term! With genomics data, there are many levels of data that you'll encounter: everything from raw image files (many gene sequencers are actually automated digital cameras, taking pictures of fluorescent markers attached to the DNA strands) to intermediate sequence files to formatted pieces of selected data. Then there's the whole toolchain used to go from sample to formatted final analysis. What's required to be submitted? How long should it be kept? Who checks all this to make sure it's not NES ROMs or Shakespeare instead of the right data?
Finally, there's the big question: how can we be sure that the data captured in the intermediate or final steps of analysis actually originates from the raw data? Should scientists store the raw data (TBs and TBs), the intermediates (GBs and GBs), or the final analysis (MBs)? For how long?
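One low-tech, partial answer to the provenance question is to write a checksummed manifest at every processing step, so any derived file can be traced back to the exact raw inputs it came from. A minimal sketch (the manifest format here is an illustration, not an existing community standard):

    import hashlib, json, pathlib

    def sha256(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def write_manifest(raw_files, derived_file, manifest_path="manifest.json"):
        """Record which raw inputs (by hash) a derived file was computed from."""
        manifest = {
            "derived": {"path": str(derived_file), "sha256": sha256(derived_file)},
            "inputs":  [{"path": str(p), "sha256": sha256(p)} for p in raw_files],
        }
        pathlib.Path(manifest_path).write_text(json.dumps(manifest, indent=2))

It doesn't answer how long to keep things, but it at least makes "this figure came from those files" checkable.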
To answer your question about meaningful long term research - I've personally seen grad students' careers effectively ruined due to shitty data storage hygiene.
Try to make sure you aren't just helping people publish papers for the sake of it. As soon as you sense that, stop working with them. It is a very, very bad thing.
In Earth Sciences, Astronomy and presumably quite a few other fields there is a staggering amount of data coming in and the number of people who have the required domain knowledge, math knowledge and programming knowledge is not growing as fast as the data is. Teams can and do help, but well, there is lots of work out there.
This isn't very surprising to me. Data is relatively cheap and the (effective) analytics of it is where the value lies. This is typically the case, isn't it?
That said, this is a good reminder that data scientists are in high demand and can make a difference.
If you're interested in working on an open source project involving big data, machine learning and cancer, check out cognoma.org. It is sponsored by UPenn's greenelab.com
The Harvard Personal Genome Project has over 200 whole genomes and, I think, over 500 genotyping datasets (23andMe and the like), released under a CC0 license with sporadic phenotype data [1]. Open Humans [2] has a bunch of data with a convenient API [3]. OpenSNP has a lot of genotyping data (23andMe etc.) available for download [4].
For a more comprehensive list check out one of the many "Awesome Public Datasets" [5] (biology section).
Incidentally, Arvados (http://arvados.org) is the software used to host the Harvard PGP data, and is a free software platform for managing large scale storage and analysis aimed at scientific workloads.
Watson can only categorize what is already known about genomes given existing research.
What we need is to have people who can ask questions and think critically. Good, hypothesis driven science is how we discover entirely new concepts and mechanisms.
If that much data is being collected, it's time to start asking what we're looking for within it, and whether the rate of collection, retention, and range of inputs is worth it.
Maybe it only makes sense to store sections which deviate in a significant way from a range of error (lossy compression).
Maybe some of those inputs just don't make sense for the questions being asked.
A concrete reasoning for why the data should be kept needs to be presented, and THAT is what should call for the funding to back that need.