I think I disagree with most of the comments here stating it’s premature to give the Nobel to AlphaFold.
I’m in biotech academia and it has changed things already. Yes, the protein folding problem isn’t “solved,” but no problem in biology ever is. Comparing to previous bio/chem Nobel winners like Crispr, touch receptors, quantum dots, click chemistry, I do think AlphaFold has already reached a sufficient level of impact.
It also proved that deep learning models are a valid approach to bioinformatics - for all its flaws and shortcomings, AlphaFold predicts arbitrary protein structures in minutes on commodity hardware, whereas previous approaches were, well, this: https://en.wikipedia.org/wiki/Folding@home
A gap between biological research and biological engineering is that, for bioengineering, the size of the potential solution space and the time and resources required to narrow it down are fundamental drivers of the cost of creating products - it turns out that getting a shitty answer quickly and cheaply is worth more than getting the right answer slowly.
AlphaFold and Folding@home attempt to solve related, but essentially different, problems. As I already mentioned here, protein structure prediction is not fully equivalent to protein folding.
Yeah, this is what I mean by "a shitty answer fast" - structure prediction isn't a canonical answer, but it's a good enough approximation for good enough decision-making to make a bunch of stuff viable that wouldn't be otherwise.
I agree with you, though - they're two different answers. I've done a bunch of work in the metagenomics space, and you very quickly get outside the areas where AlphaFold can really help, because nothing you're dealing with is similar enough to already-characterized proteins for the algorithm to have enough to draw on. At that point, an actual solution for protein folding that doesn't require a supercomputer would make a difference.
> this is what I mean by "a shitty answer fast" - structure prediction isn't a canonical answer
A proper protein structural model is an all-atom representation of the macromolecule at its global minimum energy conformation, and the expected end result of the folding process; both are equivalent and thus equally canonical. The “fast” part, i.e., the decrease in computational time, comes mostly from the heuristics used for conformational space exploration. Structure prediction skips most of the folding pathway/energy funnel, but ends up at the same point as a completed folding simulation.
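To make "heuristics used for conformational space exploration" concrete, here's a toy Metropolis sampler on a made-up 1D landscape - purely illustrative, nothing like a real high-dimensional force field:

```python
import math
import random

random.seed(0)

# Toy 1D "energy landscape" standing in for conformational space.
def energy(x):
    return 0.1 * x * x + math.sin(3 * x)  # rugged: many local minima

def metropolis(steps=20000, temperature=1.0):
    x = random.uniform(-10, 10)
    best_x, best_e = x, energy(x)
    for _ in range(steps):
        candidate = x + random.gauss(0, 0.5)  # small local "move"
        delta = energy(candidate) - energy(x)
        # Always accept downhill moves; accept uphill ones with
        # Boltzmann probability, so barriers can still be crossed.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            x = candidate
            if energy(x) < best_e:
                best_x, best_e = x, energy(x)
    return best_x, best_e
```

The sampler skips most of the landscape yet still settles into a deep basin - that's the bet: spend compute finding the low-energy endpoint rather than simulating the whole trajectory that leads there.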
> At that point, an actual solution for protein folding that doesn't require a supercomputer would make a difference.
Or more representative sequences and enough variants by additional metagenomic surveys, for example. Of course, this might not be easily achievable.
> ends up at the same point as a completed folding simulation.
Well, that's the hope, at least.
> Or more representative sequences and enough variants by additional metagenomic surveys, for example. Of course, this might not be easily achievable.
For sure, but for ostensibly profit-generating enterprises, it's pretty much out of the picture.
I think the reason an actual computational solution for folding is interesting is that the existing set of experimentally verified protein structures covers only proteins we could isolate and crystallize. That set is also the training set for AlphaFold, so it's where its predictions are strongest - and even within it, AlphaFold only catches certain conformations of the proteins. So even if you can get a large set of metagenomic surveys and a large sample of protein sequences, the limitations of the methods for experimentally verifying protein structures mean we're restricted to a certain section of the protein landscape. A general-purpose, computationally tractable method for simulating protein folding under various conditions could be a solution for those cases where we can't actually physically "observe" the structure directly.
Most proteins don't fold to their global energy minimum - they fold to a collection of kinetically accessible states. Many proteins fail to reach the global minimum because of intermediate barriers from states that are easily reached from the unfolded state.
Attempting to predict structures using mechanisms that simulate the physical folding process wastes immense amounts of energy and time sampling very uninteresting areas of conformational space.
You don't want to use a supercomputer to simulate folding; it can be done with a large collection of embarrassingly parallel machines much more cheaply and effectively. I proposed a number of approaches on supercomputers and was repeatedly told no, because the codes didn't scale to the full supercomputer, and supercomputers are designed and built for codes that scale really well on non-embarrassingly-parallel problems. This is the reason I left academia for Google - to use their idle cycles to simulate folding (and do protein design, which also works best using embarrassingly parallel processing).
As far as I can tell, only extremely small and simple proteins (like ribonuclease) fold to somewhere close to their global energy minimum.
Except, you know, if you're trying to understand the physical folding process...
There are lots of enhanced sampling methods out there that get at the physical folding process without running just vanilla molecular dynamics trajectories.
> It also proved that deep learning models are a valid approach to bioinformatics
A lot of bioinformatics tools using deep learning appeared around 2017-2018. But rather than being big breakthroughs like AlphaFold, most of them were just incremental improvements to various technical tasks in the middle of a pipeline.
And since a lot of those tools were only incremental improvements, they disappeared again, imho - what's the point of 2% higher accuracy when it requires a GPU you don't have?
I don't see many DL-based tools regularly applied in genomics these days. Maybe Tiara for 'high level' taxonomic classification, DeepVariant in some papers for SNP calling - that's about it? Some interesting gene prediction tools are coming up, like Tiberius. AlphaFold, of course.
Lots of papers but not much day-to-day usage from my POV.
Most Oxford Nanopore basecallers use DL these days. And if you want a high-quality de novo assembly, DL-based methods are often used for error correction and final polishing.
There are a lot of differences between the cutting-edge methods that produce the best results, the established tools the average researcher is comfortable using, and whatever you are allowed to use in a clinical setting.
AlphaFold doesn’t work for engineering though. Getting a shitty answer ends up being worse than useless.
It seems to really accelerate productivity of researchers investigating bio molecules or molecules very similar to existing bio molecules. But not de novo stuff.
That's just not true. In a lot of cases in engineering, there are 10,000,000 possibilities, and deep learning shows you the 100 potentially more promising ones to double-check, and that's worth huge amounts of money.
In a lot of cases, deep learning can simulate complex systems - at a precision that is more than good enough - which otherwise would not be tractable (as is the case with AlphaFold), and again, this is especially valuable if you can double-check the output.
Ofc, in the fields of language and vision, and in a lot of others, deep learning is straight up the only solution.
Eh, in many cases for actual customer-facing commercial work, they're sticking remarkably close to stuff that's in genbank/swissprot/etc - well-characterized molecules and pathways - because working with genuinely de novo stuff is difficult and expensive. In those cases, AlphaFold works fine - it always requires someone to actually look at the results and see whether they make sense or not, but also "the part of the solution space where the tools work" is often a deciding factor in what approach is chosen.
Agreed. There are too many different directions of impact to point out explicitly, so I'll give a short vignette on one of the most immediate impacts, which was the use in protein crystallography. Many aspiring crystallographers correctly reorganized their careers following AlphaFold2, and everyone else started using it for molecular replacement as a way to solve the phase problem in crystallography; the models from AF2 allowed people to resolve new crystal structures from data measured years prior to the AF2 release.
I agree that it’s not premature, for two reasons: First, it’s been 6 years since AlphaFold first won CASP in 2018. This is not far from the 8 years it took from CRISPR’s first paper in 2012 to its Nobel Prize in 2020. Second, AlphaFold is only half the prize. The other half is awarded for David Baker’s work since the 1990s on Rosetta and RoseTTAFold.
I agree. For those not in biotech, protein folding has been the holy grail for a long time, and AlphaFold represents a huge leap forward. Not unlike trying to find a way to reduce NP to P in CS. A leap forward there would be huge, even if it fell short of a complete solution.
> Let me get the most important question out of the way: is AlphaFold’s advance really significant, or is it more of the same? I would characterize their advance as roughly two CASPs in one
Crispr is widely used and there are even therapies approved based on it, you can actually buy TVs that use quantum dots and click chemistry has lots of applications (bioconjugation etc.), but I don't think we have seen that impact from AlphaFold yet.
There are a lot of pharma companies and drug-design startups actively trying to apply these methods, but I think the jury is still out on the impact they will ultimately have.
AlphaFold is excellent engineering, but I struggle to call this a breakthrough in science. Take T cell receptor (TCR) proteins, which are produced pseudo-randomly by somatic recombination, yielding an enormous diversity. AlphaFold's predictions for those are not useful. A breakthrough in folding would have produced rules that are universal. What was produced instead is a really good regressor in the space of proteins where some known training examples are nearby.
If I were the Nobel Committee, I would have waited a bit to see how this issue aged. Also, in terms of giving credit, I think those who invented pairwise and multiple-alignment dynamic programming algorithms deserved some recognition. AlphaFold built on top of those, and they are the cornerstone of the entire field of biological sequence analysis. Interestingly, ESM was trained on raw sequences, not on multiple alignments. And while it performed worse, it generalizes better to unseen proteins like TCRs.
The value in BLAST wasn't in its (very fast) alignment implementation but in the scoring function, which produced calibrated E-values that could be used directly to decide whether matches were significant or not. As a postdoc I did an extremely careful comparison of E-values to true, known similarities, and the E-values were spot on. Apparently, NIH ran a ton of evolution simulations to calibrate those parameters.
For the curious, BLAST is very much like pairwise alignment but uses an index to speed things up by avoiding attempts to align poorly scoring regions.
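A toy sketch of that seed-and-extend idea (a made-up k-mer index - real BLAST additionally uses scored word neighborhoods, two-hit seeding, and gapped extension):

```python
from collections import defaultdict

# Build a k-mer -> positions index over the database sequence.
def build_index(db_seq, k=3):
    index = defaultdict(list)
    for i in range(len(db_seq) - k + 1):
        index[db_seq[i:i + k]].append(i)
    return index

def find_seeds(query, index, k=3):
    # Only exact k-mer matches become candidate alignments; regions that
    # share no word with the query are never even examined, which is
    # where the big speedup over full dynamic programming comes from.
    return [(j, i)
            for j in range(len(query) - k + 1)
            for i in index.get(query[j:j + k], [])]
```

Each `(query_pos, db_pos)` hit then seeds a local alignment extension; everything else is skipped entirely.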
BLAST estimates are derived from extreme value theory and large deviations, which is a very elegant area of probability and statistics.
That's the key part, I think: being able to estimate how unique each alignment is without having to simulate the null distribution, as was done before with FASTA.
The index also helps, but the speedup comes mostly from the other part.
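Concretely, that other part is the Karlin-Altschul formula, E = K·m·n·e^(-λS). The K and λ below are placeholder values - the real ones are fit per scoring matrix and gap penalties:

```python
import math

# Karlin-Altschul statistics for ungapped local alignments: the expected
# number of chance alignments scoring >= S between random sequences of
# lengths m and n. K and lam are placeholders, not calibrated values.
def e_value(score, m, n, K=0.041, lam=0.267):
    return K * m * n * math.exp(-lam * score)

def bit_score(score, K=0.041, lam=0.267):
    # Normalized score, so that E = m * n * 2**(-bit_score).
    return (lam * score - math.log(K)) / math.log(2)
```

The point is that no simulation is needed at search time: once λ and K are calibrated, a raw score converts directly into an expected number of chance hits.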
Well, I'm sure one could look at the number of published papers etc., but that metric has a lot to do with hype, and I see it as a lagging indicator.
A better one is seeing my grad-school friends with zero background in comp-sci or math, presenting their cell-biology results with AlphaFold in conferences and at lab meetings. They are not protein folding people either - just molecular biologists trying to present more evidence of docking partners, functional groups in their pathway of interest.
It reminds me of when Crispr came out. There were ways to edit DNA before Crispr, but it was tough to do right and required specialized knowledge. After Crispr came out, even non-specialists like me in tangential fields could get started.
In both academic and industrial settings, I've seen an initial spark of hope about AlphaFold's utility being replaced with a resignation that it's cool, but not really useful. Yet in both settings it continued as a playing card for generating interest.
There's an on-point blog-post "AI and Biology" (https://www.science.org/content/blog-post/ai-and-biology) which illustrates why AlphaFold's real breakthrough is not super actionable for creating further bio-medicinal applications in a similar vein.
That article explains why AI might not work so well for biology discoveries further down the line, but I still think AlphaFold can really help with the development of small-molecule therapies that bind to particular known targets and not to others, etc.
The thing with available ligand + protein recorded structures is that they are much, much more sparse than available protein structures themselves (which are already kinda sparse, but good enough to allow AlphaFold). Some of the commonly-used datasets for benchmarking structure-based affinity models are so biased you can get a decent AUC by only looking at the target or ligand in isolation (lol).
Docking ligands doesn't make for particularly great structures, and snapshot structures really miss out on the important dynamics.
So it's hard for me to imagine how AlphaFold can help with small-molecule development (AlphaFold2 doesn't even know what small molecules are). I agree it totally sounds plausible in principle - I've been on a team where such an idea was pushed before it flopped - but in practice I feel there's much less use to extract from there than one might think.
EDIT: To not be so purely negative: I'm sure real use can be found in tinkering with AlphaFold. But I really don't think it has become, or will become, a big deal in small-molecule drug discovery workflows. My PoV is at least somewhat educated on the matter, but of course it does not reflect the breadth of what people are doing out there.
But Crispr actually edited genes. How much of this theoretical work was real, and how much was slop? Did the grad students actually achieve confirmation of their conformational predictions?
Surprisingly, yes - the predicted structures from AlphaFold had functional groups that fit with experimental data on binding partners and homologues. While I don't know whether they matched the actual crystallization, they did match those orthogonal experiments (these were cell biology, genetics, and molecular biology labs, not protein structure labs, so they didn't try to actually crystallize the proteins themselves).
It solidly answered the question: "Is evolutionary sequence relationship and structure data sufficient to predict a large fraction of the structures that proteins adopt?" The answer, surprising few, is that the data we have can indeed be used to make general predictions (even outside of the training classes) - and, surprising many, that we can do so with a minimum of evolutionary sequence data.
That people are arguing about the finer details of what it gets wrong is support for its value, not a detriment.
That's a bit like saying that the invention of the airplane proved that animals can fly, when birds are swooping around your head.
I mean, sure, prior to AlphaFold, the notion that the sequence/structure relationship was "sufficient to predict" protein structure was merely a very confident theory that was regularly used to make the most reliable kind of structure predictions via homology modeling (it was also core to Rosetta, of course).
Now it is a very confident theory that is used to make a slightly larger subset of predictions via a totally different method, but still fails at the ones we don't know about. Vive la change!
I think an important detail here is that Rosetta did something beyond traditional homology models- it basically shrank the size of the alignments to small (n=7 or so?) sequences and used just tiny fragments from the PDB, assembled together with other fragments. That's sort of fundamentally distinct from homology modelling which tends to focus on much larger sequences.
3-mers and 9-mers, if I recall correctly. The fragment-based approach helped immensely with cutting down the conformational search space. The secondary structure of those fragments was enough to make educated guesses of the protein backbone's conformation, at a time when ab initio force field predictions struggled with it.
Yes, Rosetta did Monte Carlo substitution of 9-mers, followed by a refinement phase with 3-mers. Plus a bunch of other stuff to generate more specific backbone "moves" in weird circumstances.
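For a feel of what that loop looks like, here's a toy fragment-assembly sketch - the fragment "library" and score function are crude stand-ins, not Rosetta's:

```python
import math
import random

random.seed(1)

FRAG_LEN = 9

def make_library(n_frags):
    # Stand-in library of random (phi, psi) torsion fragments;
    # Rosetta draws its fragments from the PDB instead.
    return [[(random.uniform(-180, 180), random.uniform(-180, 180))
             for _ in range(FRAG_LEN)] for _ in range(n_frags)]

def score(torsions):
    # Stand-in score favoring helix-like torsions (phi ~ -60, psi ~ -45);
    # Rosetta's actual score functions are far richer.
    return sum((phi + 60) ** 2 + (psi + 45) ** 2 for phi, psi in torsions)

def fragment_assembly(n_res=30, steps=2000, temperature=100.0):
    library = make_library(200)
    torsions = [(180.0, 180.0)] * n_res  # fully extended starting chain
    current = score(torsions)
    for _ in range(steps):
        # The fragment move: splice a whole 9-mer into a random window,
        # then apply the Metropolis criterion to the score change.
        frag = random.choice(library)
        pos = random.randrange(n_res - FRAG_LEN + 1)
        trial = torsions[:pos] + frag + torsions[pos + FRAG_LEN:]
        delta = score(trial) - current
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            torsions, current = trial, score(trial)
    return torsions, current
```

The refinement phase with 3-mers is the same loop with shorter fragments, so moves perturb the backbone more gently.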
In order to create those fragment libraries, there was a step involving generation of multiple-sequence alignments, pruning the alignments, etc. Rosetta used sequence homology to generate structure. This wasn't a wild, untested theory.
I don't know that I agree that fragment libraries use sequence homology. From my understanding, homology implies an actual evolutionary relationship, whereas fragment libraries are agnostic and instead seem to be based on the idea that short fragments of non-related proteins can match up in sequence and structure space. Nobody looks at 3-mers and 9-mers in homology modelling; the alignments there are typically well over 25 amino acids long, and there is usually a plausible whole-domain match (in the SCOP terminology).
But, the protein field has always played loose with the term "homology".
Rosetta used remote sequence homology to generate the MSAs and find template fragments, which at the time was innovative. A similar strategy is employed for AlphaFold’s MSAs containing the evolutionary couplings.
Interestingly, the award was specifically for the impact of AlphaFold2, which won CASP14 in 2020 using the EvoFormer architecture evolved from the Transformer - not for the original AlphaFold, which won CASP13 in 2018 with a collection of separately trained ML models and which, despite winning, performed at a much lower level than AlphaFold2 would two years later.