[Disclaimer: I used to work at D. E. Shaw Research from 2011-2016]
The early Anton 1 numbers of 17us/day on 100K atoms were a huge leap forward at the time. Back then, GPU-based simulations (e.g. GROMACS or Desmond on GPU) were doing single-digit ns/day. Remember, even for 'fast-folding' proteins, the relaxation time is on the order of us, and you need 100s of samples before you can converge statistical properties, like folding rates [0]. Anton 2 got a 50-100x speed-up [1], which made it much easier to look at druggable pathways. Anton was also used for studying other condensed matter systems, such as supercooled liquids [2].
Your question of why this is so slow or small is a good one. One of the reasons we have to integrate the dynamical equations (e.g. Newtonian or Hamiltonian mechanics) at small, femtosecond timesteps (1 fs = 1e-15 s) is that the fastest bond vibrations (e.g. hydrogen stretches) have periods on the order of 10 fs, and the timestep must be roughly an order of magnitude smaller to resolve them. Given that you also have to compute Omega(n^2) pairwise interactions between n particles, you end up with a large runtime to get to ns and us while respecting bond vibrations. The hard part for atomistic/all-atom simulation is that n is on the order of 1e5-1e6 for a single protein plus its surrounding water (tens of thousands of water molecules). The water molecules are extremely important to simulate explicitly, since you need to get polar phenomena, such as hydrogen bonding, correct to get folded structures and druggable sites right to angstrom precision (1e-10 meters). If you don't do atomistic simulations (e.g. n is much smaller and you ignore complex physical interactions, including semi-quantum interactions), you have a much harder time matching precision experiments.
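To make the arithmetic concrete, here is a deliberately naive sketch in Python (toy parameters, and nothing like the spatial decomposition, Ewald/FFT electrostatics, or fixed-point arithmetic that Anton actually uses) of an all-pairs force evaluation inside a velocity Verlet integrator, with the step-count arithmetic spelled out at the end:

    import numpy as np

    # Toy all-pairs MD step, for scaling intuition only.
    def lennard_jones_forces(pos, eps=1.0, sigma=1.0):
        """O(n^2) pairwise Lennard-Jones forces for n particles in 3D."""
        n = len(pos)
        forces = np.zeros_like(pos)
        for i in range(n):
            for j in range(i + 1, n):
                rij = pos[i] - pos[j]
                r2 = np.dot(rij, rij)
                sr6 = (sigma * sigma / r2) ** 3
                # -dU/dr of 4*eps*((sigma/r)^12 - (sigma/r)^6), projected along rij
                f = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2 * rij
                forces[i] += f
                forces[j] -= f
        return forces

    def velocity_verlet(pos, vel, mass, dt, n_steps):
        """Integrate Newton's equations with a femtosecond-scale timestep dt."""
        f = lennard_jones_forces(pos)
        for _ in range(n_steps):
            vel += 0.5 * dt * f / mass
            pos += dt * vel
            f = lennard_jones_forces(pos)
            vel += 0.5 * dt * f / mass
        return pos, vel

    # Scale of the problem: 1 us of dynamics at a 2 fs timestep is 5e8 steps,
    # each costing ~n^2/2 pair evaluations for n ~ 1e5-1e6 atoms.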
[0] https://science.sciencemag.org/content/334/6055/517
[1] https://ieeexplore.ieee.org/abstract/document/7012191/ [the variance comes from the fact that different physics models and densities cause very different run times -> evaluating 1/r^6 vs. 1/r^12 in fixed precision is very different w.r.t. communication complexity and Ewald times and FFTs and ...]
[2] https://pubs.acs.org/doi/abs/10.1021/jp402102w
This explanation is interesting. Thanks for sharing it. While reading it, I got the impression that the simulation is not fully quantum mechanical, but rather classical with select quantum mechanical effects.
Which parts of quantum mechanics are idealised away and how do we know that not including them won't significantly reduce the quality of the result?
Are you perhaps adding stochastic noise to the simulations and repeating them multiple times, in the hope that whatever disturbance is caused by the idealisation of the model is covered by the noise?
That's a good question, and there are a number of ways to try to tackle it. One of the main reasons you cannot do QM simulations directly is that the high-quality methods can cost Omega(n^6/eps) to get eps relative accuracy (you can do better with DFT, but then you're making your life hard in other ways). At a high level (and I mean the 50,000 ft. level), here are the simplest approaches:
1) Do quantum mechanics simulations of interactions of a small number of atoms — two amino acids, two ethanol molecules. Then fit a classical function to the surface E[energy(radius between molecules, angles)], where this expectation operator is the quantum one (over some separable Hilbert space). Now use the approximation for E[energy(r, a)] to act as your classical potential.
- Upshot: You use quantum mechanics to decide a classical potential for you (e.g. you choose the classical potential that factors into pairs such that each pair energy is 'closest' in the Hilbert space metric to the quantum surface); a toy version of this fitting step is sketched after this list.
- Downside: You're doing this for small N, so it ignores triplet and higher interactions. You're also missing the variance and other higher moments (which is usually fine for biology, FWIW, but not for, say, the Aharonov-Bohm effect).
2) Path integral methods: This involves running the classical simulation for T timesteps, then sampling the 'quantum-sensitive pieces' (e.g. highly polar parts) in a stochastic way. This works because a Wick rotation lets you go from the real-time weight e^{i L}, for a Lagrangian density L, to the Euclidean weight e^{-L} [0]. You can sample the latter density via stochastic methods to add an SDE-like correction to your classical simulation. This way, you simulate the classical trajectory and have the quantum portions 'randomly' kick that trajectory based on a real Lagrangian.
3) DFT-augmented potentials: A little more annoying to describe, but think of this as a combination of the first two methods. A lot of the "Neural Network for MD" work falls closest to this category [1].
[0] Yes, assume L is absolutely continuous with respect to whatever metric-measure space and base measure you're defined over :) Physics is more flexible than math, so you can make such assumptions and avoid thinking about nuclear spaces and atomic measures until really needed.
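To make (1) concrete, here is a minimal sketch of the fitting step, assuming you already have a table of dimer separations and energies from an electronic-structure code (the 'QM' data below is synthetic and simply stands in for such a scan):

    import numpy as np
    from scipy.optimize import curve_fit

    def lj_pair(r, eps, sigma):
        """Classical Lennard-Jones pair energy: the functional form we fit."""
        sr6 = (sigma / r) ** 6
        return 4.0 * eps * (sr6 * sr6 - sr6)

    # Hypothetical dimer scan: separations (angstrom) and 'QM' energies (kcal/mol).
    r_scan = np.linspace(3.0, 8.0, 40)
    qm_energy = lj_pair(r_scan, eps=0.2, sigma=3.4) + np.random.normal(0.0, 0.005, r_scan.size)

    # Least-squares fit: the classical parameters 'closest' to the QM surface.
    (eps_fit, sigma_fit), _ = curve_fit(lj_pair, r_scan, qm_energy, p0=[0.1, 3.0])
    print(f"fitted eps = {eps_fit:.3f} kcal/mol, sigma = {sigma_fit:.2f} A")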
> Upshot: You use quantum mechanics to decide a classical potential for you (e.g. you choose the classical potential that factors into pairs such that each pair energy is 'closest' in the Hilbert space metric to the quantum surface) - Downside: You're missing the variance.
Couldn't the quantum mechanical state become multimodal such that the classical approximation picks a state that is far away from the physical reality?
And couldn't this multimodality get exacerbated during the actual physical process, possibly arriving at a number of probable outcomes that are never predicted by the simulation? Is there more than hope that this doesn't happen?
Yes, for sure. In practice (and not at the 50,000 ft. level), you do try to include the multimodalities; you don't _really_ just use E[quantum_energy(r)]. But you ARE still reliant on some computable/smooth/Lipschitz moment and/or expectation from the quantum surface. The argument for why you get away with this in biological simulation is somewhat heuristic, but of the following form:
- Most quantum field theories are described by a Lagrangian of the form L(E), where E is an energy scale ["effective field theory"], and the Lagrangian changes as E changes.
- When E is low, L(E) is classical mechanics & EM
- When E is around 1GeV, L(E) is the aforementioned plus QED
- When E is around 100GeV, L(E) has the aforementioned plus some QCD
- When E is at 1 TeV, L(E) has the aforementioned plus Higgs-like stuff
Now biology is on the lowest end of that scale, so you mainly have to deal with QED and perturbative electronic expansion. These electronic expansions are the most important part — you need them to get hydrogen bonding + electrodynamic molecular interactions correct — BUT they are highly local.
This locality is what you take advantage of when you normalize — you find from QM that the potentials only matter when the two charged/polar molecules are close, so you try to make a classical potential that has quantum 'jumps' when these things are close.
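As a concrete, purely illustrative example (not from any real force field) of what making the potential 'jump' only at close range can look like, here is a short-range correction term that is smoothly switched off beyond a cutoff:

    import numpy as np

    def switch(r, r_on=3.0, r_off=4.0):
        """C^1 switching function: 1 below r_on, 0 above r_off (angstrom)."""
        x = np.clip((r_off - r) / (r_off - r_on), 0.0, 1.0)
        return x * x * (3.0 - 2.0 * x)  # smoothstep

    def short_range_correction(r, depth=1.5, r0=2.9, width=0.3):
        """Toy hydrogen-bond-like well near r0, killed outside the cutoff."""
        return -depth * np.exp(-((r - r0) / width) ** 2) * switch(r)

    r = np.linspace(2.5, 5.0, 6)
    print(short_range_correction(r))  # nonzero only for close, polar contacts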
Do you miss the purely quantum stuff? Aharonov-Bohm, Chern classes, and the like? Of course. But from a practical standpoint, you do get the structures that you measure in experiments to be correct, because the 'cool' quantum behaviour with 'tons' of states is less important for pedestrian things at low energy scales.
It is still hard to get right though! There's a lot of entropy you need to localize correctly and in some sense, you have to make sure you get the modes as a function of local particle positions correct.
The final thing to point out is that the Wick-rotated path integral stuff works much better for biology than for real HEP-type problems, because molecules are confined to low energies: those tunneling probabilities are O(h E), and log(E) is still dwarfed by -log(h), so you _can_ safely ignore them.
This is not true for things like circuits, however, because the lithography at EUV scales (3nm -_-) does have tunneling issues at high field strengths.
tl;dr: Biology has some saving graces that give you good approximations. Are they perfect? No, but if you ever find a case where I have to compute a vanishing first Chern class in a noisy, ugly biological system, you deserve a Nobel Prize!
I think it's still an open question whether you truly need an ensemble of long simulations (versus many shorter ones that can be generated in an embarrassingly parallel fashion on GPUs).
Nonetheless, I'm thankful for both Desmond and Anton, which helped push the MD community out of its moribund state in the early 2000s. I still don't think MD simulations produce anything that exceeds the opportunity value of the power they consume, and it seems unclear that they ever will, although I would still love to know the answer to the question: "could sophisticated classical force fields with reasonable O(n log n) scaling ever be truly valuable as a virtual screening feature generator/evaluation mechanism?"
I agree that it is still unsolved from a practical perspective. I think both sides (long MD vs. ensemble sampling) have incorporated techniques that the other side uses to improve sampling efficiency and accuracy. At the same time, I suspect that the sampling methodology only works with some form of structured guidance, whether it be MD or embeddings (a la AlphaFold, which has a few former DESRES people working on it). The raw Monte Carlo methods that people use for 'embarrassing' parallelism often have terrible scaling: the spectral gaps of the Markov chains are abysmal, and you only realize that after 1000s of core-hours.
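To see why a bad spectral gap hurts no matter how many chains you run in parallel, here is a toy two-state chain (made-up transition matrix) whose relaxation time is roughly 1/gap:

    import numpy as np

    def spectral_gap(P):
        """Gap between the leading eigenvalue (= 1) and the next-largest modulus."""
        eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
        return 1.0 - eigvals[1]

    eps = 1e-4  # tiny probability of hopping between two metastable basins
    P = np.array([[1.0 - eps, eps],
                  [eps, 1.0 - eps]])

    gap = spectral_gap(P)
    print(f"spectral gap = {gap:.1e}, relaxation time ~ {1.0 / gap:.1e} steps")
    # gap = 2e-4 here, i.e. ~5,000 steps just to decorrelate this 2-state toy,
    # and adding more independent chains does not shorten that time.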
On the other hand, DESRES had been focusing on a lot of acceleration methods that involved hardware optimized HMC-esque methodologies that had reasonably good parallelism. AFAICT the only public description of this work is in the appendices of this paper [0].
At the end of the day, you probably need both techniques because the pure sampling approaches lose fine structure (e.g. binding pockets opening up with anomalous frequency due to water clusters) whereas the standalone MD model has too long of a decorrelation time to get averages to converge.
This seems extremely contrived and at best misleading, if it is meant to be more than an exercise to teach someone to code. If there were a function that could produce subsets of primes in a sequence via a C-infinity function (as you are doing via your trained neural net), then it is extremely, extremely unlikely that all of the known pseudo-randomness properties of prime numbers and their k-point functions would hold. In some ways, this would be a sort of converse to the Green-Tao theorem: one could 'easily' produce subsets of primes contained in arithmetic or geometric sequences. Can you provide a better justification for this project? I tend to find 'approximating primes via neural networks' a silly combination of buzzwords that ignores all of the known facts about primes. At best, this project appears to perform a numerical experiment of the claim:
"For all epsilon > 0, there exists an easy to compute C-infinity function (that can depend on epsilon) that can take an arithmetic/geometric/arithmetico-geometric sequence and produce, with high probability, a subset of primes of this sequence whose volume relative to all primes in this sequence is at least 1 - epsilon"
I agree this approach is unlikely to be fruitful. I think the author doesn’t realize how frequent primes are (for example, primes are a lot denser than square numbers).
If you do the minimum to avoid obvious non-primes, i.e. skip numbers divisible by 2 or 5, you can expect to find an N-digit prime after checking about N random N-digit numbers (1), so finding a 4,000-digit one after experimenting for a while doesn't indicate an ability to find primes.
(1) the density of primes around 10ⁿ is about 1/ln(10ⁿ), so you expect to find a prime after ln(10ⁿ) random samples, and
ln(10ⁿ) = n * ln(10) ≈ 2.3 * n
Avoiding even numbers and multiples of five gives you back a factor of 2.5, more than offsetting that factor of ln(10) ≈ 2.3.
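A quick numerical check of that estimate (assuming sympy is available for primality testing):

    import random
    from sympy import isprime

    def trials_to_prime(n_digits, rng=random):
        """Count random n-digit candidates (not divisible by 2 or 5) until a prime."""
        attempts = 0
        while True:
            x = rng.randrange(10 ** (n_digits - 1), 10 ** n_digits)
            if x % 2 == 0 or x % 5 == 0:
                continue  # skipping the obvious non-primes is free
            attempts += 1
            if isprime(x):
                return attempts

    # Prediction: ~ln(10^N)/2.5 ~ 0.92*N attempts on average.
    n_digits = 100
    runs = [trials_to_prime(n_digits) for _ in range(50)]
    print(f"mean attempts: {sum(runs) / len(runs):.1f} (predicted ~{0.92 * n_digits:.0f})")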
While the analysis contained in this article is rather arbitrary, blustery and imprecise, there is a mechanism by which blockchains can serve as 'partially synchronized clocks.' One interesting facet of Bitcoin is that the model under which you analyze the effect of network delays can give you different results about how unsynchronized local copies of the blockchain can become. In the paper "Analysis of Blockchains in Asynchronous Networks" [0], a certain (and more realistic) model is used in which adversarial miners are allowed to delay forwarding the blocks that they receive. The goal of this model is to describe how networking changes the probability of selfish mining [1]. The authors of [0] prove that Bitcoin, under some assumptions about the maximum delay an adversary can introduce and the block production rate, is consistent: there is both a bound on how much the chains of different participants can differ and an upper bound on how far ahead a single participant can grow their chain. Note that all of these statements have to be interpreted as statements "with high probability" relative to security parameters.
In [0], the authors argue that Bitcoin, in particular, is a partially-synchronized global clock, and they take advantage of it in [2] to make a hybrid blockchain-DB system that provides more traditional eventual consistency guarantees.
NOTE: I'm not an author of any of these papers, just a casual observer
The diagrams remind me of performing a Dehn twist. Perhaps what you're doing at each grouping corresponds to a Dehn twist on the Tonnetz? Of course, since the Tonnetz is a particular subdivision of the torus, only certain groupings that satisfy some divisibility relations will correspond to Dehn twists.
I should point out that OpenAI put out a blog post / paper suggesting that genetic algorithms can perform similarly to backpropagation for Atari/Q-Learning tasks:
Doesn't this article feel like it was written for Yahoo's benefit? Since they have a blinded bidding process (See [1]), it seems as if this article could help drum up higher bids. Maybe Yahoo is the anonymous source that provided the "hot tip" about commodity traders on Yahoo Messenger.
I don't know about the timing, but I have a relative in the industry and he told me it's actually the case. Everyone and their mom is on Yahoo chat when it comes to commodities.
Confirming: an ex-gf who is a commodity trader (coal, energy, and oil lately) mentioned Yahoo chat all the time. Most of them are not very tech-savvy, so they're happy with it as long as it works.
I'm not sure it's a positive story. It suggests that Yahoo will migrate these customers to a new platform that doesn't support the features they like and legally need.
> Berlekamp also created an algorithmic trading system that he sold to Jim Simons and that became a starting point for Renaissance Technologies, now the world’s largest hedge fund.
As far as I know, they only have to disclose to the SEC how much money they have invested in equities. Since they don't publicly state total assets under management, we can't really know how much money Renaissance manages. They could have loads more than $52b invested in debt, commodities, currencies, etc.