Something that I have always found fascinating is that DNA is a base-4 information format. There's this thing called radix economy, which is basically a measure of how efficient a number system is: the base multiplied by the number of digits needed to write a number. Base e is the theoretical optimum, and so base 3 is the most efficient integer base.
Obviously if you have a special use case, then that may dominate your choice of base (like hex, b64, etc...), but for general-purpose information storage, base 3 comes first, and base 2 and base 4 tie behind it (asymptotically, 2/ln 2 = 4/ln 4).
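For anyone who wants to check the ranking, here is a rough sketch. It assumes the usual definition of radix economy (base times digit count), which the comment doesn't spell out; the function names are made up for illustration.

```python
import math

def digits_needed(n: int, base: int) -> int:
    """Number of base-`base` digits needed to write the positive integer n."""
    count = 0
    while n > 0:
        n //= base
        count += 1
    return count

def radix_economy(n: int, base: int) -> int:
    """Assumed definition: E(b, N) = b * (number of digits of N in base b)."""
    return base * digits_needed(n, base)

def asymptotic_cost(base: float) -> float:
    """Large-N cost per unit of ln(N), i.e. b / ln(b); smallest at b = e."""
    return base / math.log(base)

for b in (2, math.e, 3, 4, 10, 16):
    print(f"base {b:>6.3f}: b/ln(b) = {asymptotic_cost(b):.4f}")

# Cost of writing one million in bases 2, 3 and 4.
print([radix_economy(10**6, b) for b in (2, 3, 4)])
```

Lower is better here. Base 3 edges out the others, and base 2 and base 4 come out equal in the limit, since 4/ln 4 = 2/ln 2.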
This presents a lot of interesting questions to me. Like, why didn't DNA end up as base 3? (Probably because 4 naturally lends itself to pairs of 2.)
Also, this idea of radix economy goes beyond just the encoding of information and is represented in logical economy as well. So for example, ternary logic is (much) more efficient than binary logic. Having that 3rd state just makes problem solving much more elegant.
To that end, I have always wondered how nature has exploited this 4-state number system logically. Like, are there all sorts of exotic logic gates that come from a 4 state system?
Why did we end up with only 20 proteinogenic amino acids? Why are vertebrate neural architectures inverted (cell bodies on the inside, connections on the outside), even though the other way round (e.g. the way a squid's brain is organised) is easier and less inhibitive to growth?
2 Reasons:
a) Because nature and evolution cannot engineer. Random mutation, recombination and natural selection are the only mechanisms available. Things get selected if they outcompete existing alternatives, they don't need to be the best solutions.
b) All solutions have to be built by modifying what already exists. Evolution doesn't get to do greenfield projects, because anything that has to start from scratch is so disadvantaged in natural selection compared to already evolved complex life, it will fail.
This leads to systems that, from an engineering point of view, don't always make a lot of sense.
E.g. the architecture of the vertebrate neural system creates a lot of issues (e.g. our light-sensitive cells point in the wrong direction). The only way this makes any sense is when one looks at how the neural tube (the precursor to the brain and spinal cord) is formed by the ectoderm folding in on itself. This process is so deeply at the root of the Chordata, and so many other things depend on it, that it simply cannot change any more.
Many, many biological systems are "legacy systems" in the truest sense of the word: solutions produced a long time ago that may have many problems, but are so deeply enmeshed with everything that came after that they are now impossible to change.
A classic armchair response. DNA has complementary nucleotides (AT, GC) that facilitate its pairing. Base 3 wouldn’t work in that sense. Also, you can’t forget about the genetic code. See https://arxiv.org/pdf/q-bio/0605036.pdf for interesting thoughts. Remember, evolutionary biology is a field and people think about these questions!
This is pretty smug for someone who seems to have managed to miss the point entirely. Yes, DNA has certain features that require a base 4 system. That is not necessarily true of all possible systems with DNA-equivalent function, which is the point this whole thread is making.
How have I missed the point? The answer that nature cannot engineer and can't start de novo are trivially true statements that provide no actual insight into the question. I fully agree the original question itself is a deep one. A quick literature search is more productive than pontificating with weak analogies. See https://www.math.unl.edu/~bdeng1/Papers/DengDNAreplication.p... for what seems to be an interesting analysis regarding base number and DNA replication rate.
> that provide no actual insight into the question
Mind elaborating on that?
Because there is no biochemical reason why DNA could not have incorporated, say, a third pairing pair, so while base-3 (which I don't specifically mention in my post btw.) wouldn't work, base 6 or 8 would have been possible. "Unnatural Base Pairs" are even known to work in laboratory settings.
There is also no biochemical reason why base-2 life wouldn't work. Expand the reading frame of the translation machinery to 5 instead of 3, and you have enough coding space for polypeptides.
My answer addresses the question completely, because the only reason behind these "decisions" is an ancient system that simply got "frozen", and now cannot change any more.
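The reading-frame arithmetic is easy to check. A minimal sketch, assuming the code has to cover 20 amino acids plus one stop signal (the 21 is my assumption, and `min_codon_length` is a made-up name):

```python
def min_codon_length(alphabet_size: int, n_symbols: int = 21) -> int:
    """Smallest codon length L such that alphabet_size ** L >= n_symbols.

    n_symbols defaults to 21: 20 amino acids plus one stop signal (assumed).
    """
    length = 1
    while alphabet_size ** length < n_symbols:
        length += 1
    return length

for bases in (2, 3, 4, 6, 8):
    length = min_codon_length(bases)
    print(f"{bases} bases: codon length {length}, {bases ** length} possible codons")
```

With 2 bases a frame of 5 gives 32 codons, which is where the "5 instead of 3" figure comes from. With 6 or 8 bases, 2 positions per codon would already be enough.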
> There is also no biochemical reason why base2 life wouldn't work.
are you sure about that? are you sure there's no weird effects that might destabilize very long sequences of 2-nucleotide DNA? or on how wide DNA-binding domains have to be to cope with reduced information density, and how that might sterically hinder smaller arrangements of proteins?
> My answer addresses the question completely, because the only reason behind these "decisions" is an ancient system that simply got "frozen", and now cannot change any more.
your answer is just a hypothesis, not a proof. these things can be studied (by studying abiogenesis in-vitro), and it's not certain these decisions were "flash frozen" like you describe. 2-, 4-, and 6- nucleotide coding systems might have coexisted in the RNA world, and 4- could have won out for some reason.
Yes, I am sure about that, because I used to study Biology before going into IT. And we had a lovely lecture in which we used to discuss theoretical setups for lifeforms at a molecular level.
2-nucleotide DNA isn't necessarily less stable. AT-rich domains have fewer hydrogen bonds (2 per pair), but if stability is the issue, use CG instead (3 hydrogen bonds per pair)...although that is also a compromise, because then opening DNA for transcription gets more difficult.
> your answer is just a hypothesis, not a proof.
My answer is what we observe in evolutionary biology.
I have given an example outside of the molecular world for a reason. There is no real advantage to the inversion of the neural architecture in Chordata, it just didn't matter when the neural tube formation mechanisms came to be. Now, with mammals having huge brains and complex sensory organs, the warts in that design show.
The proof for that is easy to come by (also a reason, btw., why the neural inversion is my favorite example for this): look at any Protostomia. Their neural system isn't inverted. Consequently, squids don't have a visual blind spot.
your example of the blind spot is quite elegant and convincing. I think it's partly so convincing because there's a large fossil record and diverse phylogenetic tree, with many gaps covered. conversely, we're missing direct evidence for the pre-LUCA era, and what we have is bottlenecked. this makes me more skeptical.
for instance, I've seen arguments that the codon mapping, and even the particular set of protein-coding amino acids, that we ended up with was arbitrary, but I've also read papers arguing that the amino acids include a sort of spanning set of different structural scaffolds with different polarity that happen to mesh well with DNA, and that the particular choices of codons were influenced by how the RNA t-acyl transferases arose, etc.
so, I'm still unconvinced, but I find this area fascinating to read about.
Idk enough about this discussion to argue it, but his hypothesis does not imply your second point couldn't be true.
> your answer is just a hypothesis, not a proof. these things can be studied (by studying abiogenesis in-vitro), and it's not certain these decisions were "flash frozen" like you describe. 2-, 4-, and 6- nucleotide coding systems might have coexisted in the RNA world, and 4- could have won out for some reason.
His hypothesis is, at least in part, “4- won for some reasons for which we have no explanation, and it stayed that way for some reason [that we may or may not know].” I suppose the reason would be that 4- was somehow better suited for the particular use-case at the time.
Of course there’s a ton of interesting details to discuss and discover, and whether multiple systems coexisted is one of many fascinating things to discuss, and his response never said otherwise.
Short answer: Likelihood of noise (brownian motion) producing the element and keeping it interacting. Then once it gets going, likelihood of keeping state, while interacting.
Left-hand side is a vertebrate eye, right-hand side is a squid's eye.
In Vertebrates (really in all Chordata), the light sensitive "tips" of the sensory cells point inwards, aka. the exact wrong direction. At the base of the cells are the axons (nerve connections) which transmit the information into the brain.
Due to the aforementioned orientation, these axons run along the outer layer of our light-sensitive cells, and at some point have to travel "inwards" towards the brain. At that point there can be no cell bodies, and that's the "visual blind spot" of our eyes.
A squid's eye doesn't have that problem; all the light-sensitive cells point outwards, the axons are at the innermost layer, and connectivity can be achieved without a blind spot (also, they don't need a reflective layer).
I had to look this up, and I guess what usrbinbash was referring to was the layout of the retina, which places the rods and cones behind layers of transparent neurons.
Yet, it doesn't really have a strong impact as it's been determined that humans can see individual photons and we aren't dependent on night vision for hunting.
It doesn't have a strong impact, and the design also doesn't prevent good night vision (the basic structure of a cat's eye is similar; ALL Chordata have an inverted neural makeup).
But that doesn't mean the setup makes sense, and that is exactly my point.
And long term, this has an impact. For example, vertebrate brain size is limited by the simple factor, that we have to put all the connections on the outside. The more neuronal bodies we have, the more connections they require.
N <---> N
In this clumsy diagram, 2 neurons talk with the connections on the inside. However, vertebrate brains have to do this instead:
+--------------+
|              |
+-> N      N <-+
It's easy to see how the second setup becomes prohibitive when more Neurons are added to it. The brains of Protostomia again don't have that problem...they can have the connections on the inside, and the neuron bodies on the outside, aka. the logical setup.
Now there are ways around that, eg. Reptile and Bird brains grow in bulbs that theoretically allow sustained growth without the connective layer getting in the way. But similar to the reflective layer in our eyes, this is not a setup that's there because it makes a lot of sense...it's a hack, a workaround for some "legacy system", that is now so enmeshed, it's impossible to change.
It’s a bit anthropocentric to talk about not making sense from an engineering point of view. One example is the recurrent laryngeal nerve, which appears to people to take an unnecessary detour because of what is thought to be its evolutionary history. But there is deep wisdom and insight we have gleaned from this, and I think it’s not for us to say we could engineer it better: we don’t have the total knowledge of tools yet, and it is dismissive and, I would say, disrespectful of the wondrous biological systems that have been made to sustain life.
Hubris and attempts to alter inherent nature are often tied up ironically. But we can benefit a lot more from biological humility, realizing there are many unknown unknowns.
DNA has the same limitation that many serial protocols have: if you repeat the same base pairs (e.g. "AAAAAAAAAAAAAAAAAAAAAA") you will have trouble w/ the DNA not spiraling correctly. Some sequences of 2-6 repeated base pairs seem to "deliberately" cause variant behavior in DNA and RNA, see
DNA coding for real proteins is unlikely to be too terribly repetitive, but I imagine a long α helix could have a repetitive amino acid sequence. Many amino acids can be coded with variant codons, so I guess if repetition were a problem in a particular gene, natural selection could step in.
DNA is not really processed like that, afaik. Mostly, each 3 bases code for an amino acid, which are glued together to a string (protein), which folds in a 3D structure based on the characteristics of all amino acids.
Some DNA is used to attract other proteins, or even interact with DNA elsewhere on the strand, or is transcribed to RNA (one-to-one) which can then have a function based on its sequence or the structure it folds into.
On paper this might be an interesting game, but you have to think of things in terms of crystal structure, what is able to form hydrogen bonds, what ends up being sterically hindered and what that means for the molecule. This is why Watson, Crick and Franklin's work was so seminal: it showed how genetic information is inherited through the mechanical logic of these molecules alone. Before the structure of DNA was solved, there were a lot of competing theories over which molecule was the source of heritable information, and how exactly this information was passed down between generations.
Might error correction play a role? Might a slightly inefficient base-4 system provide the surplus capacity needed to carry error-correcting code information?
DNA mostly relies on the fact that there's 2 strands that are (logically speaking) a mirror copy of each other (a C is paired with a G and vice versa, an A to a T and vice versa), it's like RAID 3 with only 2 disks (one being parity).
Apart from repairing structural damage such as missing bonds, the cell can even repair missing bases or non-straight breaks without loss. This mechanism is also used for replication: the entire strand is split and each half is completed with its mirror counterpart.
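To make the mirror-copy idea concrete, here is a toy sketch. It ignores strand direction (real strands are antiparallel) and uses `?` as a made-up marker for a damaged base; all names are illustrative, not how any real repair enzyme or library works.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Base-by-base complement; anything unreadable becomes '?'."""
    return "".join(COMPLEMENT.get(base, "?") for base in strand)

def repair(damaged: str, partner: str) -> str:
    """Fill each '?' in `damaged` from the base paired with it on `partner`.

    If both strands are damaged at the same position, the base stays '?'.
    """
    return "".join(
        COMPLEMENT.get(p, "?") if d == "?" else d
        for d, p in zip(damaged, partner)
    )

def replicate(strand_a: str, strand_b: str):
    """Split the duplex and complete each half with a new mirror strand."""
    return (strand_a, complement(strand_a)), (complement(strand_b), strand_b)

original = "ATGCCGTA"
partner = complement(original)
print(repair("AT?CC?TA", partner))   # recovers the original strand
print(replicate(original, partner))  # two identical duplexes
```

The limit of the scheme shows up in `repair`: a position lost on both strands at once cannot be recovered from the pairing alone.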
I'm aware of that, but I was rather thinking about ECCs like Hamming codes, which are able to correct errors in a single sequence based on surplus information in that same string.
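For reference, this is the kind of in-string error correction meant here: a textbook Hamming(7,4) sketch on plain bits. Nothing here is claimed to exist in DNA; it only illustrates how 3 surplus bits let a 7-bit word fix any single flipped bit without a second copy.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 1-based; 0 means no error seen
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = encode([1, 0, 1, 1])
word[4] ^= 1          # flip one bit "in transit"
print(decode(word))   # the original data bits come back
```

The price is 7 stored bits per 4 bits of payload, which is the sort of surplus capacity the question is asking about.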
BUT, if you look at the codon table, precisely because it's base-4 and not base-3, many base flips are silent when coded.
By using base-4, there's enough space to permit lossiness of the coding itself - given the number of amino acids and the 3-NT encoding.
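A quick way to see how much slack the codon table has is to count the silent single-base changes. The sketch below hard-codes the standard genetic code in the usual TCAG ordering and counts stop-to-stop changes as silent, which is my choice, not something from the thread.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def silent_substitutions():
    """Count single-base substitutions that keep the encoded amino acid,
    broken down by codon position (index 0, 1, 2), plus the total tried."""
    silent = [0, 0, 0]
    total = 0
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                if CODON_TABLE[mutant] == aa:
                    silent[pos] += 1
    return silent, total

silent, total = silent_substitutions()
print(f"silent by position: {silent}, {sum(silent)} of {total} substitutions")
```

Almost all of the silent changes sit in the third codon position, which is the redundancy being pointed at here.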
So you really aren't optimizing JUST for nucleotide encoding, but you're also optimizing in concert with 3-nt/AA, and 20AA codes.
So if you have to optimize for information density and fidelity, given X-nucleotides, Y nucleotides/AA, and Z AAs, and sample as much chemical and physical diversity in those AAs life has settled upon:
X=4, Y=3, Z=20.
If we went with X=3, you might need Y=4 to get the same kind of fidelity, but that cranks up your energy costs by about 33% (from 3 to 4 NT per AA).
If you change one of those numbers, you'll need to rejigger the rest, and you'd need to reoptimize. And there are competing goals which at least include:
- maximize access to biophysical/chemical diversity
- minimize energy expenditure to produce each component, chemically
- minimize energy expenditure to both copy instructions & produce products
- maximize information fidelity
- minimize or at least degrade gracefully in the context of errors
In the context of a 3-base system, you very well could throw off those optimizations given the consequences for the other 2 parameters (#AA & nt/AA). 3^3 = 27, which leaves very little room above the 20 amino acids that have to be encoded. That means you'd probably need a 4nt->AA translation layer to keep the same number of AAs with comparable redundancy, and that alone would add about 33% more energy expenditure. If you kept the 3nt->AA system you'd BOTH need to reduce the number of accessible amino acids AND you'd lose some of the error tolerance that comes from having degenerate codons code for the same amino acid.
One is that base 4 makes a lot of sense for the stability of DNA structures. You have two purines and two pyrimidines.
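The X/Y/Z trade-off can be put into numbers with a back-of-the-envelope sketch. It only looks at codon counts and message length and treats "energy" as proportional to nucleotides per amino acid, which is a simplifying assumption on my part; `scheme` is a made-up helper.

```python
N_AMINO_ACIDS = 20  # Z from the comment; the stop signal is ignored for simplicity

def scheme(n_bases: int, codon_len: int) -> dict:
    """Summarise a hypothetical coding scheme with X = n_bases, Y = codon_len."""
    codons = n_bases ** codon_len
    return {
        "codons": codons,
        # Average number of codons available per amino acid (redundancy).
        "codons_per_aa": codons / N_AMINO_ACIDS,
        # Nucleotides per amino acid relative to the real 3-nt codon.
        "relative_length": codon_len / 3,
    }

for x, y in ((4, 3), (3, 3), (3, 4), (2, 5)):
    s = scheme(x, y)
    print(
        f"X={x}, Y={y}: {s['codons']} codons, "
        f"{s['codons_per_aa']:.2f} per AA, "
        f"{s['relative_length']:.2f}x length"
    )
```

X=3, Y=3 leaves 27 codons for 20 amino acids, so almost no redundancy. X=3, Y=4 restores the redundancy with 81 codons, at 4/3 the length of every coding sequence.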
Another is that partly because codons are degenerate, the distribution is way off a uniform distribution. For chemistry and mol bio reasons, the distribution of AGTC is very skewed.
When I fully wake up, this might be a fun blog post to draft.
Maybe it has something to do with protein folding, since apparently scientists do not understand the physical mechanisms that translate a given DNA sequence into a given 3D shape. The best they can do is infer probabilistic patterns using ML as a proxy for a mechanism of action. Maybe there's something quantum at work that means it will never be fully deterministic.
Why do my eyelashes, meant to protect my eyes, fall into my eyes? Why do my cheeks/tongue sometimes get in the way of my teeth so that I bite them? And why do they then get inflamed so that I continually bite them for the next few days?
We are all a bunch of biological goop resulting from random processes. Don't expect optimal solutions from evolution. There is no "why".
There is a why and reasons we find and insights we glean. The denial of a why is itself an anthropocentric take on biology, influenced by Darwinian thought. But any biological system we study we need a why, otherwise the corollary of your statement essentially leads to no learning or understanding, because everything becomes “arbitrary” and explained away by randomness.
Ironic to say it’s not optimal, really we don’t have the full knowledge. Often when we learn more we learn how little we really know about biology.
It's probably not base four because you have to stretch out more pairs to match up four pairs and that's entropically disfavored. However, ribosomes can accommodate a four-pair matching, though at a very reduced yield (unless you think Schultz's postdoc fabricated those data).
I believe "on" and "off" in electronics typically correspond to different voltage levels. So you absolutely could have a third intermediate state if you wanted to. Flash memory does this (and even sometimes has 4 states). I guess designing switches (transistors) that could take advantage of and propagate these extra states could be tricky though.
I’m also failing to see how digit efficiency would be important in DNA. In fact, it seems that a high base system would be more efficient. If you had 80 nucleobases instead of 4, each base pair would contain far more information
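To put a number on "far more information": each position in a base-b alphabet carries log2(b) bits. A small sketch (function names are made up) comparing 4 and 80 nucleobases:

```python
import math

def bits_per_base(n_bases: int) -> float:
    """Information carried by one position in an alphabet of n_bases symbols."""
    return math.log2(n_bases)

def bases_needed(n_bits: int, n_bases: int) -> int:
    """Smallest sequence length that can distinguish 2 ** n_bits messages."""
    length, capacity = 0, 1
    while capacity < 2 ** n_bits:
        capacity *= n_bases
        length += 1
    return length

for b in (2, 4, 80):
    print(f"{b} bases: {bits_per_base(b):.2f} bits per position, "
          f"{bases_needed(1000, b)} positions for 1000 bits")
```

So 80 bases would give a bit over three times the density of 4 bases per position. This only measures sequence length and says nothing about the cost of making and telling apart 80 different molecules.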
Strictly speaking, you could encode more error-correction bits into a higher nucleobase count but that'd pretty much require intelligent design, and wouldn't necessarily be viable for microorganisms to have that many proteins handling all the metabolic scaffolding.
The efficiency comes from the ratio of the alphabet to the number of character places needed to express them. Otherwise why not base a million? Or a billion?
This ratio is what leads to base e being the theoretical maximum.