I should have said archival nodes, the ones that keep state back to the genesis block. I don't know if that number is even tracked anywhere. I've read estimates ranging from 2 to 5. I'm trying to find where I read that, happy to be wrong - or right, if anyone has data.
[edit] Here. [1] And here. [2]
After examining every which way we could think of to add the Trie state to our Ethereum state, we asked Vitalik for assistance. His first comment to us was “oh you’re one of the few running one of those big, scary nodes.” We asked him if he knew of anyone else running a “big, scary node” to see if we could possibly sync with them. He knew of no one, not even the Ethereum Foundation keeps a full archival copy of the Ethereum chain. [2].
I've run quite a bit of analytics on ethereum and have downloaded the entire chain multiple times for processing; it's freely available from multiple providers. All the major API providers (infura, etherscan, etc.) have all the raw blocks readily available.
Some Erigon nodes run with pruning enabled. You can't tell which ones those are, or how much pruning.
Technically you can tell which Geth nodes are archive nodes with a GetNodeData query over devp2p, although that call is deprecated and will eventually be removed. Its replacement, GetTrieNodes, cannot be used for this.
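A rough sketch of what that probe looks like (heavily hedged: the `peer` object and its `get_node_data` method are hypothetical stand-ins for a real devp2p stack, not an actual library API; the hash is mainnet's genesis state root):

```python
# Hypothetical sketch: probing a peer for archive-ness with the deprecated
# eth/66 GetNodeData message. `peer.get_node_data` is a stand-in for a real
# devp2p implementation, not an actual library call.

# Mainnet genesis state root: only a node that still has unpruned historical
# state can serve the trie node with this hash.
GENESIS_STATE_ROOT = bytes.fromhex(
    "d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544"
)

def looks_like_archive_node(peer) -> bool:
    nodes = peer.get_node_data([GENESIS_STATE_ROOT])  # hypothetical call
    # An archive node can return the requested trie node; a pruned node
    # can't. GetTrieNodes won't work here because it resolves nodes by path
    # within recent state, not by arbitrary (old) hash.
    return bool(nodes) and nodes[0] != b""
```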
Full nodes also keep state back to the genesis block; it's just stored in delta format, so you could say it's not "unpacked" out onto the disk. It's a common misconception that "full nodes" don't have all this data.
> Every now and then someone will argue on CT that Ethereum full nodes are not complete nodes because archive nodes exist. I decided to run a little experiment to disprove a few things
> The goal was to convert a full node into an archive node, demonstrating that Ethereum full nodes contain all the necessary blockchain data.
> 28 days later, I can confirm that it worked. I started with a 150 GB full node and expanded it to an archive node weighing 2.3 TB, without external network connectivity.
A full node lets you fully verify the chain's historical states and it lets you interact with the current state. Unless you're running a service that exists solely to allow people to query historical states (like a block explorer service), I don't see why it would be useful to be able to query historical state.
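For concreteness, here's a hedged web3.py sketch of that split (the endpoint and address are placeholders, not from this thread):

```python
# Hedged sketch: what a plain full node serves without complaint (current
# state) versus what typically fails on it (deep historical state).
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # assumed local full node
addr = Web3.to_checksum_address("0x" + "de" * 20)      # placeholder address

print(w3.eth.get_balance(addr))              # current balance: works
print(w3.eth.get_proof(addr, [], "latest"))  # Merkle proof vs latest root: works

# The same balance query at a deep historical block on a pruned node will
# typically error out ("missing trie node"); that's the archive gap:
# w3.eth.get_balance(addr, block_identifier=1_000_000)
```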
You need an archival node to see a list of all transactions that transfer eth into an address.
A full node can only give you the current balance and a list of all transactions that directly transfer eth to that address. Any transaction that transfers eth as a side effect of a smart contract call is invisible.
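For the curious, digging those side-effect transfers out looks roughly like this on a tracing-capable node (a hedged sketch: `trace_block` is the OpenEthereum/Erigon-style RPC, the endpoint and target address are placeholders, and for old blocks you generally need archive state):

```python
# Hedged sketch: list ETH moved into an address by contract-internal calls
# using the trace_block RPC. Endpoint and TARGET are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
TARGET = ("0x" + "ab" * 20).lower()  # placeholder address of interest

def internal_eth_transfers(block_number):
    resp = w3.provider.make_request("trace_block", [hex(block_number)])
    for trace in resp.get("result") or []:
        action = trace.get("action", {})
        if (trace.get("type") == "call"
                and trace.get("traceAddress")  # non-empty: an internal call
                and action.get("to", "").lower() == TARGET
                and int(action.get("value", "0x0"), 16) > 0):
            yield trace["transactionHash"], int(action["value"], 16)
```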
I personally see it as a flaw in the design of eth. You shouldn't need the complete history of states just to find all relevant transactions, but you do.
Besides, the argument that regular users shouldn't need to query such information doesn't change the fact that the information is unqueryable in a full node, short of spending 28 days transforming it into an archival node.
I'll give you that. If you need to query a list of all contract transactions that have ever transferred ETH to your address, I believe you would need an archive node to do so, although don't quote me on that.
> Besides, the argument that regular users shouldn't need to query such information doesn't change the fact that the information is unqueryable in a full node, short of spending 28 days transforming it into an archival node.
If you don't need to query the data, then the data doesn't have to be unpacked and indexed for querying. Seems simple to me.
It's kind of misleading to claim the archival data is "packed." It's not compressed into some archival format; instead, the full node contains all the inputs needed to regenerate the data.
To transform into an archival node, a full node has to rewind to the very first block, and replay every single transaction.
Since the EVM is Turing-complete, this is roughly equivalent to simulating a computer with years of recorded keyboard and mouse inputs, taking care to record how each input affects the state of the computer.
You can't jump to the middle, you have to replay the whole thing.
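A minimal sketch of that replay loop, with hypothetical names standing in for a real client's state-transition machinery:

```python
# Hypothetical sketch: rebuilding archival state from a full node's data.
# `apply_transaction` stands in for full EVM execution; `blocks` must be
# iterated from genesis onward, because each state depends on the last.

def rebuild_archive(blocks, genesis_state, apply_transaction):
    state = genesis_state
    roots = {}                                    # block number -> state root
    for block in blocks:                          # blocks 0, 1, 2, ... in order
        for tx in block.transactions:
            state = apply_transaction(state, tx)  # the expensive part
        roots[block.number] = state.root()        # record the historical root
    return roots                                  # no way to seek or skip ahead
```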
I don't think it's misleading to call Git history "packed", and the mechanism for regenerating historical states is similar to Ethereum's (though of course Git's delta function is changeset-only, with no Turing-completeness). In fact, Git calls its own delta storage "packfiles".
The EVM is a very simple and rudimentary virtual computer, so replaying the whole thing isn't an impossible task. According to the tweet, it took this guy's computer 28 days to replay 4 years of history.
Git also adds snapshots to the mix, which makes it possible to rapidly jump to fixed points in history and only use deltas for the fine-grained seek. Git also has indexes to find stuff.
Git justifies the viability of its "packing scheme" by actually making everyday use of it.
A full eth node has no snapshots or useful indexes into the archival data. It has to apply the deltas linearly from the beginning. Applying the deltas is very slow, very IO bound, seeking all over the disk.
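To make the contrast concrete, here's a hedged sketch of the Git-style scheme described above (hypothetical names, an arbitrary interval): keep a full state copy every K blocks, so seeking to block N replays at most K blocks instead of the whole chain.

```python
# Hypothetical sketch: snapshot-plus-delta seek, Git-style. This is the
# scheme a full eth node lacks, per the point above.

SNAPSHOT_INTERVAL = 10_000  # K, an arbitrary illustrative value

def state_at(n, snapshots, blocks, apply_transaction):
    base = (n // SNAPSHOT_INTERVAL) * SNAPSHOT_INTERVAL
    state = snapshots[base]                       # O(1) jump to a fixed point
    for block in blocks[base + 1 : n + 1]:        # fine-grained seek via deltas
        for tx in block.transactions:
            state = apply_transaction(state, tx)
    return state                                  # at most K blocks replayed
```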
The data might be there, but it's practically useless. A user who discovers they need some archival data is never going to consider waiting weeks for the nearly 7 years of history to be replayed before running their query. Instead they will head over to etherscan and trust whatever it says.
Those all sound like local database features that one could add to an Ethereum client if they found them useful enough to bother, they aren't protocol-level concerns or "flaws in the design of eth" as you put it earlier.
> The data might be there, but it's practically useless.
The availability of the packed data is useful, just not to the end user of the node. Having this data widely available on the network means that anyone can spin up an archive node by peering with other full nodes, they don't need to discover and peer with the very limited number of other archive nodes, and the network doesn't need to worry about losing that data permanently if all archive nodes go offline.
> A user who discovers they need some archival data is never going to consider waiting weeks for the nearly 7 years of history to be replayed before running their query. Instead they will head over to etherscan and trust whatever it says.
Call me unprincipled but I don't think it's an issue that if a user needs data above and beyond what's needed to fully verify the chain and read and write to it, they're expected to either spin up a more resource-intensive node or retrieve the data from a specialized history service. Statelessness is on the roadmap, so in the long-term the historical data that Etherscan and similar services serve up to you will come with a validity proof anyways.
I'm fine with you dropping the principles of decentralization and accepting that the current situation is ok.
You can construct many great arguments that the increased centralization is a good thing, or that the upsides are better than the downsides.
What I take issue with is attempts to classify ethereum "Full Nodes" as more than what they are. Yes, they technically contain all the information required to reconstruct an archival node (at least until statelessness becomes a thing).
They are simply not anywhere near the same thing, and attempts to brand them as more or less the same thing just come across as denial.
> They are simply not anywhere near the same thing, and attempts to brand them as more or less the same thing just come across as denial.
They are the same thing specifically when it comes to:
* Downloading, verifying, and storing every transaction that has ever happened on the network
* Maintaining a tamper-proof, data-complete copy of the blockchain
* Interacting with the blockchain in a maximally verified, maximally secure way
I never said that they were exactly the same thing or that they should be branded as the same thing, I said that they store the same data (by which I mean from an information-theoretic standpoint), which is true.
> What I take issue with is attempts to classify ethereum "Full Nodes" as more than what they are.
I take issue with the attempts to classify them as less than what they are.
What needs to be squashed is the common idea in the OP that "full nodes are not actually full" because there's a "fuller" "archive" node that has the states indexed on-disk. The difference between a full node and an archive node is performant historical queryability, not security or data-completeness.
OP says that "access to Ethereum is effectively gate-kept by two centralized entities", which is untrue because you don't need an archive node to access Ethereum, only a full node. OP's idea that an archive node is the only "true Ethereum full-node" is common baloney that pops up often in the cryptocurrency community.
This is where the difference between theory and reality starts to become an issue.
Yes, in theory the full node contains the full blockchain. Yes, it's all you need to verify that any transaction happened. Yes, it's tamper-proof.
But in reality, it can't show you the full side effects of every transaction. In reality there are occasionally things that require archival data. In reality, it's always easier to go to a centralised block explorer, or pay one of the few centralised API services (and I know this from experience: I synced a full archival node back in 2019 and built a product that required querying it. It was such a pain that these days I'd highly recommend not doing that and just paying for API access).
In reality, the fact that you occasionally need to go to etherscan to get the data you need results in you just going to etherscan anyway, even for the simpler queries when you have a perfectly fine full node sitting there (again, personal experience). Hell, etherscan actually provides more data than an archival node; where else are you going to find the source code for contracts?
In reality... Most people don't even run light nodes. They certainly don't run full nodes. They just use etherscan, or whatever API their 3rd party wallet uses.
That's why in reality, access to ethereum is partially centralised around API providers. Yes, in theory anyone can go around them, set up their own node or create a competing API service at any time. But that's not what happens in reality, and when it comes to the topic of centralisation vs decentralisation, I'd argue that reality is far more important than theory.