What was the data store you were working on? Is it open source?
My experience was the same as you.
What helped me was discovering all the fantastic storage and file system papers coming out of the University of Wisconsin Madison, supervised by Remzi and Andrea Arpaci-Dusseau.
Their teams have studied and documented almost all aspects of what is required to write reliable storage systems, even diving into interactions between local storage failures and global consensus protocols, how a single disk block failure can destroy Raft and Zookeeper. Most safety testing of these systems tends to focus on the network fault model. I think in a few years time we'll all look back and see how today we had almost no concept of a storage fault model. It's kind of exciting to think that there's going to be a new breed of replicated databases that are far more reliable than today's systems. On the another hand, perhaps the future is already here, just not very evenly distributed.
My experience was the same as you.
What helped me was discovering all the fantastic storage and file system papers coming out of the University of Wisconsin Madison, supervised by Remzi and Andrea Arpaci-Dusseau.
Their teams have studied and documented almost all aspects of what is required to write reliable storage systems, even diving into interactions between local storage failures and global consensus protocols, how a single disk block failure can destroy Raft and Zookeeper. Most safety testing of these systems tends to focus on the network fault model. I think in a few years time we'll all look back and see how today we had almost no concept of a storage fault model. It's kind of exciting to think that there's going to be a new breed of replicated databases that are far more reliable than today's systems. On the another hand, perhaps the future is already here, just not very evenly distributed.
http://pages.cs.wisc.edu/~remzi/
Their OSTEP book (Operating Systems in Three Easy Pieces) is also a great fun read: http://pages.cs.wisc.edu/~remzi/OSTEP/