
> Drives do fail, and if a single drive failure brings down prod how long would that take to fix?

You already failed if that's happening.

Are we really at such a degenerate level of sysadmin competence that we've forgotten even what RAID is?



> degenerate level of sysadmin competence that we've forgotten even what RAID is

At the risk of troll-feeding, what are you hoping to accomplish with this? Of course I haven't "forgotten even what RAID is", and I'm confident my competence is not "degenerate".

In this magical world where we can fit the entire "data lake" on one box, of course we can replicate with RAID, but you've still got a SPOF. So this only works if downtime is acceptable, which I'll concede maybe it could be, iff this box is somehow, magically, detached from customer-facing systems.

But they never really are. Even assuming there aren't ever customer-impacting reads from this system, downtime in the "data lake" means all the systems which write to it have to buffer (or shed) data during the outage. Random, frequent off-nominal behavior is a recipe for disaster IME. So this magic data box can't really be detached.

I've only ever worked at companies which are "always on" and have multi-petabyte data sets. I guess if you can tolerate regular outages and/or your working data set is so small that copying it around willy-nilly is acceptably cheap, go for it! I wish my life was that simple.


I'm certainly not trolling, but unfortunately I think you've completely misunderstood the context of this entire discussion.

If you really have multi-petabyte datasets then you are probably at the scale where distributed storage and systems will be superior.

The point of this conversation is that most people are not at this scale but think they are, i.e. they sincerely believe a dataset can't fit in the RAM of a single box because it's 1 TiB, or that a distributed system is the only solution because the data doesn't fit on a single 16 TiB drive.

The original post is an argument about exactly that: a single node can outcompete a large cluster, so you should avoid clustering until the data really cannot fit on a single box anymore.

Your addendum was that reliability is a large factor. Mostly this does not bear much resemblance to reality. You might be surprised to learn that reliability follows a curve: you get quite close to high reliability with a single machine, you diminish it enormously with a naive distributed system, and you only start approaching higher reliability once you have put a lot more effort into your distributed system.
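
A back-of-envelope sketch of that curve, where the 99.9% per-node availability and the component counts are assumptions for illustration, not numbers from anyone's fleet:

    from math import comb

    node = 0.999  # assumed availability of one well-built machine

    # Single box: you're up whenever that one machine is up.
    single = node

    # Naive distributed system: five components that must ALL be up
    # (coordinator, metadata store, three storage nodes, ...).
    naive = node ** 5

    # Carefully engineered system: three replicas, any 2 of 3 form a quorum.
    quorum = sum(comb(3, k) * node**k * (1 - node)**(3 - k) for k in (2, 3))

    print(f"single box    : {single:.5f}")   # 0.99900
    print(f"naive cluster : {naive:.5f}")    # ~0.99501 (worse than one box)
    print(f"2-of-3 quorum : {quorum:.7f}")   # ~0.9999970 (better, with effort)

The dip in the middle is the point: adding machines without adding the engineering makes you less reliable than the single box you started with.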

My comment about RAID was simply because it's very obvious that a single drive failure should not be taking a single machine down; similarly, a CPU fault or memory fault can also be configured to not take down a machine. That you didn't understand this is either a failing of our industry's knowledge, or, if you did understand it, the comment was disingenuous and intentionally misleading, which is worse.

I've also only worked at companies that were "always on", but that's less true than you might think.

I have never worked anywhere that insisted that all machines are on all the time, which is really what you're arguing for. There is no reason to have a processing box turned on when there's no processing required.

Storage and aggregation: sure, those are live systems and should be treated as such, but it is never a single system that both ingests and processes. Sometimes they share the same backing store, but usually there is an ETL process, and that ETL process is elastic, bursty, etc., and its outputs are what people actually build their reports from.


RAID is not sufficient to protect against data loss. If anything, it can provide a false sense of protection.


RAID is literally designed to prevent data corruption using parity computed from the data, and it gives resilience in the event of drive failures, even intermittent ones.
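
For anyone who hasn't looked at it in a while, a minimal sketch of the RAID-5-style parity idea, with made-up block contents and a 3-data-plus-1-parity layout:

    data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]        # three data "drives"
    parity = bytes(a ^ b ^ c for a, b, c in zip(*data))   # the parity "drive"

    # Simulate losing drive 1, then rebuild it by XOR-ing the survivors
    # with the parity block.
    survivors = [data[0], data[2], parity]
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(*survivors))
    assert rebuilt == data[1]  # the lost blocks come back bit for bit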

Like all "additional components", RAID controllers come with their own quirks, and I have heard of rare cases of RAID controllers being the cause of data loss, but RAID as a concept is designed to combat bit-rot and lossy hardware.

ZFS, in the same vein, is also designed around this concept and attempts to join RAID, an LVM, and a filesystem so it can make "better" choices about how to handle blocks of data. Since plain RAID only sees raw blocks and is not volume- or filesystem-aware, there are cases where it's slower.

That said, I also have to mention that when I was investigating HBase there was no way to force data to be durably written: there was no fsync() call in the code, it only writes to the VFS, and you have to pray your OS flushes the page cache to disk before the system fails. HBase's equivalent of parity is configured through HDFS, which is essentially doing exactly what RAID does, except only through the VFS and without parity bits.
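
For anyone unfamiliar with the distinction, this is it in a nutshell (plain Python, hypothetical file name):

    import os

    with open("wal.log", "ab") as f:          # hypothetical write-ahead log
        f.write(b"some record\n")
        f.flush()             # pushes Python's buffer into the kernel page cache
        os.fsync(f.fileno())  # forces the page cache out to stable storage
    # Without the fsync(), a crash or power loss here can silently drop the record.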


Would it be fair to say that "preventing data loss", broadly speaking, requires defense in depth, and that RAID alone is not sufficient?

If so, then both things in the gp are true: RAID isn't enough, and it can give a false sense of security.


Given the context, why would RAID and HDFS not be equivalent?


RAID is distributed across drives on one machine. That whole machine can fail. Plus, it can take a while to recover the machine or array, and it is common for another drive to fail during recovery.

HDFS is distributed across multiple machines, each of which can have RAID. It is unlikely that enough machines will fail at the same time to lose data.
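
Rough numbers to show the shape of that argument; the failure rates below are assumptions for illustration, not measurements:

    p_drive   = 0.03  # assumed chance a given drive dies this year
    p_machine = 0.05  # assumed chance a whole machine is lost this year

    # Single box with RAID-5 over 4 drives: survives any one drive failure,
    # but not the loss of the machine itself; we also crudely count any year
    # with two or more drive failures as a loss (second failure mid-rebuild).
    two_plus_drives = 1 - (1 - p_drive)**4 - 4 * p_drive * (1 - p_drive)**3
    raid_loss = p_machine + (1 - p_machine) * two_plus_drives

    # HDFS with 3x replication: a block is lost only if all three machines
    # holding its replicas die (ignoring re-replication, which helps further).
    hdfs_loss = p_machine ** 3

    print(f"single box + RAID-5 : ~{raid_loss:.4f}")   # ~0.0549
    print(f"HDFS, 3x replication: ~{hdfs_loss:.6f}")   # 0.000125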


I believe it's essentially equivalent, and that neither RAID nor HDFS is good enough to exist without backups.



