Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I’ve heard this anecdote on HN before, but without ever seeing actual evidence that it happened it reads like an old wives’ tale, and I’m not sure I believe it.
I’ve worked on a Hadoop cluster, and setting one up and running it takes serious technical skill and experience; that same skill and experience would mean the team wouldn’t be doing it unless they actually needed it.
Can you really imagine some senior data and infrastructure engineers setting up 100 nodes knowing it was for 60GB of data? Does that make any sense at all?
Each node in our Hadoop cluster had 64GiB of RAM (which is about the most you should have for a single-node Java application, with 32G allocated to heap, FWIW). We had, I think, 6 of these nodes, for a total of 384GiB of memory.
Our storage was something like 18TiB across all nodes.
It would be a big machine, but our entire cluster could easily fit on one. The largest machine on the market right now is something like 128 CPUs and 20TiB of memory.
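Back-of-the-envelope, for anyone who wants to check the fit themselves (the cluster numbers are the ones above; the single-machine figures are the rough ones quoted, not a specific model, and the storage ceiling is an assumption):

```python
# Rough sizing check: does the whole cluster fit on one big box?
nodes = 6
ram_per_node_gib = 64
cluster_storage_tib = 18

cluster_ram_gib = nodes * ram_per_node_gib   # 384 GiB total across the cluster
big_box_ram_gib = 20 * 1024                  # ~20 TiB, the quoted high-end figure
big_box_storage_tib = 100                    # assumption: DAS shelves make this easy to reach

print(f"RAM: {cluster_ram_gib} GiB needed, fits: {cluster_ram_gib <= big_box_ram_gib}")
print(f"Storage: {cluster_storage_tib} TiB needed, fits: {cluster_storage_tib <= big_box_storage_tib}")
```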
384GiB was available in a single 1U rackmount server at least as early as 2014.
Storage is basically unlimited with direct-attached-storage controllers and rackmount units.
I had an HP from 2010 that supported 1.5TB of RAM with 40 cores, but it was 4U. I'm not sure what the height has to do with memory, other than that a 1U doesn't have the luxury of backplanes mounted vertically or otherwise above the motherboard, so maybe it's just limited by space?
There are different classes of servers; the 4U ones are pretty much as powerful as it gets, with many sockets (usually 4) and a huge fabric.
1Us are extremely commodity, basically as “low end” as it gets, so I like to use them as a baseline.
A 1U that can take 1.5TiB of RAM might be part of the same series as a 4U machine that could do 10TiB. But those are hugely expensive, both to buy and to run.
> Do you have any examples of companies building Hadoop clusters for amounts of data that fit on a single machine?
I was a SQL Server DBA at Cox Automotive. Some director/VP caught the Hadoop around 2015 and hired a consultant to set us up. The consultant's brother worked at Yahoo and did foundational work with it.
The consultant made us provision 6 nodes for Hadoop in Azure (our infra was on Azure Virtual Machines), each with 1 TB of storage. The entire SQL Server footprint was 3 nodes and maybe 100 GB at the time, and most of that was data bloat. He complained that it was too small a setup.
The data going into Hadoop was maybe 10 GB, and the consultant insisted we do a full load every 15 minutes "to keep it fresh". The delta for a 15-minute interval was less than 20 MB, maybe 50 MB during peak usage. Naturally his refresh script was pounding the primary server and hurting performance, so we spent additional money to set up a read replica for him to use.
Did I mention the loading process took 16-17 minutes on average?
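For contrast, the incremental load he refused to do isn't exotic. Here is a minimal sketch of the watermark approach, assuming a ModifiedDate-style column and pointing at the read replica (table, column, and connection details are hypothetical, not the actual schema):

```python
import datetime
import pyodbc  # any DB-API driver works; pyodbc is the usual choice for SQL Server

# Hypothetical connection string: point it at the read replica, not the primary.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=read-replica;DATABASE=dealerdata;Trusted_Connection=yes"
)
cursor = conn.cursor()

# High-water mark from the previous run; in practice this gets persisted
# somewhere durable rather than hard-coded.
last_loaded = datetime.datetime(2015, 6, 1, 12, 0, 0)

# Pull only the rows that changed since the last run: tens of MB per
# 15-minute window instead of re-reading the whole 10 GB.
cursor.execute(
    "SELECT * FROM dbo.Inventory WHERE ModifiedDate > ?",
    last_loaded,
)
delta = cursor.fetchall()

# Ship `delta` downstream and advance the watermark to the max ModifiedDate
# seen in this batch (omitted here).
print(len(delta), "changed rows since", last_loaded)
```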
You can quit reading now, this meets your request, but in case anyone wants a fuller story:
Hadoop was used to feed some kind of bespoke dashboard product for a customer. Everyone at Cox was against using Microsoft's products for this, even though the entire stack was Azure/.NET/SQL Server... go figure. Apparently they weren't aware of PowerBI, or just didn't like it.
I asked someone at MS (might have been one of the GuyInACube folks, I know I mentioned it to him) to come in and demo PowerBI, and in a 15-minute presentation he absolutely demolished everything they had been working on for a year. There was a new data group director who was pretty chagrined about it; I think they went into panic mode to ensure the customer didn't find out.
The customer, surprisingly, wasn't happy with the progress or outcome of this dashboard, and was vocally pointing out data discrepancies compared to the production system, some of them days or even a week out of date.
Once the original contract was up and it was time to renew, the Hadoop VP had to pay for the project from his own budget, and about 60 days later it was mysteriously cancelled. The infra group was happy, as our Azure expenses suddenly halved and our database performance improved 20-25%.
The customer seemed to be happy, they didn't have to struggle with the prototype anymore, and wow, where did all these SSRS reports that were perfectly fine come from? What do you mean they were there all along?
In 2014 I was at Oracle Open World. A 3rd-party hardware vendor was pitching (and finding customers for) Hadoop "clusters" that had 8 CPU cores. Basically their pitch was that Oracle hardware (ex-Sun) started at a dense full rack for about 1 million USD or so, but with the 3rd party you could have a Hadoop "cluster" in 2U for 20K. The Oracle thing was actually quite price competitive at the time, if you needed Hadoop. The 3rd-party thing was overpriced for what it was.
Yet, I am sure that 3rd party hardware vendor made out like bandits.
I worked at a corp that had built a Hadoop cluster for lots of different heterogeneous datasets used by different teams. It was part of a strategy to get "all our data in one place". Individually, these datasets were small enough that they would have fit perfectly fine on single (albeit beefy for the time) machines. Together, they arguably qualified as big data, and the justification for using Hadoop was that some analytics users occasionally wanted to run queries spanning all of these datasets. In practice, those queries were rare and not very high value, so the business would have been better off just not doing them and keeping the data on a bunch of siloed SQL Servers (or, better, putting some effort into tiering the rarely used data onto object storage).
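The tiering part isn't much work either. A rough sketch of what I mean, assuming the cold extracts already exist as Parquet files (the bucket name and paths are made up, and it's boto3/S3 here, but any object store works the same way):

```python
import boto3

# Hypothetical bucket and file layout: push the rarely-queried extracts out of
# the cluster and onto cheap object storage, where they can still be read back
# for the occasional cross-dataset query.
s3 = boto3.client("s3")

cold_extracts = [
    "exports/regional_sales_2012.parquet",
    "exports/regional_sales_2013.parquet",
]

for path in cold_extracts:
    # Keep the key layout mirroring the source so the data stays discoverable.
    s3.upload_file(path, "corp-cold-data-archive", f"tiered/{path}")
```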
I wonder if companies built Hadoop clusters for large jobs and then also used them for small ones.
At work, they run big jobs on lots of data on big clusters. The processing pipeline also includes small jobs. It makes sense to write them in Spark and run them in the same way on the same cluster. The consistency is a big advantage and that cluster is going to be running anyway.
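The small jobs really are just small Spark jobs, something like this sketch (paths and column names are made up; it's just the shape of the thing):

```python
from pyspark.sql import SparkSession, functions as F

# A deliberately small job, written in Spark purely for consistency with the
# big ones and submitted to the same cluster. Paths and columns are made up.
spark = SparkSession.builder.appName("daily-small-rollup").getOrCreate()

events = spark.read.parquet("hdfs:///data/events/2024-01-01/")  # only a few GB

daily = (
    events.groupBy("customer_id")
          .agg(F.count("*").alias("event_count"),
               F.sum("amount").alias("total_amount"))
)

daily.write.mode("overwrite").parquet("hdfs:///rollups/daily/2024-01-01/")
spark.stop()
```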