
So I can use a command-line tool to run queries that process 100 TB of data? The last time I used Hadoop, it was on a cluster with roughly 8 PB of data.

Let me know when I can do it locally.



"Can be 235x faster" != "will always be 235x faster", nor indeed "will always be faster" or "will always be possible".

The point is not that there are no valid uses for Hadoop, but that most people who think they have big data do not have big data. Your use case, by contrast, sounds like it genuinely is big data (for the time being), or is at least at a size where Hadoop is a reasonable tradeoff and judgement call.

As an illustration of people's beliefs on this, here's a Forbes article on Big Data [1] (yes, I know Forbes is now a glorified blog for people to pay for exposure). It uses as its example a company with 2.2 million pages of text and diagrams. Unless those pages are far above average size, they fit in RAM on a single server, or on a small RAID array of NVMe drives.
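
Back-of-the-envelope, with my own assumption of roughly 5 KB per page of plain text: 2.2 million pages x 5 KB is about 11 GB, which fits comfortably in RAM on a commodity server. Even if every page were a scanned image at, say, 200 KB, that's only around 440 GB, i.e. a single NVMe drive.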

That's not Big Data.

I've indexed more than that as a side-project on a desktop-class machine with spinning rust.

The people who think that is big data are the audience for this article, not people with actual big data.

[1] https://www.forbes.com/sites/forbestechcouncil/2023/05/24/th...


For the given question, sure, you can.

There are 60 TB SSDs out there; you might even fit all of the 8 PB on a single server.
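
Rough arithmetic, assuming those 60 TB drives: 8 PB / 60 TB is about 134 drives before any redundancy overhead, which is in the range of one or two dense storage chassis.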


Unix has a split(1) tool.
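
For example, a minimal sketch assuming a GNU userland; huge.tsv, the chunk_ prefix, the column number, and the parallelism are all placeholders:

    # Split a big TSV into 16 chunks without breaking lines (GNU split).
    split -n l/16 huge.tsv chunk_

    # Sum column 3 of each chunk in parallel, then combine the partial sums.
    ls chunk_* \
      | xargs -n 1 -P 16 awk -F '\t' '{ s += $3 } END { print s }' \
      | awk '{ total += $1 } END { print total }'

The same split-then-parallelize idea extends to whatever per-record processing you'd otherwise push through a Hadoop job, as long as the data fits on local disks.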


Have you read the article?



