> you don’t know what dimensions are important in advance
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.
There's a balance to that. You don't need anything fancier than a Hadoop cluster to store everything. Nowadays you can get that packaged and working out of the box from a number of vendors.
Of course get that data out into analytical datasets is a whole different matter.
But again this is a huge red flag. I've seen so many data science projects that started with "well, let's just get started with collecting everything and we will figure out what is important later on" and then spent so much time on infrastructure that no useful insights were ever produced.