Would love to see if indexes and a sane schema were used for the RDBMS case. I've built extremely large reporting databases (Dimensional Modeling techniques from Kimball) that perform exceedingly well for very ad hoc queries. If your query patterns are even somewhat predictable and occur frequently, it's far better to have a properly structured and indexed database than to use the "let's analyze every single data element on every single query!" approach that is implicit with Hadoop and MR.

Not to mention the massive cost savings from using the right technology with a small footprint, versus a brute-force approach on a large cluster of machines.
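
To make the "properly structured and indexed" part concrete, here's a minimal sketch of the kind of Kimball-style star schema I mean (SQLite standing in for the real RDBMS; all table and column names are made up, not an actual production schema):

    # Minimal sketch: a clickstream fact table at the per-publisher,
    # per-minute grain, joined to conformed dimensions, with an index on
    # the columns queries actually filter on. Names are illustrative only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_date (
        date_key     INTEGER PRIMARY KEY,   -- e.g. 20110117
        full_date    TEXT,
        day_of_week  TEXT,
        is_holiday   INTEGER
    );
    CREATE TABLE dim_publisher (
        publisher_key INTEGER PRIMARY KEY,
        name          TEXT
    );
    CREATE TABLE fact_clicks (
        date_key      INTEGER REFERENCES dim_date (date_key),
        minute_of_day INTEGER,
        publisher_key INTEGER REFERENCES dim_publisher (publisher_key),
        clicks        INTEGER,
        revenue       REAL
    );
    -- index the access paths you expect; the optimizer handles the rest
    CREATE INDEX ix_clicks_date_pub ON fact_clicks (date_key, publisher_key);
    """)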




"exceedingly well for very adhoc queries" vs "If your query patterns are even somewhat predictable and occur frequently"

Aren't those... largely opposite? I would take "very ad hoc" to mean largely "not predictable". Also, can you perhaps quantify "very large"? How many terabytes? I've done a decent amount of work with the "new school" OLAP approaches (Hadoop/MapReduce, etc.) and found them to work quite well, especially in certain cases such as time series (think weblog analysis), where a simple sequential scan is an effective approach.


In a dimensional modeling approach you need to identify ahead of time which elements/attributes users will query on. This still supports ad hoc queries -- "Show me minutely clicks & revenue from 11am-1pm on Mondays in 2011, except holidays, for publishers (A, B) against advertisers (A, B, C)." As long as you define the grain of your data, adding new attributes to dimensions is very easy and flexible.
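
Continuing the toy schema from my sketch above (same made-up names, advertiser dimension elided for brevity), that quoted question turns into a single indexed SELECT rather than a scan over raw logs:

    # Assumes the toy star schema sketched earlier in the thread; 'conn' is
    # that sqlite3 connection. Minutes 660-780 correspond to 11:00-13:00.
    query = """
    SELECT f.minute_of_day,
           SUM(f.clicks)  AS clicks,
           SUM(f.revenue) AS revenue
    FROM fact_clicks f
    JOIN dim_date      d ON d.date_key      = f.date_key
    JOIN dim_publisher p ON p.publisher_key = f.publisher_key
    WHERE d.full_date BETWEEN '2011-01-01' AND '2011-12-31'
      AND d.day_of_week = 'Monday'
      AND d.is_holiday  = 0
      AND f.minute_of_day BETWEEN 660 AND 780
      AND p.name IN ('Publisher A', 'Publisher B')
    GROUP BY f.minute_of_day
    ORDER BY f.minute_of_day;
    """
    # conn.execute(query).fetchall()  -- runs against the schema sketched above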

I guess I'm comparing it against better-suited-for-brute-force cases where someone is analyzing a log file for really random, one-time questions: "Show me hits to this particular resource from IP addresses which match this pattern, where the user-agent contains Safari, the response time is larger than 300ms, and the response size is less than 100 KB!" While you could fit this data easily into a DM, you'd need to plan ahead for that sort of querying. If it's an infrequent need, it makes more sense to process the logs sequentially (even if it's across 10,000 machines in a Hadoop cluster).
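
For contrast, this is the kind of throwaway, full-scan filter I have in mind -- one pass over raw access logs, no schema, no indexes. The field positions are assumptions about a combined-log-style format with a trailing response time, not any particular system:

    # Filter stdin line by line; everything here is a sketch, adjust the
    # field positions and thresholds for whatever your logs look like.
    import re
    import sys

    ip_pattern = re.compile(r"^10\.42\.")        # "IPs matching this pattern"

    def matches(line):
        fields = line.split()
        if len(fields) < 12:
            return False
        try:
            resp_bytes = int(fields[9])          # response size in bytes
            resp_ms    = int(fields[11])         # response time in ms
        except ValueError:
            return False
        return (fields[6] == "/some/resource"
                and ip_pattern.match(fields[0])
                and "Safari" in line             # crude user-agent check
                and resp_ms > 300
                and resp_bytes < 100000)

    for line in sys.stdin:
        if matches(line):
            sys.stdout.write(line)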

When I left the company, our Greenplum cluster (so a bit of both worlds: an RDBMS cluster that automatically parallelizes queries across multiple nodes and aggregates the results) was around 500 TB. This was scaled up from a single MySQL instance, though, which was seeing around 5 million new rows per day for one particular business channel.

I'm not suggesting that "new school" approaches don't work, or that they aren't fast. What I am suggesting is this: MR is a very naive approach that is only "fast" because it executes the problem in parallel across many nodes. If you have datasets which are going to be queried often in similar ways, you should take advantage of the past ~30 years of innovation in RDBMSs instead of masking the difficulty by throwing a lot of CPU (and therefore money) at the problem and solving it in the most inefficient manner possible. It pains me to see people coming up with overly complex "solutions" to basic OLAP needs on Hadoop-based or even NoSQL platforms instead of simply using the right tool for the job.
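
A toy illustration of what that innovation buys you: the same predicate is a full scan without an index and an index seek with one. SQLite again stands in for any RDBMS here, and the table and data are made up:

    # Same query, before and after adding an index.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts INTEGER, publisher TEXT, clicks INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                     [(i, "pub%d" % (i % 50), i % 7) for i in range(100000)])

    def plan(sql):
        # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
        return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

    q = "SELECT SUM(clicks) FROM events WHERE publisher = 'pub7'"
    print(plan(q))       # something like: SCAN events        (brute force)
    conn.execute("CREATE INDEX ix_pub ON events (publisher)")
    print(plan(q))       # something like: SEARCH events USING INDEX ix_pub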

That said, for the one-off cases where it doesn't make sense to build out a schema and an ETL pipeline and manage a database because it's a very niche or one-time need: that's where the real value of Hadoop/MR comes into play.



