Not a single mention of Cloudera Impala in the article? Competition in this space is great! Would be great to know how this offering compares to Impala.
Thanks for bringing this up. A lot of what we say in the FAQ for "How does CitusDB's feature set compare against Apache Hive?" also applies to Impala, and we'll update that question shortly. The fundamental difference is that Citus builds on top of Postgres, and leverages its many features and performance optimizations.
We are also working on getting performance numbers that thoroughly compare Hive, Impala, and Citus, and we'll share our methodology and results in the coming months.
I wouldn't necessarily agree that the feature comparison against Hive also applies to Impala. For example, Impala uses HDFS short-circuit reads and can read data directly from disk, which gives it full disk throughput. Combined with highly efficient parallel reads, this yields some impressive numbers.
I've seen queries speed up anywhere from 2x to 100x (especially when data sets fit in memory). Since it's designed for low-latency queries, results can come back in the sub-second range.
With that being said, Impala does not currently support UDFs (slated for post-GA).
Hive does do JOIN order optimizations after 0.7.0 though (https://issues.apache.org/jira/browse/HIVE-1642); you can set "hive.auto.convert.join = true" to enable it. I believe this will be enabled by default eventually. By GA, Impala will have a cost-based optimizer for optimizing JOINs as well.
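For anyone unfamiliar with what that setting buys you: as far as I understand it, hive.auto.convert.join lets Hive turn a regular shuffle join into a map-side join when one of the tables is small enough to fit in memory. Here is a rough Python sketch of the idea only, not Hive's actual code; the function name and the sample tables are made up for illustration.

```python
from collections import defaultdict

def map_side_join(small_rows, big_rows, key):
    """Join two lists of dicts on `key` by hashing the smaller one in memory.

    Hypothetical helper for illustration; it is not part of Hive.
    """
    # Build phase: index the small table by its join key.
    index = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)

    # Probe phase: stream the big table once, with no sort or shuffle.
    joined = []
    for big in big_rows:
        for small in index.get(big[key], []):
            joined.append({**small, **big})
    return joined

# Made-up sample data: a small dimension table and a larger fact table.
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
clicks = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 1, "url": "/b"},
    {"user_id": 3, "url": "/c"},
]
result = map_side_join(users, clicks, "user_id")
```

The big table is read exactly once and never has to be repartitioned, which is where the speedup comes from when the small side really is small.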
PS: Congrats on the release, I'm looking forward to giving it a go :)
Hive is not meant for real-time queries. Hive would merely serve as a baseline for comparison; what will be interesting is how it compares against Impala. And Hadapt, as @mwexler points out, and maybe also Redshift and BigQuery :)
Is that really accurate? I had perceived things as the other way around, as Hadapt has a mixed storage model and CitusDB uses external tables for everything...
Congrats to the Citus Data team on a big release. These guys know distributed databases backwards and forwards. Excited to see how this product stacks up against Hive.
It definitely seems interesting. The big problem I've been looking for the right tool to fix is simple document-based storage with solid secondary indices supporting aggregate queries. A SQL syntax is a big plus for this, because it is very easy for many people to write a SQL GROUP BY statement to get the aggregates they want, but much harder to write an ElasticSearch or Solr facet query or a MapReduce job, especially if you want relatively fast results.
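To make the GROUP BY point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any SQL engine. The table, column names, and rows are invented for the example and say nothing about how CitusDB itself stores or indexes data.

```python
import sqlite3

# In-memory database standing in for whatever SQL engine you use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("ann", "books", 12.5),
        ("bob", "books", 7.5),
        ("ann", "music", 3.0),
    ],
)

# A secondary index on the column we aggregate by.
conn.execute("CREATE INDEX idx_events_category ON events (category)")

# The whole aggregate is one declarative statement.
rows = conn.execute(
    "SELECT category, COUNT(*), SUM(amount) "
    "FROM events GROUP BY category ORDER BY category"
).fetchall()

for category, n, total in rows:
    print(category, n, total)
```

The equivalent facet query or MapReduce job would need a mapper, a reducer, and some glue code for what is a single statement here.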