I wouldn't necessarily agree that the same feature comparison made against Hive would also apply to Impala. For example, Impala uses HDFS short-circuit reads to read data directly from local disk, which gives it full disk throughput. Combined with highly efficient parallel reads, this yields some impressive numbers.
I've seen queries speed up anywhere from 2x to 100x (especially when the data set fits in memory). Since it's designed for low-latency queries, results can come back in under a second.
That said, Impala does not currently support UDFs (slated for post-GA).
Hive has done JOIN optimizations since 0.7.0, though (https://issues.apache.org/jira/browse/HIVE-1642): setting "hive.auto.convert.join = true" lets it convert a join to a map-side join automatically when one of the tables is small enough. I believe this will be enabled by default eventually. By GA, Impala will have a cost-based optimizer for optimizing JOINs as well.
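As a rough sketch of what turning that on looks like in a Hive session (the table and column names here are hypothetical, made up purely for illustration):

```sql
-- Ask Hive to convert eligible joins to map-side joins automatically.
SET hive.auto.convert.join=true;

-- Hypothetical tables: `orders` is assumed large, `countries` small.
SELECT o.order_id, c.country_name
FROM orders o
JOIN countries c ON (o.country_code = c.country_code);
```

With the setting off, I'd expect the same query to run as a regular reduce-side join unless you add a MAPJOIN hint yourself.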
PS: Congrats on the release, I'm looking forward to giving it a go :)