Great article, thank you for sharing! I have a point I’d like to discuss with the author. Spark SQL is a great product and works well for batch processing. For ad hoc queries or more interactive data analysis, however, Spark SQL can run into performance issues. If you have such workloads, I suggest trying data lake query engines like Trino or StarRocks, which are typically faster for these workloads and give a better interactive query experience.
I believe Snowflake not only needs an open-source table format but also an open-source data lake query engine. Currently, Snowflake has too little presence in this area, and its stock price is reaching new lows.
I suggest you try using StarRocks to query the data lake directly. I know many large companies, including Tencent and Pinterest, are doing this. StarRocks has a truly mature vectorized execution engine and a robust distributed architecture. It can provide you with impressive query performance.
“Specifically, online analytics workloads would be migrated to Druid/StarRocks”: I'm very interested in this part and look forward to learning more about it.
As long as the United States is the world’s leading superpower, there will always be a second place trying to challenge its position. Previously it was Japan, now it is China, and there will be other countries in the future.
Is ClickHouse a suitable engine for analyzing events? Absolutely, as long as you're analyzing a large table, its speed is definitely fast enough. However, you might want to consider the cost of maintaining an OSS ClickHouse cluster, especially when you need to scale up, as the operational costs can be quite high.
If your analysis in Postgres was based on multiple tables and required a lot of JOIN operations, I don't think ClickHouse is a good choice. In such cases, you often need to denormalize multiple data tables into one large table in advance, which means complex ETL and maintenance costs.
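To make the denormalization point concrete, here is a minimal sketch of what "one large table in advance" means. It uses Python's built-in sqlite3 only so it runs anywhere; the `events`, `users`, and `plans` tables and their columns are made-up examples, not anything from the original post, and a real pipeline would write the wide table into ClickHouse instead.

```python
import sqlite3


def build_wide_table(conn):
    """Pre-join the normalized tables into a single wide table.

    Schema is hypothetical: events -> users -> plans.
    """
    conn.executescript("""
        DROP TABLE IF EXISTS events_wide;
        CREATE TABLE events_wide AS
        SELECT e.event_id, e.event_type, u.user_id, u.country, p.plan_name
        FROM events e
        JOIN users u ON u.user_id = e.user_id
        JOIN plans p ON p.plan_id = u.plan_id;
    """)


def demo():
    conn = sqlite3.connect(":memory:")
    # Normalized source tables, as they might look in Postgres.
    conn.executescript("""
        CREATE TABLE plans (plan_id INTEGER PRIMARY KEY, plan_name TEXT);
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT, plan_id INTEGER);
        CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER, event_type TEXT);
        INSERT INTO plans VALUES (1, 'free'), (2, 'pro');
        INSERT INTO users VALUES (10, 'DE', 1), (11, 'US', 2);
        INSERT INTO events VALUES (100, 10, 'click'), (101, 11, 'click'), (102, 11, 'view');
    """)
    build_wide_table(conn)
    # The analytical query now runs against one table, with no JOIN at query time.
    return conn.execute(
        "SELECT plan_name, COUNT(*) FROM events_wide "
        "GROUP BY plan_name ORDER BY plan_name"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The maintenance cost I mentioned shows up as soon as a user changes plan or country: the wide table has to be rebuilt or patched, and that is the ETL you end up owning.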
For these more common scenarios, I think StarRocks (www.StarRocks.io) is a better choice. It's a Linux Foundation open-source project with single-table query speeds comparable to ClickHouse (you can check ClickBench) and strong multi-table JOIN performance, and it can directly query open data lakes.
> consider the cost of maintaining an OSS ClickHouse cluster
I mean... it is pretty straightforward: 40 to 60 lines of Terraform, plus Ansible with templates for the proper configs, fed with the IPs exported from Terraform so the nodes can see each other, and you are done.
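For anyone wondering what the template step amounts to, here is a rough sketch in Python rather than Jinja. It takes a list of node IPs (the kind of thing you would export from Terraform) and renders a ClickHouse `remote_servers` fragment. The cluster name, the IPs, and the one-shard-per-node layout are assumptions for illustration; port 9000 is ClickHouse's default native port, and older installs use `<yandex>` as the root tag instead of `<clickhouse>`.

```python
from xml.sax.saxutils import escape


def render_remote_servers(cluster_name, node_ips, port=9000):
    """Render a <remote_servers> config fragment, one shard per node.

    cluster_name and node_ips are hypothetical inputs, e.g. taken from
    Terraform outputs. cluster_name must be a valid XML tag name.
    """
    lines = ["<clickhouse>", "  <remote_servers>", f"    <{cluster_name}>"]
    for ip in node_ips:
        lines += [
            "      <shard>",
            "        <replica>",
            f"          <host>{escape(ip)}</host>",
            f"          <port>{port}</port>",
            "        </replica>",
            "      </shard>",
        ]
    lines += [f"    </{cluster_name}>", "  </remote_servers>", "</clickhouse>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example IPs only; in practice these come from your infrastructure state.
    print(render_remote_servers("events_cluster", ["10.0.0.1", "10.0.0.2"]))
```

Every node gets the same rendered file, which is why the nodes can all see each other once it is deployed. Replication, Keeper/ZooKeeper settings, and users are left out here and would need their own templates.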