Hacker News | HermitX's comments

Great blog, thanks for sharing!


Excellent content! It looks like the Iceberg + StarRocks open lakehouse architecture is highly effective.


Great article, thank you for sharing! I have a question I’d like to discuss with the author. Spark SQL is a great product and works perfectly for batch processing tasks. However, for handling ad hoc query tasks or more interactive data analysis tasks, Spark SQL might have some performance issues. If you have such workloads, I suggest trying data lake query engines like Trino or StarRocks, which offer faster speeds and a better query experience.


(Notion employee)

AWS Athena packages Trino. I’ve been using it for queries like “find all blocks that contain @-mentions”. It’s a great tool.


I believe Snowflake not only needs an open-source table format but also an open-source data lake query engine. Currently, Snowflake has too little presence in this area, and its stock price is reaching new lows.


Also, their AI/ML side of the house is a mess. :(


I suggest you try using StarRocks to query the data lake directly. I know many large companies, including Tencent and Pinterest, are doing this. StarRocks has a truly mature vectorized execution engine and a robust distributed architecture. It can provide you with impressive query performance.


“Specifically, online analytics workloads would be migrated to Druid/StarRocks”: I'm very interested in this part and look forward to learning more about it.


As long as the United States is the world’s leading superpower, there will always be a second place trying to challenge its position. Previously it was Japan, now it is China, and there will be other countries in the future.


When the times leave you behind, they won't even say goodbye.


Is ClickHouse a suitable engine for analyzing events? Absolutely, as long as you're analyzing a large table, its speed is definitely fast enough. However, you might want to consider the cost of maintaining an OSS ClickHouse cluster, especially when you need to scale up, as the operational costs can be quite high.

If your analysis in Postgres was based on multiple tables and required a lot of JOIN operations, I don't think ClickHouse is a good choice. In such cases, you often need to denormalize multiple data tables into one large table in advance, which means complex ETL and maintenance costs.

For these more common scenarios, I think StarRocks (www.StarRocks.io) is a better choice. It's a Linux Foundation open-source project, with single-table query speeds comparable to ClickHouse (you can check ClickBench) and strong multi-table join query speeds, plus it can directly query open data lakes.
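To make the denormalization point above concrete, here is a minimal sketch in plain Python of what "flattening multiple tables into one large table in advance" looks like. The table names and fields (`users`, `events`, `country`) are hypothetical and only for illustration; a real pipeline would do this in an ETL job, not in memory.

```python
# Hypothetical tables: names and fields are illustrative only.
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]
events = [
    {"event_id": 10, "user_id": 1, "action": "click"},
    {"event_id": 11, "user_id": 2, "action": "view"},
    {"event_id": 12, "user_id": 1, "action": "view"},
]


def denormalize(events, users):
    """Attach each event's user attributes ahead of query time,
    producing one wide table that needs no JOIN when queried."""
    users_by_id = {u["user_id"]: u for u in users}
    wide = []
    for e in events:
        # Events whose user is missing get country=None.
        u = users_by_id.get(e["user_id"], {})
        wide.append({**e, "country": u.get("country")})
    return wide


wide_events = denormalize(events, users)
```

Any later change to `users` means the wide table has to be rebuilt or patched, which is roughly where the ETL and maintenance cost comes from.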


> consider the cost of maintaining an OSS ClickHouse cluster

I mean... it is pretty straightforward. 40~60 line Terraform, Ansible with templates for the proper configs that get exported from Terraform so you can write the IPs so they can see each other, and you are done.

What else could you possibly need? Backing up is built into it with S3 support: https://clickhouse.com/docs/en/operations/backup#configuring...

Upgrades are a breeze: https://clickhouse.com/docs/en/operations/update

People insist that OMG MAINTENANCE I NEED TO PAY THOUSANDS FOR MANAGED is better, when in reality, it is not.


As long as there is greed in human nature, people like SBF will continue to emerge.

