Thats interesting, i don't see these as occupying the same space. Clickhouse is ...

FridgeSeal · on May 24, 2023

But CH is capable of the same “data warehousing” features that snowflake is. Which leaves snowflake as a slower, less capable, less open, and more expensive alternative.

Which brings me to the next point: I’m convinced the delineation between “data warehouse” and “olap” is largely a marketing move designed to segment the market along made up boundaries.

glogla · on May 25, 2023

Snowflake and ClickHouse are very different in their focus.

Snowflake is focused on enterprise customers. It has a lot of features focused on that, like very granular security and governance and data marketplace. There's also some non-enterprise features that ClickHouse lacks, like the ability to execute Python in database (so you can bring ML in).

But the biggest difference is that Snowflake is storage segregated architecture. Scaling Snowflake is done by running "alter warehouse resize" or something. You can also dedicate specific compute slices to specific users and scale them up and down as needed. And this is all managed for you.

If you want to run ClickHouse at scale, you have to run your own k8s, figure out how to manage persistent storage, figure out how to replicate your data, manage cluster replicated tables, etc. Once you outgrow single instance, things get exponentially more difficult - both for the admins and for the users.

Also, while ClickHouse can do joins and is getting better and better optimizer as we speak, and is probably faster than Snowflake for the same money on "single big table analytics" kind of workload, I would expect it to perform much worse in traditional analytics queries, like you would find in TPC-DS.

hodgesrm · on May 25, 2023

> If you want to run ClickHouse at scale, you have to run your own k8s, figure out how to manage persistent storage, figure out how to replicate your data, manage cluster replicated tables, etc. Once you outgrow single instance, things get exponentially more difficult - both for the admins and for the users.

This greatly overstates the difficulty of running ClickHouse as well as the current state of the market.

1. ClickHouse has a good Kubernetes operator written by Altinity that manages most of the basic Kubernetes operations. It's used to operate many thousands of ClickHouse clusters worldwide both self-managed environments as well as multiple SaaS offerings of ClickHouse. (Disclaimer: it's written by my company.)

2. If you don't want the trouble of running ClickHouse there are now multiple cloud vendors in every geographic region offering ClickHouse-as-a-Service. Among other things competition keeps prices reasonable and ensures plenty of choice for users.

There are real differences between Snowflake and ClickHouse but ease of operation is no longer one of them. For example one major difference between Snowflake and ClickHouse from a user perspective is the following: You can develop great Snowflake applications just with a knowledge of SQL whereas for ClickHouse you really have to know how it works inside.

zX41ZdbW · on May 26, 2023

ClickHouse allows to running Python code inside the database for ML.

See the presentation for examples: https://presentations.clickhouse.com/meetup74/ai/

atwong · on May 25, 2023

CH timeouts using joins on TPCH test data. https://celerdata.com/blog/clickhouse-vs.-starrocks-a-detail...

riku_iki · on May 25, 2023

> you have to run your own k8s, figure out how to manage persistent storage, figure out how to replicate your dat

I think they have option of standalone cluster, where all of this kinda easy to configure.

5Qn8mNbc2FNCiVV · on May 25, 2023

Self managing storage is never easy to configure, especially not when it's storage that you want to access in a timely manner

glogla · on May 25, 2023

Yeah, that would be my answer as well. I actually forgot to mention that - Snowflake and the like store data away from compute so no matter how you misconfigure clusters (though Snowflake isn't really that configurable) the data is safe. Messing up database that stores data locally means the data is gone - and that makes all operations like resizes and upgrades much more scary.

But of course the local storage is much faster. Tradeoffs.

I know ClickHouse Cloud uses S3 as well, but I don't know much about it, so I don't want to comment on it.

FridgeSeal · on May 31, 2023

ClickHouse supports separate compute and storage too. I was using it to query data in object storage the other week.

riku_iki · on May 25, 2023

> Self managing storage is never easy to configure, especially not when it's storage that you want to access in a timely manner

Signing for cloud infra also adds lots of complexity and risks.

sv123 · on May 25, 2023

We use both MS SQL and Snowflake heavily. There are clearly instances where having row based storage is appropriate, and also instances where columnar storage outperforms. All based on your workload and not just marketing.

FridgeSeal · on May 25, 2023

MSSQL is an OLTP based db (going to preclude discussion of its fancy column index stuff it’s capable of). OLTP db’s definitely, definitely have a different role.

I’m talking about the false difference between the likes of ClickHouse and Snowflake, where they’re both column oriented already. I’m asserting that the fundamental differences between “classic” column db’s and “data warehouses” is far less fundamental than the marketing would have us believe. Some of the db’s in this space have slightly different architectures and trade offs, and some deliberately operate at different scales, but they are built for, and operate in, basically the same purpose.

berkle4455 · on May 25, 2023

I don't think anyone disagrees with you. Clickhouse and Snowflake are both OLAP; neither are row-based (OLTP).

fuy · on May 25, 2023

There's columnar tables in SQL Server, though. Have you tried it? Would be interesting to compare to Snowflake.

FridgeSeal · on May 26, 2023

As useful and powerful as the columnar tables in mssql are, they’re not on the same level as a full columnar db.

ownagefool · on May 25, 2023

I think it's more borne of the lack of scaling capabilities in the traditional sql databases, and I guess a lack of capability in summarising data.

In reality, you can probably scale something like vitess pretty far, and then by adding your own summary tables on top, you're probably good for most usecases.

I'm not an expert on this level of the stack though, so I'm probably missing a whole bunch of context.

riku_iki · on May 24, 2023

> doing large distributed joins

but ch supports large distributed joins?..

quadrature · on May 25, 2023

Thanks, looks like i need to update my priors.

atwong · on May 25, 2023

not really. CH times out on TPCH join scenarios. https://celerdata.com/blog/clickhouse-vs.-starrocks-a-detail...

riku_iki · on May 26, 2023

that link doesn't provide much details how they try to test ch for joins, and if they tried to test it at all..

qoega · on May 27, 2023

I think atwong just promotes his product https://news.ycombinator.com/threads?id=atwong

berkle4455 · on May 24, 2023

ch cluster works just fine on large distributed joins

atwong · on May 25, 2023

not really. CH clusters timeout on join testing on TPCH. https://celerdata.com/blog/clickhouse-vs.-starrocks-a-detail...