1. It uses a cache on local SSDs ("instance store" in AWS terms) + page cache in memory.
2. This is a write-through cache. The data is written into cache and S3.
3. Writes to S3 are synchronous. The write is confirmed only after the data is accepted by S3, and the metadata is accepted by ClickHouse Keeper (see the sketch after this list).
4. The cache is ephemeral (not durable, not redundant, not essential for work).
5. The consistency requirements are fairly strict. It is still possible to get a stale read on a replica due to the sequentially consistent replication of the metadata, but that is unrelated to S3 or the cache.
6. It is expected to have low latency on reads, while latency on writes can be tolerated up to around one second.
7. We also tested S3 Express One Zone: https://aws.amazon.com/blogs/storage/clickhouse-cloud-amazon...
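A minimal sketch of the write path in points 2-3, assuming hypothetical `s3_put` and `keeper_commit` callables (these names and the cache layout are illustrative placeholders, not ClickHouse's actual internals):

```python
import os
import tempfile

def write_part(part_id: str, data: bytes, cache_dir: str, s3_put, keeper_commit):
    """Write-through: the part goes to both S3 and the local cache, but the
    write is acknowledged only after S3 and the metadata store (ClickHouse
    Keeper in the real system) have accepted it."""
    # 1. Synchronous upload to object storage (the durable copy).
    s3_put(key=f"parts/{part_id}", body=data)

    # 2. Commit metadata; only after this is the write confirmed to the client.
    keeper_commit(part_id)

    # 3. Populate the local SSD cache. Best-effort: the cache is ephemeral
    #    and not essential for correctness, so failures here are swallowed.
    try:
        with tempfile.NamedTemporaryFile(dir=cache_dir, delete=False) as f:
            f.write(data)
        os.replace(f.name, os.path.join(cache_dir, part_id))  # atomic publish
    except OSError:
        pass
```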
A table in ClickHouse is represented by a set of immutable data parts. A SELECT query acquires a snapshot of the currently active data parts and uses it for reading. An INSERT query creates a new data part (or several) and links it into the set. Background merge operations (aka compactions) create new data parts from existing ones and remove the old ones. The old data parts can still be used in a snapshot by some queries; they are ref-counted and will be removed later by a GC process. Updates are heavy (they are named "mutations"). Mutations create new data parts with modified data and remove the existing data parts, similar to merges. There are some optimizations, though: instead of making a full copy of the data, a mutation can link the existing data and only add a small patch on top of it, whether that is the row numbers of deleted records plus the added records, or a lazy expression to apply.
To summarize:
- every data part is immutable;
- inserts, selects, merges, and mutations work concurrently with no long locks;
- it resembles a lock-free data structure, and we could almost say it is lock-free, but... there are mutexes inside for short locks.
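To make the pattern concrete, here is a toy model in Python (not ClickHouse code; names and structure are invented for illustration): short mutex-protected sections, ref-counted parts, and snapshots that keep old parts readable until released.

```python
import threading

class PartSet:
    """Toy model of the immutable-parts pattern described above."""

    def __init__(self):
        self._lock = threading.Lock()  # short locks only
        self._active = {}              # part_id -> part data
        self._refs = {}                # part_id -> refcount

    def snapshot(self):
        """A SELECT acquires a snapshot of the currently active parts."""
        with self._lock:
            for pid in self._active:
                self._refs[pid] += 1
            return dict(self._active)

    def release(self, snap):
        """Drop a snapshot; unreferenced inactive parts become GC-able."""
        with self._lock:
            for pid in snap:
                self._refs[pid] -= 1

    def insert(self, pid, part):
        """An INSERT links a new immutable part into the set."""
        with self._lock:
            self._active[pid] = part
            self._refs[pid] = self._refs.get(pid, 0) + 1

    def merge(self, src_ids, new_pid, new_part):
        """A merge (or mutation) publishes a new part and unlinks the
        sources; in-flight snapshots can still read the old parts."""
        with self._lock:
            self._active[new_pid] = new_part
            self._refs[new_pid] = self._refs.get(new_pid, 0) + 1
            for pid in src_ids:
                self._active.pop(pid, None)
                self._refs[pid] -= 1

    def gc(self):
        """Remove parts that are inactive and no longer referenced."""
        with self._lock:
            dead = [p for p, c in self._refs.items()
                    if c == 0 and p not in self._active]
            for p in dead:
                del self._refs[p]
            return dead
```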
We've run experiments, but the usage of this tier is still being considered.
One of the proposals is:
- use it instead of local SSDs to simplify the cache;
- it has to be in every AZ where there are replicas; the caches will work independently in each AZ;
- so it might be expensive, but if we make replica initialization time close to zero and keep no local metadata at all, we can afford to run in a single AZ at any moment in time and still fail over to another AZ at the cost of losing the cache.
PS. Snowflake has virtual warehouses in a single AZ by default. ClickHouse Cloud has nodes in three AZs (prod tier) or two AZs (dev tier).
Interesting article, though I wish there were some more concrete numbers, particularly around latency - what's the latency on normal S3, what is it on S3 Express One Zone, and what latency/throughput is necessary for the kind of systems he's talking about? Some cost estimates would also be interesting, though that would be harder to figure out.
What does high latency mean in this case? I've been in mobile for 15 years and now find myself using S3 as a document backup space. I'm rather alarmed because I'm only seeing, idk, 200-400 ms of latency. But then again, I'm in Boston. Am I just lucky to be near US-East?
From anywhere, S3 used to (10+ years ago) feel slow enough for a person to notice the delay: 400-800 ms plus RTT ping time. But people were, and are, seeing results below the ~200 ms perception threshold, e.g., "Average latency per file was 72.3ms" -- though some requests still take a literal second.
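For anyone who wants their own numbers, here is one way to sample per-request GET latency with boto3 (the bucket and key names are placeholders; you'd need credentials and an object you own):

```python
import time
import statistics
import boto3

# Time repeated GETs of one small object to estimate per-request latency.
s3 = boto3.client("s3")
samples = []
for _ in range(50):
    t0 = time.perf_counter()
    obj = s3.get_object(Bucket="my-bucket", Key="small-object.bin")
    obj["Body"].read()  # include the time to drain the body
    samples.append((time.perf_counter() - t0) * 1000)

print(f"p50={statistics.median(samples):.1f}ms "
      f"p99={statistics.quantiles(samples, n=100)[98]:.1f}ms")
```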
SQL queries on S3 data are affected by read latencies. When you're doing huge aggregations and joins, say on Parquet files, the engine has to read the Parquet metadata multiple times and then read the relevant portions of the data. There are multiple reads going on, and every read incurs latency.

Implementing a caching layer really helps, but it doesn't come out of the box.
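A minimal sketch of such a read-through cache (illustrative only; real engines also cache byte ranges and handle eviction and invalidation, which this skips):

```python
import os
import hashlib
import boto3

class ReadThroughCache:
    """Serve repeated S3 reads (e.g. Parquet footers read many times per
    query) from local disk instead of paying S3 latency every time."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.s3 = boto3.client("s3")

    def get(self, bucket: str, key: str) -> bytes:
        path = os.path.join(
            self.cache_dir,
            hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest())
        if os.path.exists(path):       # cache hit: local SSD latency
            with open(path, "rb") as f:
                return f.read()
        body = self.s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        with open(path, "wb") as f:    # cache miss: fetch once, keep locally
            f.write(body)
        return body
```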
Because people are re-architecting their databases around cloud object storage, the new hotness is to try to get all your persistence there, with on-box storage acting just as a cache. This lets you offload all of the hard parts of persistence onto the AWS team.
It is sensor data from tens of thousands of distributed sensors. Each sensor sends a reading every few seconds. We have a live view in the frontend, which is why latency plays a role.
Keep in mind that "one zone" has durability implications:
> In the unlikely case of the loss or damage to all or part of an AWS Availability Zone, data in a One Zone storage class may be lost. For example, events like fire and water damage could result in data loss.
They're as bad as Big Pharma, but they take the nonsense name and make it unpronounceable. At least we can say the AWS name, even if it is just as meaningless.
Perhaps something like objectiveFS might be worth looking into, since it already creates a layer on top of S3, tries to address all this, and includes its own cache mechanism. No idea if its performance is still too inconsistent or its latency too high compared to the author's goals.