1. It uses a cache on local SSDs ("instance store" in AWS terms) + page cache in memory.
2. This is a write-through cache. The data is written into cache and S3.
3. Writes to S3 are synchronous. The write is confirmed only after the data is accepted by S3, and the metadata is accepted by ClickHouse Keeper (see the sketch after this list).
4. The cache is ephemeral (not durable, not redundant, not essential for work).
5. The consistency requirements are fairly strict. It is still possible to get a stale read on a replica due to the sequentially consistent replication of the metadata, but that is unrelated to S3 or the cache.
6. It is expected to have low latency on reads, while latency on writes can be tolerated up to around one second.
7. We also tested S3 Express One Zone: https://aws.amazon.com/blogs/storage/clickhouse-cloud-amazon...
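A minimal sketch of the write path in points 2-3, assuming hypothetical `s3_put` and `keeper_commit` callables (these names and the cache layout are illustrative placeholders, not ClickHouse's actual internals):

```python
import os
import tempfile

def write_part(part_id: str, data: bytes, cache_dir: str, s3_put, keeper_commit):
    """Write-through: the part goes to both S3 and the local cache, but the
    write is acknowledged only after S3 and the metadata store (ClickHouse
    Keeper in the real system) have accepted it."""
    # 1. Synchronous upload to object storage (the durable copy).
    s3_put(key=f"parts/{part_id}", body=data)

    # 2. Commit metadata; only after this is the write confirmed to the client.
    keeper_commit(part_id)

    # 3. Populate the local SSD cache. Best-effort: the cache is ephemeral
    #    and not essential for correctness, so failures here are swallowed.
    try:
        with tempfile.NamedTemporaryFile(dir=cache_dir, delete=False) as f:
            f.write(data)
        os.replace(f.name, os.path.join(cache_dir, part_id))  # atomic publish
    except OSError:
        pass
```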
A table in ClickHouse is represented by a set of immutable data parts. A SELECT query acquires a snapshot of the currently active data parts and uses it for reading. An INSERT query creates a new data part (or several) and links it into the set. Background merge operations (aka compactions) create new data parts from existing ones and remove the old ones. The old data parts can still be used in a snapshot by some queries; they are ref-counted and will be removed later by a GC process. Updates are heavy (they are named "mutations"). Mutations create new data parts with modified data and remove the existing data parts, similar to merges. There are some optimizations, though: instead of making a full copy of the data, a mutation can link the existing data and only add a small patch on top of it, whether that is the row numbers of deleted records plus the added records, or a lazy expression to apply.
To summarize:
- every data part is immutable;
- inserts, selects, merges, and mutations work concurrently with no long locks;
- it resembles a lock-free data structure, and we could almost say it is lock-free, but... there are mutexes inside for short locks.
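To make the pattern concrete, here is a toy model in Python (not ClickHouse code; names and structure are invented for illustration): short mutex-protected sections, ref-counted parts, and snapshots that keep old parts readable until released.

```python
import threading

class PartSet:
    """Toy model of the immutable-parts pattern described above."""

    def __init__(self):
        self._lock = threading.Lock()  # short locks only
        self._active = {}              # part_id -> part data
        self._refs = {}                # part_id -> refcount

    def snapshot(self):
        """A SELECT acquires a snapshot of the currently active parts."""
        with self._lock:
            for pid in self._active:
                self._refs[pid] += 1
            return dict(self._active)

    def release(self, snap):
        """Drop a snapshot; unreferenced inactive parts become GC-able."""
        with self._lock:
            for pid in snap:
                self._refs[pid] -= 1

    def insert(self, pid, part):
        """An INSERT links a new immutable part into the set."""
        with self._lock:
            self._active[pid] = part
            self._refs[pid] = self._refs.get(pid, 0) + 1

    def merge(self, src_ids, new_pid, new_part):
        """A merge (or mutation) publishes a new part and unlinks the
        sources; in-flight snapshots can still read the old parts."""
        with self._lock:
            self._active[new_pid] = new_part
            self._refs[new_pid] = self._refs.get(new_pid, 0) + 1
            for pid in src_ids:
                self._active.pop(pid, None)
                self._refs[pid] -= 1

    def gc(self):
        """Remove parts that are inactive and no longer referenced."""
        with self._lock:
            dead = [p for p, c in self._refs.items()
                    if c == 0 and p not in self._active]
            for p in dead:
                del self._refs[p]
            return dead
```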
We've run experiments, but the usage of this tier is still being considered.
One of the proposals is:
- use it instead of local SSDs to simplify the cache;
- it has to be in every AZ where there are replicas; the caches will work independently in each AZ;
- so it might be expensive, but if we make replica initialization time close to zero and keep no local metadata at all, we can afford to run in a single AZ at any moment in time and still fail over to another AZ at the cost of losing the cache.
PS. Snowflake has virtual warehouses in a single AZ by default. ClickHouse Cloud has nodes in three AZs (prod tier) or two AZs (dev tier).
Interesting article, though I wish there were some more concrete numbers, particularly around latency - what's the latency on normal S3, what is it on S3 Express One Zone, and what latency/throughput is necessary for the kind of systems he's talking about? Some cost estimates would also be interesting, though that would be harder to figure out.
What does high latency mean in this case? I've been in mobile for 15 years and now find myself using S3 as a document backup space. I'm rather alarmed because I'm only seeing, idk, 200-400 ms of latency. But then again, I'm in Boston. Am I just lucky to be near US-East?
From anywhere, S3 used to (10+ years ago) feel slow enough for a person to notice the delay: 400-800 ms plus RTT ping time. But people were, and are, seeing results below the ~200 ms perception threshold, e.g., "Average latency per file was 72.3ms" -- though some requests still take a literal second.
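For anyone who wants their own numbers, here is one way to sample per-request GET latency with boto3 (the bucket and key names are placeholders; you'd need credentials and an object you own):

```python
import time
import statistics
import boto3

# Time repeated GETs of one small object to estimate per-request latency.
s3 = boto3.client("s3")
samples = []
for _ in range(50):
    t0 = time.perf_counter()
    obj = s3.get_object(Bucket="my-bucket", Key="small-object.bin")
    obj["Body"].read()  # include the time to drain the body
    samples.append((time.perf_counter() - t0) * 1000)

print(f"p50={statistics.median(samples):.1f}ms "
      f"p99={statistics.quantiles(samples, n=100)[98]:.1f}ms")
```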
SQL queries on S3 data are affected by read latencies. When you're doing huge aggregations and joins, say on Parquet files, the engine has to read the Parquet metadata multiple times and then read the relevant portions of the data. There are multiple reads going on, and every read incurs latency.

Implementing a caching layer really helps, but it doesn't come out of the box.
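A minimal sketch of such a read-through cache (illustrative only; real engines also cache byte ranges and handle eviction and invalidation, which this skips):

```python
import os
import hashlib
import boto3

class ReadThroughCache:
    """Serve repeated S3 reads (e.g. Parquet footers read many times per
    query) from local disk instead of paying S3 latency every time."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.s3 = boto3.client("s3")

    def get(self, bucket: str, key: str) -> bytes:
        path = os.path.join(
            self.cache_dir,
            hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest())
        if os.path.exists(path):       # cache hit: local SSD latency
            with open(path, "rb") as f:
                return f.read()
        body = self.s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        with open(path, "wb") as f:    # cache miss: fetch once, keep locally
            f.write(body)
        return body
```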
Because people are re-architecting their databases around cloud object storage, the new hotness is to try to get all your persistence there, with on-box storage acting just as a cache. This lets you offload all of the hard parts of persistence onto the AWS team.
It is sensor data from tens of thousands of distributed sensors. Each sensor sends a reading every few seconds. We have a live view in the frontend, which is why latency plays a role.
Keep in mind that "one zone" has durability implications:
> In the unlikely case of the loss or damage to all or part of an AWS Availability Zone, data in a One Zone storage class may be lost. For example, events like fire and water damage could result in data loss.
They're as bad as Big Pharma, but they take the nonsense name and make it unpronounceable. At least we can say the AWS name, even if it is just as meaningless.
Perhaps something like objectiveFS might be worth looking into, since it already creates a layer on top of S3, tries to address all this, and includes its own cache mechanism. No idea if its performance is still too inconsistent or its latency too high compared to the author's goals.