
Yep. It works similarly with Google Cloud Storage buckets. It seems the indexing function they use for splitting/distributing/sharding access looks at your object keys and finds a common prefix to shard on.

The problem with a date-based key like the one you used (which is very common) is that reads tend to cluster on the same dates. For data analysis, for example, you read all the files from one day or week, not files randomly distributed across time. All those files share the same prefix and end up in the same shard, which hurts performance until the load gets high enough that Google splits that index range and starts distributing your data across other shards.

For this reason they recommend planning your key names ahead of time and breaking up that prefix with some sort of random hash at a reasonable position in the key:

https://cloud.google.com/storage/docs/request-rate#naming-co...
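A rough sketch of what that could look like in Python. The `hashed_name` helper, the 6-character hash length, the choice of SHA-256, and the sample key layout are all my own assumptions for illustration, not something taken from the Google docs:

```python
import hashlib
import os.path


def hashed_name(key: str, hash_len: int = 6) -> str:
    """Prepend a short hex digest of the key (hypothetical helper).

    The digest is derived from the key itself, so the same key always
    maps to the same object name and can be recomputed at read time.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:hash_len]
    return f"{digest}-{key}"


# Hypothetical date-based keys for a single day.
keys = [f"2021-03-14/part-{i:04d}.csv" for i in range(4)]
hashed = [hashed_name(k) for k in keys]

# Plain date keys: the whole day shares one long prefix.
print(repr(os.path.commonprefix(keys)))

# Hashed keys: the shared prefix is usually empty or a character or two.
print(repr(os.path.commonprefix(hashed)))
```

The trade-off is that you can no longer list a whole day with a single prefix query; you have to rebuild each name from the original key, or keep the list of keys somewhere else.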




It would be nice if S3 provided similar public guidance. For instance:

> Adding a random string after a common prefix still allows auto-scaling to work, but…

No way to know if that's true of S3's algorithm too without them revealing it.


Yep, the first paragraph of a performance tip [0] seems to hint at something, but it doesn't specify how it chooses prefixes, how many prefixes it shards, or anything...

  0: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html



