The article discusses the specific use case of serverless computing (e.g., AWS Lambda) and how a central database doesn't always work well with apps built in a serverless fashion.
I was immediately interested in this post because 6-7 years ago I worked on this very problem: I needed to ingest a set of complex hierarchical files that could change at any time, and I needed to "query" them to extract particular information. FaaS is expensive for computationally heavy tasks, and it also didn't make sense to load big XML files and parse them every time I needed to do a lookup in any instance of my Lambda function.
My solution was to have a central function on a timer that read and parsed the files every couple of minutes, loaded the data into a SQLite database, indexed it, and put the file in S3.
Now my functions just downloaded the file from S3, if it was newer than the local copy or on a cold start, and did the lookup. Blindingly fast and no duplication of effort.
One of the things that is not immediately obvious about Lambda is that it has a local /tmp directory that you can read from and write to. Also, the Python runtime includes SQLite, so there's no need to upload anything besides your function code.
I'm excited that work is going on that might make such solutions even faster; I think it's a very useful pattern for distributed computing.
We have that issue at work, though I solved it by including the SQLite database within the container image that we use. We then deploy the new container image (with the same code as before, but with a revised database file) at most every fifteen minutes.
This gives you an atomic point at which you are 100% confident all instances are using the same database, and by provisioning concurrency, you can also avoid a "thundering herd" of instances all fetching from the file on S3 at startup (which can otherwise lead to throttling).
Of course, that's only feasible if it's acceptable that your data can be stale for some number of minutes, but if you're caching the way you are, and periodically checking S3 for an updated database, it probably is.
> "thundering herd" of instances all fetching from the file on S3 at startup (which can otherwise lead to throttling).
Has anyone actually seen "thundering herd" problems with S3, including throttling?
I think S3 is advertised as having no concurrent connection limit and supporting at least 5,500 GETs per second (per "prefix", though I'm unclear what that means exactly in practice). I don't think S3 ever applies intentional throttling, although of course if you exceed its capacity to deliver data you will see "natural" throttling.
Do you have a fleet big enough that you might be exceeding those limits, or have people experienced problems even well under these limits, or is it just precautionary?
I asked the S3 team what “prefix” meant at reinvent, and my current understanding is “whatever starting length of key gives a reasonable cardinality for your objects”.
So if your keys are 2024/12/03/22-45:24 etc., I would expect the prefix to be the first 7 characters. If your keys are UUIDs, I'd assume the first two or three. For ULIDs, I'd assume the first 10. I think there's a function that does statistical analysis on key samples to figure out reasonable sharding.
Yep. Works similarly with Google Cloud Storage buckets. It seems like the indexing function they use for splitting/distributing/sharding access looks at your object keys and finds a common prefix to do this.
The problem with a date-based key like the one you used (which is very common) is that if you read a lot of files that tend to be from the same date (for example, for data analysis you read all the files from one day or week, not files randomly distributed), all those files share the same prefix and land in the same shard, reducing performance until the load is so high that Google splits that index into parts and begins to distribute your data across other shards.
For this reason they recommend thinking about your key names beforehand and splitting that prefix with some sort of random hash in a reasonable location of your key.
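A minimal sketch of that recommendation; the four-hex-character shard length is my guess for illustration, not a documented value:

```python
import hashlib


def sharded_key(key: str, shard_chars: int = 4) -> str:
    """Prefix a key with a short deterministic hash so that date-ordered
    keys spread across index shards instead of piling onto one prefix."""
    shard = hashlib.md5(key.encode("utf-8")).hexdigest()[:shard_chars]
    return f"{shard}/{key}"


# Date-based keys that would otherwise share one prefix now scatter:
# "2024/12/03/a.json" -> "<4 hex chars>/2024/12/03/a.json"
```

The hash is deterministic, so you can still compute the full key from the logical name when reading; you only lose the ability to list objects in date order with a single prefix scan.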
Yep, they seem to hint at something in the first paragraph of a performance tip [0], but it doesn't specify how it chooses prefixes, how many prefixes it shards into, or anything else...
I have never seen this explained, so thank you! Sounds like it's kind of "up to S3 and probably not predictable by you" -- which at least explains why it wasn't clear!
If you don't have "a lot" of keys, then you probably have only one prefix, maybe? Without them documenting the target order of magnitude of their shards?
I would assume so, the extreme case being just one key, which of course has only one partition. But see https://youtu.be/NXehLy7IiPM (2024 Reinvent S3 deep dive) - there's still replication happening on single objects. So it's still sort of sharded, but I do think key partitions exist where groups of keys share choke points based on sort order.
Sorry, the throttling was at the AWS Lambda layer, not S3. We were being throttled because we'd deploy a new container image and suddenly thousands of new containers were all simultaneously trying to pull the database file from S3.
We aim to return a response in the single digit milliseconds and sometimes get tens of thousands of requests per second, so even if it only takes a second or two to fetch that file from S3, the request isn't getting served while it's happening, and new requests are coming in.
You very quickly hit your Lambda concurrency limit and get throttled just waiting for your instances to fetch the file, even though logically you're doing exactly what you planned to.
By having the file exist already in the container image, you lean on AWS's existing tools for a phased rollout to replace portions of your deployment at a time, and every one is responding in single digit milliseconds from its very first request.
EDIT: The same technique could be applied for other container management systems, but for stuff like Kubernetes or ECS, it might be simpler to use OP's method with a readiness check that only returns true if you fetched the file successfully. And maybe some other logic to do something if your file gets too stale, or you're failing to fetch updates for some reason.
That's true. I prefer this approach because it removes that additional thing (the request to S3) that can be slow or fail at runtime. Or "initialization" time, I guess, depending on how you look at it.
Yes, I've been throttled many times by S3. My largest database is ingesting ~5PB/day and that turns into a lot of files in S3. At one point we changed our S3 key scheme to not have hashes up front, which unlocked some simplicity in control plane operations like deleting old files; we did this on the strength of the announcement from AWS that you no longer needed to get clever with prefixes.
This was incorrect at our scale, and we had to switch back.
I wrote a tool to handle micro blobs specifically because we were being heavily rate limited by S3 for both writes and reads. We got about 3k/s per bucket before S3 rate limiting started kicking in hard.
Granted, we also used said tool to bundle objects together in a way that required zero state to track, so that we could fetch them as needed cheaply and efficiently, so it wasn't a pure S3 issue.
Interesting, thanks! PUT is advertised at 3500/s, so with a combo load, you were at least within range of advertised limits. I have not approached that scale so didn't know, it was a real question!
Yeah, I was processing a bunch of Iceberg catalog data; it was pretty trivial to get to this point on both PUTs and GETs with our data volume. I was doing 400,000 requests/min, and of course my testing was writing to one prefix :)
I actually versioned my database file - I had a small metadata table with version number and creation time.
Then in the output from each of my other functions, I included the database version number. So all my output could be subsequently normalized by re-running the same input versus an arbitrary version of the database file.
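That version stamp can live in a small metadata table inside the same SQLite file. A minimal sketch, with table and column names of my own invention:

```python
import sqlite3


def build_db(path, rows, version):
    """Build the lookup database and stamp it with a version number
    and creation time (names here are illustrative)."""
    conn = sqlite3.connect(path)
    with conn:  # commits the whole build as one transaction
        conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata (version INTEGER, created_at TEXT)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS lookup (key TEXT PRIMARY KEY, value TEXT)"
        )
        conn.execute("DELETE FROM metadata")
        conn.execute(
            "INSERT INTO metadata VALUES (?, datetime('now'))", (version,)
        )
        conn.executemany("INSERT OR REPLACE INTO lookup VALUES (?, ?)", rows)
    conn.close()


def db_version(path):
    """Read back the version, e.g. to tag each function's output with it."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT version FROM metadata").fetchone()[0]
    finally:
        conn.close()
```

Each downstream function can then call `db_version()` once after loading the file and include the result in its output.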
> One of the things that is not immediately obvious from Lambda is that it has a local /tmp directory that you can read from and write to.
The other big thing a lot of people don't know about Python on Lambda is that your global scope is also persisted for that execution context's lifetime like /tmp is. I ran into issues at one point with Lambdas that processed a high volume of data getting intermittent errors connecting to S3. An AWS engineer told me to cache my boto3 stuff (session, client, resources, etc.) in the global namespace, and that solved the problem overnight.
Generally people ignore the per-PUT and per-GET pricing on S3, along with the higher latency, since it's a "global" service. If your objects are small, you're almost always better off using DynamoDB, as its GET pricing and latency are far more favorable, as long as you don't mind the region dependency or a multi-region setup.
> Now my functions just downloaded the file from S3, if it was newer than the local copy
If you have strong consistency requirements, this doesn't work. Synchronizing clocks reliably between different servers is surprisingly hard, so you might end up working with stale data. It might work for use cases that can accept eventual consistency.
If you have strong consistency requirements, then it doesn't work by the very nature of making multiple copies of the database. Even if the clocks are perfect. (Though the clocks are probably close enough that it doesn't matter.)
One of the announcements from AWS this year at Re:invent is that they now can guarantee that the instances clocks are synced within microseconds of each other. Close enough that you can rely on it for distributed timekeeping.
I don't really know if that matters for this use case. Just by the very nature of source_data -> processing -> dest_data taking nonzero time, anything consuming dest_data must already be tolerant of some amount of lag. And the way it's coded guarantees you can never observe dest_data going new -> old -> new.
Wouldn't ETag version numbers also work? Or just having a .json with version metadata next to the db blob? No need to sync clocks. Just GET the small db-ver.json and compare version details?