Exactly. If I were asked this question during an interview, the first thing I'd say is "why should the client bother with anything more complex than jittered exponential backoff?"
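Something like this is usually enough (a minimal sketch; call_api is a stand-in for whatever request you're actually making):

    import random
    import time

    def with_backoff(call_api, max_attempts=5, base=0.5, cap=30.0):
        # retry with "full jitter" exponential backoff
        for attempt in range(max_attempts):
            try:
                return call_api()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # sleep a random amount, up to the (capped) exponential bound
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))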
I've dealt with backends that refresh a CSRF token on each valid request and return it in the response as a cookie. In those cases a solution like this may be needed. Not optimal, but we don't always have control over the backends we use, especially when they're provided by a third party.
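Since every request has to consume the token that the previous response set, you're forced to serialize. A sketch of the shape of it, using requests (the cookie and header names here are made up):

    import threading
    import requests

    class RotatingTokenClient:
        # each response rotates the CSRF token, so requests can't overlap
        def __init__(self, base_url):
            self.base_url = base_url
            self.session = requests.Session()  # cookie jar holds the latest token
            self.lock = threading.Lock()       # one in-flight request at a time

        def post(self, path, **kwargs):
            with self.lock:
                token = self.session.cookies.get("csrf_token")  # hypothetical name
                headers = {"X-CSRF-Token": token} if token else {}
                # the response's Set-Cookie rotates the token for the next call
                return self.session.post(self.base_url + path, headers=headers, **kwargs)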
I've had to implement this exact logic at work, because we have to talk to devices using Modbus TCP, where a lot of devices only support having one active request per connection at a time. One device we talk to only supports having 5 active connections to it.
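The shape of the fix is pretty simple even when the plumbing isn't (a sketch; make_connection and conn.send stand in for the actual Modbus client calls):

    import queue

    class DevicePool:
        # at most `size` connections, and one in-flight request per connection
        def __init__(self, make_connection, size=5):
            self._idle = queue.Queue()
            for _ in range(size):
                self._idle.put(make_connection())

        def request(self, pdu):
            conn = self._idle.get()    # blocks until some connection is idle
            try:
                return conn.send(pdu)  # exactly one active request on this connection
            finally:
                self._idle.put(conn)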
I've dealt with low-limit APIs, and I've always considered the better approach to be a proxy, rather than trusting the clients to manage, limit, and coordinate themselves.
Then we send all traffic to proxy.service instead of canon.service, and implement queuing, rate limiting, caching, etc. on the proxy.
This shields the weak part of the system from the clients.
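Even a crude proxy buys you a lot. A sketch of a serializing version, stdlib only (the upstream URL is a placeholder, and real queuing/caching logic would go where the semaphore is):

    import threading
    import urllib.request
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    UPSTREAM = "http://canon.service"      # placeholder
    GATE = threading.BoundedSemaphore(1)   # serialize; raise the value to allow more

    class Proxy(BaseHTTPRequestHandler):
        def do_GET(self):
            with GATE:  # clients queue here instead of piling onto the backend
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    body = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), Proxy).serve_forever()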
I guess I hadn't considered the existence of generic TCP proxies.
And I'd probably have concerns around how much latency it would introduce in an environment where we have some requirements on the rate of data collection.
That said, our service *does* act as a kind of proxy for a few protocols.
FWIW I’ve been given basically this exact requirement by a partner with a crappy API.
We’d get on calls with them and they’d be like “you can’t do multithreading!” We eventually parsed out that what they literally meant was that we could only make a single request to their API at a time. We had to integrate with them, and they weren’t going to fix it on their side.
(Our solution ended up being a lot more complicated than this, as we had multiple processes across multiple machines that were potentially making concurrent requests.)
> Not actually that hard with a redis lock or any database (Postgres has a specific lock for this but you could also just use a record in a table)
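(For context, the Postgres lock in question is an advisory lock; the whole pattern is roughly this, assuming psycopg2 and some call_api() that talks to the fragile service:)

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    conn.autocommit = True

    def call_api_exclusively(call_api, lock_id=42):
        with conn.cursor() as cur:
            cur.execute("SELECT pg_advisory_lock(%s)", (lock_id,))  # blocks until held
            try:
                return call_api()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (lock_id,))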
Redis is just another SPOF, and so is Postgres without fiddly third party extensions (that are pretty unreliable in practice, IME). I'm talking about something truly distributed.
What, you need something truly "internet-scale" to make sure your thousands of clients can hit, sequentially, that one faulty API? Would you really be more concerned about Redis failure rates than said API's failure rates?
If you get into that situation then it's probably because that API is critical and irreplaceable (otherwise you wouldn't be tolerating its problems), so you really don't want to get stuck and be unable to query it. And if you can tolerate a SPOF then there's no reason to bring Redis/Postgres into the picture, you might as well just have a single server doing it.
Plus it's just good practice that I'd want to be following anyway. Once you get into the habit, it doesn't really cost much to design the dataflow right up-front, and it can save you from getting trapped down the line when it's much harder to fix things. Especially for an interview-type situation, why not design it right?
Does a truly distributed solution have no additional cost at all?
To be honest, for me, in an interview-type situation, if you insist that Redis is the problem in that scenario - you would have failed the interview (the interview is never one-way, interviewers can fail it too).
> Does a truly distributed solution have no additional cost at all?
If you literally just drop in etcd or Zookeeper rather than Redis and then develop in the same way then I'd say there's no additional cost to doing that. (I mean sure if you dig hard enough you can always find a way in which solution A is worse than solution B - e.g. most things have worse latency than Redis - but in this scenario the latency of the external API is going to make that irrelevant). Of course if you're just running those in single-node mode and developing against them without thinking about the distributed issues then you've still got plenty of ways to shoot yourself in the foot, but it's a small step in the right direction.
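Concretely, the drop-in is about this much code (a sketch assuming the python-etcd3 client; the lock name and TTL are arbitrary):

    import etcd3

    etcd = etcd3.client(host="etcd.service")  # placeholder host

    def call_api_exclusively(call_api):
        # the lease TTL means a crashed holder releases the lock automatically
        with etcd.lock("flaky-api", ttl=30):
            return call_api()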
Developing more fully distributed from day 1 requires discipline that takes time to learn, but I'm not convinced that it's actually slower - I'd compare it to e.g. using a strongly typed language, where initially you spend a lot of time bouncing off the guardrails, but over time you adapt yourself and can be productive very rapidly on new projects.
> To be honest, for me, in an interview-type situation, if you insist that Redis is the problem in that scenario - you would have failed the interview (the interview is never one-way, interviewers can fail it too).
Interesting - to me Redis in a system design is very often a case of over-architecting. It's easy to use and programmers enjoy working with it, but very often it isn't letting you do anything you couldn't do without it, and while it can speed things up, I see a lot of cases where the thing it speeds up is something that was already fast enough.
TBH I didn't communicate that clearly - my point was not "Redis in particular", but "whatever you already have at hand, for this usecase". Could also be Postgres or another SQL server.
1 etcd pod doesn't give you "no SPOF", you need 3, and then you need them on multiple VMs (or physical machines if you're not on the cloud/not in k8s), and then the cluster needs to be multi-AZ, and if you're really serious about the "no spof" that may mean geo-redundancy too... come on, just the deployment costs alone are significant.
> my point was not "Redis in particular", but "whatever you already have at hand, for this usecase". Could also be Postgres or another SQL server.
But if you're in the habit of using HA-capable systems then whatever you have to hand will be HA-capable, and so there won't really be any additional cost to using that.
And again, I think there's a real antipattern where people take a single-server application and then claim they've made it fault tolerant by making it run on multiple hosts, but it's still relying on a single DB server. In my experience that doesn't actually improve reliability any (at least not if you've got a good deployment process for your single-server application) and it complicates your architecture to no real benefit. (Indeed, frankly, I think a lot of developers reach for an external database because they have no other idea how to store data from their application, when using embedded sqlite/hbase or - shudder! - the local filesystem, would let them use much simpler architecture and not really reduce the actual reliability of the system).
> 1 etcd pod doesn't give you "no SPOF"
No, but it gives you a clear path to removing your SPOF when the need arises. Which is much harder if you've built your system on Redis.
All the systems mentioned in this discussion are HA-capable (even Redis, for some usecases, is perfectly HA-capable; typically for a distributed lock it isn't appropriate, but then again, for the scenario under discussion, you don't need a perfectly safe distributed lock so it would work just fine).
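i.e. the plain single-instance pattern is enough here (a sketch with redis-py, deliberately not Redlock; a rare double-acquire just means one extra concurrent request to the API):

    import time
    import uuid
    import redis

    r = redis.Redis()  # placeholder connection

    def call_api_exclusively(call_api, ttl=30):
        token = str(uuid.uuid4())
        while not r.set("flaky-api-lock", token, nx=True, ex=ttl):
            time.sleep(0.1)  # someone else holds the lock; wait our turn
        try:
            return call_api()
        finally:
            # best-effort release: only delete the lock if we still own it
            if r.get("flaky-api-lock") == token.encode():
                r.delete("flaky-api-lock")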
The more interesting question is not whether a system is HA-capable, it's whether the system is appropriate for the job that's required of it (given said system weaknesses & strengths, plus the specific job needs). And my argument was that both Redis and Postgres were fine, for the job that was described. In an interview situation I want to see that my interviewer is capable of thinking through particular situations and having a good honest debate about strengths and weaknesses of a proposed solution _for a proposed problem_ - not just pushing their preferred solution as dogma. In many business scenarios it's fine & correct to architect systems as "HA by default" but in interview situations we're debating hypotheticals, and I am going to judge you based on the hypothetical at hand, not based on your day-to-day job, because I don't know what your day-to-day job is (and it's not what's being discussed).
> All the systems mentioned in this discussion are HA-capable (even Redis, for some usecases, is perfectly HA-capable
It really isn't, outside of some stretched definition. Nor is Postgres without third-party extensions (that come with significant issues in my experience).
> The more interesting question is not whether a system is HA-capable, it's whether the system is appropriate for the job that's required of it (given said system weaknesses & strengths, plus the specific job needs).
I used to believe this kind of thing, but I've come around to the opposite; actually rather than carefully considering the strengths and weaknesses of any given system in the context of a given job, it's a lot more efficient to have some simple heuristics that are easy to evaluate for which systems are good or bad, and avoid even considering bad systems. Of course occasionally you do need to dive into a full evaluation and pick your poison, but if a task doesn't have very specific requirements you avoid a lot of headache by just dismissing most of the possibilities out of hand.
> And my argument was that both Redis and Postgres were fine, for the job that was described.
But they're not contributing anything to the job that's described! Adding an extra moving part to the system that doesn't actually achieve anything is a much worse error than choosing the wrong system IMO.
As a cache, it is. All you need for a cache (if you're using it correctly, as a cache) is for the replica to be up, which it can be. Azure even gives you out-of-the-box multi-AZ replicated Redis with 99.99% promised uptime (and based on previous experience, I'd say they deliver on this promise).
> Adding an extra moving part
I specifically mentioned that I considered those good solutions for the problem at hand only if you already have them and don't need to add them; that's their strength (lots of systems already use Redis or a SQL database, e.g. Postgres - but really, anything would work just fine for the task at hand).
Because you only control the client, but you need to integrate with that broken server of a third party. It’s a pretty common situation to find oneself in.
Yup. I've even found myself in situations where the owner of the third-party service is another team or department within the organization I'm working for or partnering with. Oftentimes the product/project people on our team try to make it a business issue with the partner, only to find that they don't have the leverage to effect a fix: they get told that the service doesn't offer the SLA we require, or they hear back some hilarious quote like six weeks of development that can't begin until the next quarter. Meanwhile, our feature or product has to launch by the end of the current sprint or quarter.
What happens when the unstoppable force meets the immovable object? The unstoppable force works over the weekend to implement a store-and-forward solution.
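The weekend version usually looks something like this (a sketch using sqlite as the durable outbox; deliver() stands in for whatever actually calls the partner):

    import sqlite3
    import time

    db = sqlite3.connect("outbox.db")
    db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

    def enqueue(payload):
        with db:  # commit immediately so the request survives a crash
            db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

    def forward_forever(deliver):
        while True:
            row = db.execute("SELECT id, payload FROM outbox ORDER BY id LIMIT 1").fetchone()
            if row is None:
                time.sleep(1)
                continue
            try:
                deliver(row[1])  # may raise; the row stays queued if it does
                with db:
                    db.execute("DELETE FROM outbox WHERE id = ?", (row[0],))
            except Exception:
                time.sleep(5)    # crude backoff before retrying the same row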
A server that has a spike of load and can't cope with it is pretty normal, and hard to characterize as "broken".
When the client(s) can send more work than the server can handle there are three options:
1 - Do nothing; server drops requests.
2 - Server notifies the clients (429 in HTTP) and the client backs off (exponential backoff with jitter).
3 - Put the client requests in a queue.
The interview question/solution does 2 in a poor way (just adding a pause in the client), and does 3 in the client too, when usually this is done in an intermediate component (RMQ/Kafka/Redis/DB/whatever).
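Doing 3 in the client is basically a single worker draining a queue (a sketch; send_request stands in for the actual HTTP call):

    import queue
    import threading

    def start_worker(send_request):
        # one worker thread == at most one in-flight request, in arrival order
        jobs = queue.Queue()

        def worker():
            while True:
                request, on_done = jobs.get()
                try:
                    on_done(send_request(request))
                finally:
                    jobs.task_done()

        threading.Thread(target=worker, daemon=True).start()
        return jobs  # producers call jobs.put((request, callback))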
Because it was written in ALGOL 60, none of the mainframe devs are willing to touch that code, and the dozen other clients probably depend on the broken functionality.