> Not actually that hard with a redis lock or any database (Postgres has a specific lock for this but you could also just use a record in a table)
Redis is just another SPOF, and so is Postgres without fiddly third-party extensions (that are pretty unreliable in practice, IME). I'm talking about something truly distributed.
What, you need something truly "internet-scale" just so your thousands of clients can hit that one faulty API sequentially? Would you really be more concerned about Redis's failure rate than about that API's failure rate?
If you get into that situation then it's probably because that API is critical and irreplaceable (otherwise you wouldn't be tolerating its problems), so you really don't want to get stuck and be unable to query it. And if you can tolerate a SPOF then there's no reason to bring Redis/Postgres into the picture, you might as well just have a single server doing it.
Plus it's just good practice that I'd want to be following anyway. Once you're in the habit, it doesn't really cost much to design the dataflow right up front, and it can save you from getting trapped down the line when things are much harder to fix. Especially for an interview-type situation, why not design it right?
Does a truly distributed solution have no additional cost at all?
To be honest, for me, in an interview-type situation, if you insist that Redis is the problem in that scenario - you would have failed the interview (the interview is never one-way, interviewers can fail it too).
> Does a truly distributed solution have no additional cost at all?
If you literally just drop in etcd or Zookeeper rather than Redis and then develop in the same way then I'd say there's no additional cost to doing that. (I mean sure if you dig hard enough you can always find a way in which solution A is worse than solution B - e.g. most things have worse latency than Redis - but in this scenario the latency of the external API is going to make that irrelevant). Of course if you're just running those in single-node mode and developing against them without thinking about the distributed issues then you've still got plenty of ways to shoot yourself in the foot, but it's a small step in the right direction.
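For what it's worth, here is a rough sketch of what "develop in the same way" could look like: the calling code depends only on a small lease-lock interface, so the backend can be swapped later. Everything named here (`LeaseLock`, `InMemoryLeaseLock`, `call_exclusively`, the `"faulty-api"` key) is hypothetical and is not the API of Redis, etcd, or ZooKeeper; the in-memory backend exists only so the example runs on its own, and a real backend would still have to deal with the distributed issues mentioned above.

```python
import threading
import time
from typing import Protocol


class LeaseLock(Protocol):
    """Hypothetical interface; a real backend adapter would implement it."""

    def acquire(self, name: str, owner: str, ttl: float) -> bool: ...
    def release(self, name: str, owner: str) -> bool: ...


class InMemoryLeaseLock:
    """Single-process stand-in, used only to make the sketch runnable."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._mutex = threading.Lock()
        self._leases = {}  # name -> (owner, expiry time)

    def acquire(self, name, owner, ttl):
        now = self._clock()
        with self._mutex:
            held = self._leases.get(name)
            # Refuse only if someone else holds a lease that has not expired.
            if held is not None and held[1] > now and held[0] != owner:
                return False
            self._leases[name] = (owner, now + ttl)
            return True

    def release(self, name, owner):
        with self._mutex:
            held = self._leases.get(name)
            # Only the current holder may release; a stale owner gets False.
            if held is None or held[0] != owner:
                return False
            del self._leases[name]
            return True


def call_exclusively(lock: LeaseLock, owner: str, call, ttl: float = 30.0):
    """Run call() while holding the lease; return None if the lease is busy."""
    if not lock.acquire("faulty-api", owner, ttl):
        return None
    try:
        return call()
    finally:
        lock.release("faulty-api", owner)
```

The TTL is there so a crashed holder doesn't block everyone forever. It also means the lock is not perfectly safe: a holder that stalls past its lease can overlap with the next one.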
Developing more fully distributed from day 1 requires discipline that takes time to learn, but I'm not convinced that it's actually slower - I'd compare it to e.g. using a strongly typed language, where initially you spend a lot of time bouncing off the guardrails, but over time you adapt yourself and can be productive very rapidly on new projects.
> To be honest, for me, in an interview-type situation, if you insist that Redis is the problem in that scenario - you would have failed the interview (the interview is never one-way, interviewers can fail it too).
Interesting - to me Redis in a system design is very often a case of over-architecting. It's easy to use and programmers enjoy working with it, but very often it isn't letting you do anything you couldn't do without it, and while it can speed things up, I see a lot of cases where the thing it speeds up is something that was already fast enough.
TBH I didn't communicate that clearly - my point was not "Redis in particular", but "whatever you already have at hand, for this use case". Could also be Postgres or another SQL server.
1 etcd pod doesn't give you "no SPOF": you need 3, and then you need them on multiple VMs (or physical machines if you're not on the cloud/not in k8s), and then the cluster needs to be multi-AZ, and if you're really serious about "no SPOF" that may mean geo-redundancy too... come on, the deployment costs alone are significant.
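The "you need 3" part is just majority-quorum arithmetic (etcd uses Raft, which needs a majority of members to make progress). A small worked example, with function names made up for illustration:

```python
def quorum(n: int) -> int:
    """Votes needed for a majority in an n-member cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Members that can be lost while a majority is still reachable."""
    return n - quorum(n)


for n in (1, 2, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 5 3 2
```

So 1 or 2 members tolerate no failures at all, and 3 is the smallest cluster that survives losing one.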
> my point was not "Redis in particular", but "whatever you already have at hand, for this use case". Could also be Postgres or another SQL server.
But if you're in the habit of using HA-capable systems then whatever you have to hand will be HA-capable, and so there won't really be any additional cost to using that.
And again, I think there's a real antipattern where people take a single-server application and then claim they've made it fault tolerant by making it run on multiple hosts, but it's still relying on a single DB server. In my experience that doesn't actually improve reliability at all (at least not if you've got a good deployment process for your single-server application) and it complicates your architecture to no real benefit. (Indeed, frankly, I think a lot of developers reach for an external database because they have no other idea how to store data from their application, when using embedded sqlite/hbase or - shudder! - the local filesystem would let them use a much simpler architecture and not really reduce the actual reliability of the system.)
> 1 etcd pod doesn't give you "no SPOF"
No, but it gives you a clear path to removing your SPOF when the need arises. Which is much harder if you've built your system on Redis.
All the systems mentioned in this discussion are HA-capable (even Redis, for some use cases, is perfectly HA-capable; typically it isn't appropriate for a distributed lock, but then again, for the scenario under discussion, you don't need a perfectly safe distributed lock, so it would work just fine).
The more interesting question is not whether a system is HA-capable, it's whether the system is appropriate for the job that's required of it (given said system weaknesses & strengths, plus the specific job needs). And my argument was that both Redis and Postgres were fine, for the job that was described. In an interview situation I want to see that my interviewer is capable of thinking through particular situations and having a good honest debate about strengths and weaknesses of a proposed solution _for a proposed problem_ - not just pushing their preferred solution as dogma. In many business scenarios it's fine & correct to architect systems as "HA by default" but in interview situations we're debating hypotheticals, and I am going to judge you based on the hypothetical at hand, not based on your day-to-day job, because I don't know what your day-to-day job is (and it's not what's being discussed).
> All the systems mentioned in this discussion are HA-capable (even Redis, for some use cases, is perfectly HA-capable
It really isn't, outside of some stretched definition. Nor is Postgres without third-party extensions (that come with significant issues in my experience).
> The more interesting question is not whether a system is HA-capable, it's whether the system is appropriate for the job that's required of it (given said system weaknesses & strengths, plus the specific job needs).
I used to believe this kind of thing, but I've come around to the opposite view: rather than carefully considering the strengths and weaknesses of any given system in the context of a given job, it's a lot more efficient to have some simple, easy-to-evaluate heuristics for which systems are good or bad, and to avoid even considering the bad ones. Of course occasionally you do need to dive into a full evaluation and pick your poison, but if a task doesn't have very specific requirements you avoid a lot of headache by just dismissing most of the possibilities out of hand.
> And my argument was that both Redis and Postgres were fine, for the job that was described.
But they're not contributing anything to the job that's described! Adding an extra moving part to the system that doesn't actually achieve anything is a much worse error than choosing the wrong system IMO.
As a cache, it is. All you need for a cache (if you're using it correctly, as a cache) is for the replica to be up, which it can be. Azure even gives you out-of-the-box multi-AZ replicated Redis with 99.99% promised uptime (and based on previous experience, I'd say they deliver on this promise).
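To put the 99.99% figure in perspective, here is a quick back-of-the-envelope calculation. The 99% figure for the external API is purely a made-up number for comparison, not something measured, and the combined figure assumes the two fail independently.

```python
def downtime_minutes_per_year(availability: float, days: int = 365) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * days * 24 * 60


def serial_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up (independent failures)."""
    result = 1.0
    for p in parts:
        result *= p
    return result


redis = 0.9999       # the promised uptime quoted above
faulty_api = 0.99    # hypothetical figure, for illustration only

print(downtime_minutes_per_year(redis))       # about 52.6 minutes a year
print(downtime_minutes_per_year(faulty_api))  # about 5256 minutes (~87.6 hours)
print(serial_availability(redis, faulty_api))  # about 0.9899
```

Under those assumptions the lock store accounts for roughly 1% of the total expected downtime; the rest comes from the API itself.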
> Adding an extra moving part
I specifically mentioned that I consider those good solutions for the problem at hand only if you already have them and don't need to add them; that's their strength (lots of systems already use Redis or a SQL database, e.g. Postgres - but really anything would work just fine for the task at hand).