
What I’ve always been curious about is whether you can help the S3 query optimizer* use specialized optimizations. For example, if you indicate the data is immutable[1], does the lack of a write path allow further optimization under the hood? Replicas could in theory serve requests without coordination.

*I’m using “query optimizer” rather broadly here. I know S3 isn’t a DBMS.

[1] https://aws.amazon.com/blogs/storage/protecting-data-with-am...


It’s not addressed directly, but I do think the article implies you hope your request latencies are not correlated. It provides a strategy for helping to achieve that:

> Try different endpoints. Depending on your setup, you may be able to hit different servers serving the same data. The less infrastructure they share with each other, the more likely it is that their latency won’t correlate.
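For what it's worth, a minimal Go sketch of that "different endpoints" idea (the replica URLs are made up): send the same GET to a couple of endpoints and take whichever answers first, cancelling the rest.

    package main

    import (
        "context"
        "fmt"
        "io"
        "net/http"
        "time"
    )

    // fetchFirst sends the same GET to every endpoint and returns whichever
    // body comes back first; the remaining requests are cancelled.
    func fetchFirst(ctx context.Context, endpoints []string) ([]byte, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()

        type result struct {
            body []byte
            err  error
        }
        results := make(chan result, len(endpoints)) // buffered so losers don't leak

        for _, url := range endpoints {
            go func(url string) {
                req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
                if err != nil {
                    results <- result{nil, err}
                    return
                }
                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    results <- result{nil, err}
                    return
                }
                defer resp.Body.Close()
                body, err := io.ReadAll(resp.Body)
                results <- result{body, err}
            }(url)
        }

        var lastErr error
        for range endpoints {
            r := <-results
            if r.err == nil {
                return r.body, nil // first success wins; cancel() stops the rest
            }
            lastErr = r.err
        }
        return nil, lastErr
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        // Hypothetical replicas serving the same object.
        body, err := fetchFirst(ctx, []string{
            "https://replica-a.example.com/obj",
            "https://replica-b.example.com/obj",
        })
        fmt.Println(len(body), err)
    }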


I found a broadcast[1] of it and it seems the humans and robots run in separate lanes. It also looks like there are humans walking/running with the robots to hold them up if they start to fall over. Some of them need it and some of them seem fine running on their own.

[1] https://www.youtube.com/watch?v=lEYVbq7OF3w start at 1:05:00 for video of running robots


Do you have any benchmarks for the pattern you described where channels are more efficient?

> sync.Mutex, if left to wait long enough will enter a slow code path and if memory serves, will call out to a kernel futex. The channel will not do this because the mutex that a channel is built with exists in the go runtime

Do you have any more details about this? Why isn’t sync.Mutex implemented with that same mutex channels use?

> [we] use channels extensively in a message passing style. We do not use channels to increment a shared number

What is the rule of thumb your Go shop uses for when to use channels vs mutexes?


> Do you have any benchmarks for the pattern you described where channels are more efficient?

https://go.dev/play/p/qXwMJoKxylT

go test -bench=.* -run=^$ -benchtime=1x

Since my critique of the OP is that it's a contrived example, I should mention so is this: the mutex version should use sync/atomic and the channel version should have one channel per goroutine if you were attempting to write a performant concurrent counter; both of those alternatives would have low or zero lock contention. In production code, I would be using sync/atomic, of course.

On my 8c16t machine, the inflection point is around 2^14 goroutines - after which the mutex version becomes drastically slower; this is where I believe it starts frequently entering `lockSlow`. I encourage you to run this for yourself.
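If you don't want to click through, the two benchmarks have roughly this shape (a simplified sketch along the same lines, not the exact playground code; the constants are arbitrary): a shared counter behind a sync.Mutex versus a counter owned by one goroutine that everyone else feeds through a channel.

    package counter

    import (
        "sync"
        "testing"
    )

    const (
        goroutines = 1 << 14 // roughly where the mutex version falls off a cliff for me
        incsPer    = 100     // increments per goroutine, to create real contention
    )

    // Shared counter guarded by a sync.Mutex.
    func BenchmarkMutexCounter(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var (
                mu    sync.Mutex
                count int
                wg    sync.WaitGroup
            )
            wg.Add(goroutines)
            for g := 0; g < goroutines; g++ {
                go func() {
                    defer wg.Done()
                    for j := 0; j < incsPer; j++ {
                        mu.Lock()
                        count++
                        mu.Unlock()
                    }
                }()
            }
            wg.Wait()
        }
    }

    // Counter owned by a single goroutine; the others send increments on a channel.
    func BenchmarkChannelCounter(b *testing.B) {
        for i := 0; i < b.N; i++ {
            incr := make(chan struct{}, 1024)
            done := make(chan int)
            go func() {
                count := 0
                for range incr {
                    count++
                }
                done <- count
            }()
            var wg sync.WaitGroup
            wg.Add(goroutines)
            for g := 0; g < goroutines; g++ {
                go func() {
                    defer wg.Done()
                    for j := 0; j < incsPer; j++ {
                        incr <- struct{}{}
                    }
                }()
            }
            wg.Wait()
            close(incr)
            <-done
        }
    }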

> Do you have any more details about this? Why isn’t sync.Mutex implemented with that same mutex channels use?

Why? Designing and implementing concurrent runtimes has not made its way onto my CV yet; hopefully a lurking Go contributor can comment.

The channel mutex (https://go.dev/src/runtime/chan.go) is not the same mutex as a sync.Mutex (https://go.dev/src/internal/sync/mutex.go).

If I had to guess, the channel mutex may be specialised since it protects only enqueuing or dequeuing onto a simple buffer. A sync.Mutex is a general construct that can protect any kind of critical region.

> What is the rule of thumb your Go shop uses for when to use channels vs mutexes?

Rule of thumb: if it feels like a Kafka use case but within the bounds of the local program, it's probably a good bet.

If the communication pattern is passing streams of work where goroutines have an acyclic communication dependency graph, then it's a no brainer: channels will be performant and a deadlock will be hard to introduce.

If you are using channels to protect shared memory, and you can squint and see a badly implemented Mutex or WaitGroup or Atomic, then you shouldn't be using channels.

Channels shine where goroutines are just pulling new work from a stream of work items. At least in my line of work, that is about 80% of the cases where a synchronization primitive is used.
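To make the "stream of work items" shape concrete, a minimal sketch (the worker count and the work function are made up):

    package main

    import (
        "fmt"
        "sync"
    )

    // process stands in for whatever each work item actually needs.
    func process(job int) int { return job * job }

    func main() {
        jobs := make(chan int)
        results := make(chan int)

        // Fixed pool of workers, each just ranging over the jobs stream.
        var wg sync.WaitGroup
        for w := 0; w < 4; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for job := range jobs {
                    results <- process(job)
                }
            }()
        }

        // Close results once every worker has drained, so the range below ends.
        go func() {
            wg.Wait()
            close(results)
        }()

        // Producer: communication is acyclic (producer -> workers -> consumer),
        // so a deadlock is hard to introduce.
        go func() {
            for i := 0; i < 100; i++ {
                jobs <- i
            }
            close(jobs)
        }()

        total := 0
        for r := range results {
            total += r
        }
        fmt.Println("total:", total)
    }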


Thanks for the example! I'll play around with it.

> On my machine, the inflection point is around 10^14 goroutines - after which the mutex version becomes drastically slower;

How often are you reaching 10^14 goroutines accessing a shared resource on a single process in production? We mostly use short-lived small AWS spot instances so I never see anything like that.

> Why? Designing and implementing concurrent runtimes has not made its way onto my CV yet; hopefully a lurking Go contributor can comment.

> If I had to guess, the channel mutex may be specialised since it protects only enqueuing or dequeuing onto a simple buffer. A sync.Mutex is a general construct that can protect any kind of critical region.

Haha fair enough, I also know little about mutex implementation details. Optimized specialized tool vs generic tool feels like a reasonable first guess.

Though I wonder: if you use channels for more generic mutex purposes, are they less efficient in those cases? I guess I'll have to do some benchmarking myself.
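(By "more generic mutex purposes" I mean something like the classic capacity-1 channel used as a lock; just a sketch to illustrate, not something I've benchmarked:)

    package main

    import "fmt"

    // A capacity-1 channel used as a mutex: send to lock, receive to unlock.
    type chanMutex chan struct{}

    func newChanMutex() chanMutex { return make(chanMutex, 1) }

    func (m chanMutex) Lock()   { m <- struct{}{} }
    func (m chanMutex) Unlock() { <-m }

    func main() {
        mu := newChanMutex()
        count := 0

        done := make(chan struct{})
        for i := 0; i < 10; i++ {
            go func() {
                mu.Lock()
                count++ // the critical section could be anything, as with sync.Mutex
                mu.Unlock()
                done <- struct{}{}
            }()
        }
        for i := 0; i < 10; i++ {
            <-done
        }
        fmt.Println(count)
    }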

> If the communication pattern is passing streams of work where goroutines have an acyclic communication dependency graph, then it's a no brainer: channels will be performant and a deadlock will be hard to introduce.

I agree with your rules. I used to always use channels for single-process thread-safe queues (similar to your Kafka rule), but recently I ran into a cyclic communication pattern with a queue and eventually relented to using a Mutex. I wonder if there are other painful channel concurrency patterns lurking for me to waste time on.


> How often are you reaching 10^14 goroutines accessing a shared resource on a single process in production? We mostly use short-lived small AWS spot instances so I never see anything like that.

I apologize, that should've said 2^14, each sub-benchmark is a doubling of goroutines.

2^14 is about 16,000, which for contention on a shared resource is quite a reasonable order of magnitude.


Strong yes for 3 reasons

1. Reducing dev friction.

When I had managers who coded they were ruthless about removing friction in the dev and deployment pipeline because they had to deal with it too. If build times went up, deployment infrastructure broke, or someone’s PR broke dev, they would roll it back immediately. If someone consistently blocked PRs, the manager noticed the trend and would address it.

2. You get a much better sense of IC’s contributions by writing code.

There are ICs who play politics very well and sell themselves but that set is not the same as the ICs who deliver. If you are writing code you start to notice which ICs have written key features, built critical APIs or worked on hard problems because of comments and Git blame.

3. Understanding your codebase.

I hope most managers have solid CS and engineering fundamentals, but that is a necessary yet not sufficient condition for grasping the full picture. There’s a reason it takes time to ramp up to full productivity on a new codebase. If you work in the codebase and have had to use that one annoying but critical library, or dealt with that tech debt from 2 years ago, then you know what is hard and what isn’t. I’ve found that when a codebase has a quirk that makes developing certain features hard, all of the non-technical people keep forgetting why we can’t do that thing and all the technical people have it burned into their brains.


> 1. Reducing dev friction.

This is so important, my managers who didn't code pretended things weren't too bad and took a "just deal with it" attitude whenever I proposed going for a QoL improvement.


On the flip side I had a manager who had written a lot of the codebase before I joined and had a terrible time allowing anyone to touch his precious baby, regardless of how much his prior "art" was hurting us, our productivity, and by extension the company.


Yeah... I said it was a "flaming pos" and I felt bad about that. But he won't let me add tests, so idk, whatever. (We prototype software so speed is the main goal, but yeah, stuff starts breaking, backtracking, hey... tests.)


Also, I'm just fundamentally skeptical you can do a good job of running a team, or hiring, when you don't know how to do the thing the team does. Software development skill requires active use/work to maintain it.


Hear, hear. There are a lot of decisions where there is no real contest between the choices for someone who has tried both options, but which are difficult to tell apart from a distance. EMs should be in a position where they are trying things in practice.

I'd draw an example of someone who hasn't used git before, making a choice between a git repo and managing code by keeping daily .zip files. Anyone (almost anyone) who is a career coder won't see a choice there.

That example is so basic I think most EMs would get it right even if they didn't deal in code, but the same dynamic turns up at every level of work. There are situations where there is a right option, the right option is obvious to everyone who is working on it, and it is a drain on the org when management gets confused and thinks that something that isn't an option is viable, because they aren't on the ground working on it.


I don't see why you need to be writing code to understand all of this. It can help, but almost everything you said can be ascertained from daily syncs.


Daily syncs allow the loud, not necessarily the most productive, to dominate. You cannot judge someone's actual contributions by what they say.


I didn't say you should be judging performance solely on daily syncs, but you have a myriad of ways of doing that as a manager including simply looking at the big-picture content of PRs, what issues a dev identified and solved, what devs contribute to discussions/technical solutions in slack/meetings, what projects are completed, and how well they turn out etc.

But eh? It doesn't matter how loud you are in a sync; you can very easily go off the actual content of what someone is saying. If someone goes off for 5 minutes about how they managed to turn an object into a JSON string, that doesn't exactly make them look good.


You don’t need to be writing code. But it’s a convenient shortcut to a great many things that otherwise take dedicated effort to understand.

Some managers will do that. Most won’t. Given that, it’s easier to just tell them all to code.


Or just tell them all to regularly communicate/listen to their team? Sounds way more efficient.


If a manager communicated with and listened to their team, but their team had written a web service with gaping security holes or disastrous data integrity practices because all their senior engineers were incompetent and/or were hired at a level that was above their ability, would that manager find out just from chatting with them?

I promise you that it's not guaranteed. You need to actually go looking through the code to find everything that's wrong.


I think you can certainly learn a lot by being curious and fishing out anything that smells, yeah.

But agreed that you'll find more mistakes if your manager also happens to be the best IC on the team.


Doesn't work, because the team won't always tell you of the issues that block them (normalization of deviance). Sometimes you need to find out yourself.


You can see this when you join a (dysfunctional) new team, notice the friction, then suggest improving it.


Fewer or shorter daily syncs is a plus


I don’t see how you can understand something without being part of it.

It is like learning stuff from a book vs learning hands-on. There is no book that will teach you skiing.

The same goes for working with the team; it is so much different from listening and trying to understand.

Everyone nags about how MBA graduates ruin everything by thinking that you can manage anything and it doesn't matter who or what you're managing.


But you are a part of it. You are the manager of the codebase; you should actively be part of the discussions of what's being merged in, what your architecture looks like, what changes are required to complete a new project, what issues are arising, what the blockers are on projects, what's slowing down your team, etc. None of that requires you to sit down and code.

If you're really listening and asking the right questions, you should be aware of even changes like "We're deciding to use this HTTP client rather than what we currently have". OK, why are we doing that? What was the issue? Ask, ask, ask.

As a manager I'd argue you have (or should have) more technical insight into your whole codebase than any IC.


After thinking about what you wrote here I have some conclusions:

If someone is a "good manager" who does all those things, then whether he writes code or not he is going to be a "good manager" anyway. Explicitly writing code might not be the best use of his time, but hey, if the person feels like he needs it, that is on him.

If someone is a "bad manager" who doesn't bother to deal with technical details, wants only to do "important management stuff", and thinks he can manage by proxies like counting story points or counting closed tasks, who does not care about HTTP client A or B or about learning the system, he is going to be a "bad manager" and will never even care about writing code.

Finally, "bad manager" vs "good manager" is hard to tell, because a "bad manager" can be as good for the company or a team as a "good manager", and depending on many other factors the "down to earth, hands dirty, good manager" might actually be really bad for the company or the team given the business context.


How do you do any of that without extensively reading code? Do you consider reading large volumes of code not "sit[ting] down and cod[ing]"? Just because you're not actually producing large volumes of commits does not mean that it's not coding.


I think "coding" is commonly understood as writing code, not reading code; you know, like if I say "I coded Google Jamboard", someone thinks I developed it, not that I read the source code. But besides, why do you need to extensively read code to understand what's going on?

Assume you read your code base once and understood it. You get a feature, discuss how it's going to be done (we'll add this table, add these endpoints, etc). You should already have a damn good idea what that's going to look like in your codebase. I don't think knowing all the low-level details is necessary (most people would call that micromanagement), and besides, writing code yourself isn't going to help you know all the low-level details of all the other projects.


> or worked on hard problems because of comments and Git blame.

Oh lord we'd better hope they have absolutely IMPECCABLE git fu if they are going to be using this metric. Unfortunately here on HN I've seen people essentially brag that they only know just enough git to get by and "who cares if I don't know all the other commands deeply." In any event, this scenario REQUIRES that a manager know exactly how to determine who originally introduced something, or, exactly where it was significantly improved if they are going to be reading comments and blaming to see "who performs."

The very fact a manager might be doing this has got me a little worked up, mainly as I know great managers who don't do this and who are scared of something as simple as the reflog.


I tend to agree but, playing devil's advocate, is this true for other roles? Does a movie director need to know how to build sets? How to sew costumes? How to use Blender/Maya/Houdini? My manager can code, used to code, sometimes does code, but they aren't familiar with their team's current work.

Like imagine you were a coding manager 10 years ago with AI experience. Sometime over the last 10 years your team does AI infra. You, as a manager and as an IC, have zero AI experience (you've never trained a model, never used a trained model, never used any of the various AI frameworks). Are you still okay to manage this team or should you be replaced with someone who does have that experience?


Toyota calls it the gemba walk. Managers need to see how the factory is running with their own eyes. Not just live behind a desk and listen to what they hear in meetings.

A movie director can see the sets with their own eyes. But you can't see the state of a software codebase without reading and understanding the code, and the most surefire way to do that is to try to write something, even just documentation.

You don't assess the state of your software by walking around the office and looking at hands on keyboards. You look at the codebase.


> I tend to agree but, playing devil's advocate, is this true for other roles? Does a movie director need to know how to build sets? How to sew costumes? How to use Blender/Maya/Houdini?

I don't know that much about movie making, but my understanding is that there would be managers and/or leads within each specialty, who are (among other things) managing the interaction between their specialty and the director / producers.

That seems pretty comparable to what's being discussed here.


In any industry, if you want a team to work well, you have to have someone with both authority and hands-on experience who’s responsible for providing day-to-day guidance. Sometimes that person is called a “supervisor” or “tech lead” instead of “manager”, although this typically implies some division of responsibilities as well; no reason the person providing guidance necessarily has to be the same person reporting to leadership or hiring and firing.


> I tend to agree but, playing devil's advocate, is this true for other roles? Does a movie director need to know how to build sets? How to sew costumes? How to use Blender/Maya/Houdini? My manager can code, used to code, sometimes does code, but they aren't familiar with their team's current work.

Many directors started in other roles in the movie industry, typically as writers, PAs, or other subspecialties. Chad Stahelski was a stuntman and stunt coordinator before he started directing John Wick, and it really shows.

I think the clear distinction is between someone who understands a part of the job, and someone who is good at part of the job. If you don't understand how costuming works, as a director, you're going to have a hard time getting good costumes, but by no means does that mean you're able to pinch hit in that role. I personally believe that it's difficult to replace hands on experience as a way to truly understand something.

In software engineering, I think there's a huge gap between managers who worked in some other industry and transferred over, versus having previously been an engineer, even a mediocre one. Knowing how the sausage is made is hard to replace.


Though the fact that directors have certain biases from how they worked their way into the role does also highlight an issue with this kind of effect: when you have technical leads or project managers on a big multi-disciplinary project, they will have a natural tendency to favor the areas they are more familiar with, and bias the decision-making and planning of the project around that. It can be difficult to step back and optimize for the project/system as a whole.


> When I had managers who coded they were ruthless about removing friction in the dev and deployment pipeline because they had to deal with it too.

For me a good manager is a facilitator, not a leader. Someone who removes obstacles for us. Whether they themselves are affected or not. Someone only fixing an issue because they have to deal with it too seems like a pretty bad manager to me.

They're not for pushing targets or trying to weed out non-performance, I don't work at a playschool. My manager is there to make sure I can do my job and that I can reach my maximum potential (including making sure I'm in the right job)


When the company tells your manager "we need to cut wood" and you tell your manager "I need to sharpen my axe", these things are in harmony but it's still a balancing act. The manager should trust your judgement, but they may also have a better view of the short-vs-long-term tradeoffs, and sometimes we spend too much time sharpening. Sometimes we don't spend enough.

I think a good manager should be able to take a swing with the axe to get a feel for its sharpness.


Friction can be more of a problem too though. If your manager is objectively better than the team, estimates can get cut short and failing to meet those adds tension.

Obviously a good manager might pitch in and understand their team's capabilities, but it's not always a natural transition for senior devs moving to management.


I think every team needs a TL. If the EM isn't filling that role, then another team member should be, and most of what you're talking about falls on the TL (with some sanity checking from the EM by talking to other team members about these things as well)


Well said — completely true in my experience. It’s called Engineering Manager for a reason!


I think it depends on how it is done, and the kind of ICs you have on the team. It can come off as micromanagement, which may work well enough if you have not-so-competent ICs, but will backfire if you have talented ones.


I've found it really helpful to be a support programmer. not someone that takes on big tasks. nothing with a hard deadline. not something that someone else needs to do their work. leftover cleanup. testing. minor refactoring. build.

you need to keep your hand in the game just to understand what's going on with the codebase. but you're not an a-list player here.


> It can come off as micromanagement, which may work well enough if you have not-so-competent ICs, but will backfire if you have talented ones.

Yep agreed - I've seen a couple of managers that were probably fine as developers but struggled (to their extreme detriment) with being pretty average compared to the senior developers that they were managing. Their 'helpful advice' just served to show how superficial their understandings of the systems were.


Just curious, have you managed people? In what capacity (TL? EM? PM?)? How big was your team? What was the company environment like in which your team(s) functioned?


I guess I should have added some context. I have been in (and keep swinging between) IC and management roles (including managers) regularly. I love coding and try to sneak in some when I have time (as a manager).

But that a manager should always code is not something I have found to help the team or the manager all the time. One size does not fit all. In startups, yes, frankly there is hardly a need for a manager, and it is the TL, TPM, and EM roles combined into one.

In larger cos though, most managers are inundated with all kinds of non-technical work (meetings, alignment, perf management, product discussions, etc.). While having coded before is a great thing, keeping up to date actually robs the manager of time for all the other things on the plate (and those actually benefit the team beyond what meets the eye).

Besides, at large orgs there is also so much technical knowledge (think large-scale design and integrations) that a manager needs to keep track of, which also needs time investment.

Then there are various level/career-related things that necessitate one or more TLs whom a manager needs to work with or manage, and coding often gets seen as a manager "not doing their job" or, worse, stealing a junior engineer's opportunities.

There's a lot more that is very environmental, but I hope this sets some context.


just curious what's your ssn, dob and mother's maiden name.


At the same time, they also tend to start interfering in the solutions proposed by their team. It's hard to stave off that temptation.


First: it's not "interference" if they are also part of the team.

Second: If their ass is on the line, then they DO get a bigger say. They are paid for seeing potential problems, guiding the team, among other things.


In practice, many act as an intermediary who can take the credit for the wins while passing down blame for the misses.

It's not a good leadership trait but it's an effective career advancing move.

The entire list in the post reeks of an aspirational intermediary that doesn't actually do any of those things as effectively as empowered project/team leads who do contribute to the product. It's fluff, and very easy fluff to remove without feeling pain. Of course, mediocre teams will have mediocre developers who won't want responsibility and will benefit from intermediary "bossy" managers.


Man, I wish the manager's ass was ever on the line. The amount of times I've seen a manager's whole team get laid off and the manager get moved to a different team to fuck everything up again is too many times.

I think I've seen a manager get laid off exactly never. And I've often seen half their team laid off because they were terrible at their job, but the management class takes care of their own.


Yeah if I implemented my manager's ideas, I'd be the one fixing them too. No thanks - if I have to deal with the problems, I'll decide the solutions too.


How is a manager's ass more on the line compared to an IC's?

Managers always have higher job security than ICs, in my experience.

Poor managers always use this dumb excuse of 'ass on line' to override good decisions by ICs with their own shitty decisions.


> It is a curly-brace language

What does this mean? Not Python?


A language that demarcates code blocks with curly braces {} instead of whitespace, begin/end keywords, round brackets (), etc.

They often share other syntactic similarities, but there isn't one particular common set across all of them.
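For example, Go is one; the braces delimit the block and the indentation is just convention:

    package main

    import "fmt"

    func main() {
        x := 3
        if x > 0 { // the braces, not the indentation, define the block
            fmt.Println("positive")
        }
    }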


Looks like you are also hiring a distributed systems engineer https://news.ycombinator.com/item?id=42920292

From that listing "Sync Speed: Customers want to sync a lot of data to important destinations like Facebook and Snapchat, which requires us to analyze every part of our syncing process and find where we can optimize to sync data more quickly"

I'm curious about this. What workflow requires syncing high volumes of data from a CDW to Facebook & Snapchat at low latency? It's my understanding that businesses mostly use those platforms for advertising. I'm struggling to think of a use case where you want to adjust your advertising with low latency and lots of data. I could understand feeding lots of data from your CDW into an ML model that updates your ads through the FB Ads API, but I can't see why

1. it needs to go straight from CDW to FB ?

2. it needs to be a lot of data?

3. it needs to be fast?

Perhaps there is some other use-case besides adjusting ads.

4. Also, why do you use the word "syncing" rather than "sending"? I tend to think of syncing as involving multiple programs that can edit data (e.g. Google Docs, distributed consensus, etc.). Are Facebook and Snapchat actually updating the data you send, so you have to sync in the other direction? Or is it just one-way?


I work on the syncing team at Hightouch. These are great questions and also good feedback on how we could be clearer when describing the problems we need to solve.

1. We also support the case you describe, in which an ML model processes data and then updates properties in a destination. However, customers still get a lot of value out of keeping downstream systems synced with their warehouse tables. For instance, you can define which people you want to receive different campaigns and make sure that's consistent across all your ad platforms. You can also use it for simple projects like easily keeping Airtable in sync with a Postgres database.

2. Some people have warehouse tables with many billions of rows.

3. If you have a billion rows, you need to hit a very high rows per second number in order to run a sync in a feasible amount of time. Also, we have an event collection product, which allows customers to feed events into Hightouch in realtime, and a personalization API product, which allows customers to hit an API and get a low-latency response for how a given user's experience should be personalized. Making sure that the new data flowing into the events API is processed, and data is ready in the personalization API for fast fetching, needs to be fast.

4. It's true that syncing often implies some bi-directionality. In this case we think about "syncing" the destination system state to that present in the source system. It's nice because you can use the source system as the source of truth and trust that any edits you make will be reflected elsewhere. Possibly across many destinations.

Is this helpful?


1. > However, customers still get a lot of value out keeping downstream systems synced with their warehouse tables. For instance, you can define which people you want to receive different campaigns and make sure that's consistent across all your ad platforms.

Ah thank you that's a very helpful example for understanding your product.

2. > Some people have warehouse tables with many billions of rows.

I don't doubt customers have multi billion row tables in their CDWs but I guess I'm not seeing why you would need to send billions of rows to FB (or any other downstream system that isn't an OLAP database) rather than some much smaller payload distilled from that data via an ML model or SQL query. I admittedly have never run a FB ad campaign but Meta has 3.35 billion daily active users [1] across all of their products. If they are sending billions of rows to FB, do your customers have individualized ad campaigns for every single FB user? Perhaps I'm just not familiar enough with the state of modern digital advertising.

3. > If you have a billion rows, you need to hit a very high rows per second number in order to run a sync in a feasible amount of time.

I wonder why you have to send billions of rows to FB every time. Surely you send it once for initial setup and then incrementally sync smaller deltas? And presumably your customer is OK with the initial setup being slower. Unless your customer is doing billions of writes to their CDW in between syncs?

Thanks for taking the time to explain!

[1] https://s21.q4cdn.com/399680738/files/doc_financials/2024/q4... page 10


You're right that typically the day-to-day delta we need to sync is much smaller. However, customers often want to change something for a large fraction of their dataset, requiring a large update. Also, with our Personalization API product, customers do frequently want to refresh what personalizations they're showing to all of their users. For this we do need to be regularly syncing all the data.


They say the "stop the world" approach that causes more downtime is

  Turn off all writes.
  Wait for 16 to catch up
  Enable writes again — this time they all go to 16
and instead they used a better algorithm:

  Pause all writes.
  Wait for 16 to catch up. 
  Resume writes on 16.
These seem pretty similar.

1. What is the difference in the algorithm? Is it just that in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done? Whereas in the latter approach the client never sees their txns fail, they just have a bit more latency?

2. Why does the second approach result in less downtime?


> in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done? Whereas in the latter approach the client never sees their txns fail, they just have a bit more latency?

Yes, this is the main difference. For "stop the world", we imagined a simpler algorithm: instead of a script, we could manually toggle a switch for example.

However, by writing the script, the user only experiences a bit more latency, rather than failed transactions.
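If it helps to picture it, the semantics are roughly this (a toy Go sketch just for illustration, not our actual code): writes take the read side of a gate, the failover takes the write side, so new writes block briefly instead of erroring.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // writeGate pauses writes instead of failing them: normal writes take the
    // read side of an RWMutex, the failover takes the write side, so new writes
    // block until the switch completes instead of returning errors.
    type writeGate struct{ mu sync.RWMutex }

    func (g *writeGate) DoWrite(w func()) {
        g.mu.RLock()
        defer g.mu.RUnlock()
        w()
    }

    func (g *writeGate) Failover(switchPrimary func()) {
        g.mu.Lock() // waits for in-flight writes, holds back new ones
        defer g.mu.Unlock()
        switchPrimary()
    }

    func main() {
        var g writeGate

        go g.Failover(func() {
            time.Sleep(time.Second) // stand-in for "wait for 16 to catch up"
            fmt.Println("now pointing at the new primary")
        })

        time.Sleep(10 * time.Millisecond) // let the failover grab the gate first
        start := time.Now()
        g.DoWrite(func() {
            fmt.Println("write went through after", time.Since(start))
        })
    }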


> If we went with the ‘stop the world approach’, we’d have about the same kind of downtime as blue-green deployments: a minute or so.

> After about a 3.5 second pause [13], the failover function completed smoothly! We had a new Postgres instance serving requests

> [13] About 2.5 seconds to let active queries complete, and about 1 second for the replica to catch up

Why is the latter approach faster though? It seems in the "stop the world" approach wouldn't it still take only 1 second for the replica to catch up? Where do the other ~59 seconds of write downtime come from?


In the "stop the world" approach, I imagined our algorithm to be a bit more manual: for example, we would turn the switch off manually, wait, and then turn it back on.

You make a good point though, that with enough effort it could also be a few seconds. I updated the essay to reflect this:

https://github.com/instantdb/instant/pull/774/files


Did you test the "stop the world" approach? I wonder how the write downtime compares. It seems the 1 second of replication lag is unavoidable. The arbitrary 2.5 seconds of waiting for txns to finish could be removed by just killing all running txns, which your new approach does for txns longer than 2.5 seconds already.

> ;; 2. Give existing transactions 2.5 seconds to complete.

> (Thread/sleep 2500)

> ;; Cancel the rest

> (sql/cancel-in-progress sql/default-statement-tracker)

Then you have 2.5 seconds less downtime and I think you can avoid the problem of holding all connections on one big machine.

> Our switching algorithm hinges on being able to control all active connections. If you have tons of machines, how could you control all active connections?

> Well, since our throughput was still modest, we could temporarily scale our sync servers down to just one giant machine

> In December we were able to scale down to one big machine. We’re approaching the limits to one big machine today. [15] We’re going to try to evolve this into a kind of two-phase-commit, where each machine reports their stage, and a coordinator progresses when all machines hit the same stage.

I guess it depends on what your SLO is. With your approach, only clients whose txns started before the upgrade and ran longer than 2.5 seconds see them fail, whereas with the "stop the world" approach there would be a period, lower-bounded by the replication lag time, where all txns fail.

Cool work thanks for sharing!

Edit: I feel like a relevant question regarding the SLO that I'm not considering is how txns make their way from your customers to your DB. Do your customers make requests to your API and your application servers send txns to your Postgres instance? I think then you could set up a reasonable retry policy in your application code (sketched at the end of this comment), use the "stop the world" approach, and once your DB is available again the retries succeed. Then your customers never see any txns fail (even the long-running ones), just a slight increase in latency. If you are worried about retrying in cases that are not related to this upgrade, you could change the configuration of your retry policy shortly before/after the upgrade. Or return an error code specific to this scenario so your retry code knows.

Then you get the best of both worlds: no downtime perceivable to customers, no waiting for 2.5 seconds, and you don't have to write a two-phase-commit approach for it to scale.

If your customer sends txns to your Postgres instance directly, this wouldn't work I think.
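To make the retry idea above concrete, a rough sketch (the function name, attempt count, and backoff numbers are all placeholders):

    package dbretry

    import (
        "context"
        "database/sql"
        "time"
    )

    // execWithRetry retries a write while the database is briefly unavailable
    // (e.g. during the switchover window), backing off between attempts.
    // Real code would only retry errors that look like "temporarily unavailable",
    // not every failure.
    func execWithRetry(ctx context.Context, db *sql.DB, query string, args ...any) error {
        backoff := 100 * time.Millisecond
        var err error
        for attempt := 0; attempt < 8; attempt++ {
            if _, err = db.ExecContext(ctx, query, args...); err == nil {
                return nil
            }
            select {
            case <-time.After(backoff):
                backoff *= 2 // exponential backoff between attempts
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        return err
    }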


To add onto this, I feel like one of the hard things about TF is that there are at least 3 ways to do everything, because they have supported multiple APIs and migrated to eager execution. So if you find an example or an open-source project, it might not be for the flavor of TensorFlow that your codebase is in.


Moreover, the way you find might not be the best or the most efficient way.

