Building Reliable Distributed Systems in Node.js (temporal.io)
45 points by mfateev on Jan 21, 2023 | 17 comments


Temporal is an implementation of a paradigm I got interested in back in 2019. I wasn’t at one of the companies that had heard about Cadence, so when I was searching around to see if anyone had already built this idea I’d come up with, I stumbled upon Zenaton. It’s no longer around (it never found PMF), so I was happy when Temporal came out of stealth mode a few months later; it was nice to have my intuition in this area validated.

We’ve been using Temporal quite successfully in Go (and more recently Python) for a little while now. It could do with being a bit easier to get up and running with, but day-to-day usage is very nice. I don’t think I could go back to plain old message queues; this paradigm is a real time saver.

The biggest challenge is deciding how many things are nails for the hammer that is Temporal. You tend to start out using it to replace an existing mess of task orchestration; but then you realise it’s actually a pretty good fit for any write operation that can’t neatly fit in a single database transaction (because it’s hitting multiple services, technologies, third parties, etc.).

You have to be careful to keep your workflows deterministic, but once you get used to the paradigm, it’s enjoyable.
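
To make the determinism constraint concrete, here’s roughly what it looks like in the TypeScript SDK (a minimal sketch; quoteWorkflow and fetchQuote are illustrative names, not from the article):

    // workflows.ts: durable, deterministic code only
    import { proxyActivities, sleep } from '@temporalio/workflow';
    import type * as activities from './activities';

    const { fetchQuote } = proxyActivities<typeof activities>({
      startToCloseTimeout: '30 seconds',
    });

    export async function quoteWorkflow(symbol: string): Promise<number> {
      // Fine: network I/O goes through an Activity, so its result is
      // recorded once and reused on replay.
      const quote = await fetchQuote(symbol);

      // Fine: the SDK's sleep() is a durable, replay-safe timer.
      await sleep('10 seconds');

      // Not fine inside a workflow: Date.now(), Math.random(), fetch(), fs;
      // these would return different values on replay and break determinism.
      return quote;
    }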


This post talks about durable execution systems, which include Azure Durable Functions, Amazon SWF, Uber Cadence, Infinitic, and Temporal.

Durable execution systems run our code in a way that persists each step the code takes. If the process or container running the code dies, the code automatically continues running in another process with all state intact, including call stack and local variables.

Durable execution makes it trivial or unnecessary to implement distributed systems patterns like event-driven architecture, task queues, sagas, circuit breakers, and transactional outboxes. It’s programming on a higher level of abstraction, where you don’t have to be concerned about transient failures like server crashes or network issues.
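
As a sketch of what “call stack and local variables intact” buys you, using Temporal’s TypeScript SDK (the order-processing names below are illustrative, not from the post):

    import { proxyActivities } from '@temporalio/workflow';
    import type * as activities from './activities';

    const { reserveInventory, chargeCard, shipOrder } =
      proxyActivities<typeof activities>({ startToCloseTimeout: '1 minute' });

    export async function processOrder(orderId: string): Promise<string> {
      // Each awaited Activity is a persisted step. If the worker process dies
      // after chargeCard but before shipOrder, another worker resumes right
      // here: `reservation` and `chargeId` are rebuilt by replay, not lost.
      const reservation = await reserveInventory(orderId);
      const chargeId = await chargeCard(orderId, reservation);
      await shipOrder(orderId, chargeId);
      return chargeId;
    }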


you have to code your entire architecture around this premise though, no?

aka start from scratch and write things a certain way


No. The sample app is 100% Temporal on the backend, but you can adopt it incrementally, writing durable functions for specific processes. Usually companies start out with things that are either long-running or for which reliability is particularly important, like financial transactions. Then they learn that it can be more generally useful, and expand use cases gradually.


Sounds like the whole point is you don't have to take anything special into account. Edit: Except determinism?


The point is that you can write code instead of JSON/YAML, as in traditional microservice orchestration like AWS Step Functions. And it’s not a limited DSL: you have the full language at your disposal, with the one requirement that deterministic code (workflows, i.e. the durable code) lives in separate functions from non-deterministic code (like making a network request, called “Activities”).
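
For contrast, the non-deterministic half is just plain async code. A hypothetical Activity (the payment endpoint here is made up, and this assumes Node 18+ for global fetch):

    // activities.ts: non-deterministic code lives here
    export async function chargeCard(orderId: string): Promise<string> {
      // Ordinary network call; Temporal records the result and retries
      // failed attempts according to the configured retry policy.
      const res = await fetch(`https://payments.example.com/charge/${orderId}`, {
        method: 'POST',
      });
      if (!res.ok) throw new Error(`charge failed with status ${res.status}`);
      return res.text();
    }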


https://github.com/temporalio/hello-world-project-template-j...

You have to write your code using Temporal SDK.

At a quick glance:

main() calls WorkflowServiceStubs/WorkflowClient

I also see something called an "Activity"

I also see something called a Worker.

Not trying to argue. Genuinely curious if you think this is within the realm of "not taking anything special into account" (being forced to use a specific SDK and laying your logic out in the exact way it supports), or if you didn't know this was referring to Temporal?
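
For concreteness, here's roughly what those pieces look like in the TypeScript SDK (a sketch; the task queue name and the processOrder workflow are illustrative, carried over from the earlier example):

    // worker.ts: hosts your code; registers workflows + activities, polls a task queue
    import { Worker } from '@temporalio/worker';
    import * as activities from './activities';

    async function runWorker() {
      const worker = await Worker.create({
        workflowsPath: require.resolve('./workflows'),
        activities,
        taskQueue: 'orders', // must match the queue the client targets
      });
      await worker.run();
    }
    runWorker().catch((err) => { console.error(err); process.exit(1); });

    // client.ts: the TS analogue of WorkflowServiceStubs/WorkflowClient
    import { Connection, WorkflowClient } from '@temporalio/client';
    import { processOrder } from './workflows';

    async function startOrder() {
      const connection = await Connection.connect(); // localhost:7233 by default
      const client = new WorkflowClient({ connection });
      const handle = await client.start(processOrder, {
        taskQueue: 'orders',
        workflowId: 'order-123',
        args: ['order-123'],
      });
      console.log(`started ${handle.workflowId}`);
    }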


100%. Like it or not, vendor lock-in is there with all workflow solutions. We need standardisation across workflow solutions, as discussed in detail here: https://twitter.com/gwenshap/status/1505950830767206400?s=46...


I thought this comment chain was about durable execution in general. Temporal seems to be that plus some RPC stuff that is a lot more than "nothing special."


All the durable execution systems have to run your code in a certain way that persists steps like RPCs (and need to provide a mechanism for you to tell the system which functions make RPCs) so they can recover from process failures. They all also happen to provide common orchestrator features like retries and timeouts, because devs find them useful.
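
For example, retries and timeouts are declared per Activity. In the TypeScript SDK that looks roughly like this (the values and the error type name are illustrative):

    // inside a workflow file (workflows.ts)
    import { proxyActivities } from '@temporalio/workflow';
    import type * as activities from './activities';

    const { callFlakyService } = proxyActivities<typeof activities>({
      startToCloseTimeout: '30 seconds', // timeout for each individual attempt
      retry: {
        initialInterval: '1 second',
        backoffCoefficient: 2,
        maximumInterval: '1 minute',
        maximumAttempts: 10,
        nonRetryableErrorTypes: ['InvalidRequestError'], // fail fast on these
      },
    });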


Is this similar to Apache Camel or Spring Integration?


Never heard of durable execution until now, but I've wondered about it. When I write backend code, I have to keep asking myself "what happens if the server goes down during this line of code?" This is often an issue in the middle of a customer order, like the example here. I end up relying on the database for many tiny things, like recording the fact that the user initiated an order before I start to process it.

But how fast is this? IIRC each little insert in my DB was taking like 5ms, which would add up quickly if I were to spam it everywhere; I assume durable execution layers are better optimized for that. Do they really only snapshot before and after async JS calls, treating all other lines as hermetic and thus able to be rerun?


Yeah, I’ve also written code in this write-to-db-after-each-meaningful-line style, and this is a great improvement. See the first 20 minutes of this talk for an example: https://youtu.be/EFIF8gk9zy8

Starting a workflow is currently ~40ms, and I think we’ll be able to get that down to 10ms this year. How long it takes to complete depends on how many persisted steps it takes (and whether it has to wait on an external event). The only steps that are persisted are workflow API calls like sleep(), startChildWorkflow(), or calling code that might fail (i.e. an “Activity”, like a network request).
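
In the TypeScript SDK the child-workflow call is executeChild()/startChild(), and the persistence boundary looks like this (a sketch; the reminder example is made up):

    import { proxyActivities, sleep } from '@temporalio/workflow';
    import type * as activities from './activities';

    const { sendReminder } = proxyActivities<typeof activities>({
      startToCloseTimeout: '1 minute',
    });

    export async function reminderLoop(email: string, days: number): Promise<void> {
      for (let i = 0; i < days; i++) {
        await sleep('24 hours');   // persisted: a durable timer
        await sendReminder(email); // persisted: the Activity result is recorded
      }
      // The loop counter itself is never written anywhere; on replay the loop
      // re-executes deterministically up to the last recorded event.
    }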


> The only steps that are persisted are workflow API calls like sleep(), startChildWorkflow(), or calling code that might fail (i.e. an “Activity”, like a network request).

Ok, that's what I was wondering. Makes a lot more sense this way.


Or you could not use a scripting language and save 50 times the cost


We have Go and Java SDKs that have better performance characteristics, if that’s what you’re optimizing for. I think for many businesses, optimizing for development speed is a higher priority (e.g. if the devs already know JS, use that). The Node runtime with v8 isolates is also able to better protect developers from writing non-deterministic code (durable code must be deterministic). More info on that: https://temporal.io/blog/intro-to-isolated-vm


That doesn't solve the problem of long-running processes, CPU time isn't the limiting factor here, and devs cost more than compute resources.



