The Workflow Pattern (blog.bittacklr.be)
129 points by kiyanwang on Oct 6, 2023 | 34 comments



The existence of workflow DSLs seems like a symptom of a systemic weakness in common programming languages.

We use languages best suited to writing Windows GUIs or Linux kernels to implement business rules — and then we act surprised when we have to invent an entire language and runtime to solve simple business problems.

One missing feature in typical languages is the native ability to freeze a computation in the middle of a function call and later defrost it, continuing as if nothing had happened. The low-level mechanisms are often available already — such as object serialisation — but these primitives never support serialising a call stack or an in-flight computation.

We even have compilers that can split up and restructure an async function into a heap object and an associated state machine!

What’s the difference between an async function that is awaiting a slow HTTP call and an async function awaiting a long-running workflow step? Only that the state machine of the latter is persisted to storage instead of the heap!

I always thought it was a bit silly that the async mechanism in modern languages is so myopic and single-purpose. These kinds of high level language transformations ought to be extensible and pluggable so that we could write workflows in a proper programming language and have it look like normal code except for the occasional “await” keyword.
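Hand-rolled, the transformation you'd want the compiler to do looks something like this in Python: the call stack becomes an explicit, serialisable resume point (a toy sketch, with made-up names).

    import pickle

    # What the async compiler normally hides on the heap, made explicit
    # so it can survive a process restart. Illustrative sketch only.
    class ApprovalFlow:
        def __init__(self, order_id):
            self.order_id = order_id
            self.step = "await_approval"   # the resume point, reified

        def resume(self, event):
            if self.step == "await_approval":
                self.approved = event
                self.step = "await_payment"
            elif self.step == "await_payment":
                self.paid = event
                self.step = "done"
            return self.step

    flow = ApprovalFlow("order-42")
    flow.resume(True)                      # advance one step
    blob = pickle.dumps(flow)              # freeze mid-workflow
    flow = pickle.loads(blob)              # defrost days later, elsewhere
    print(flow.resume("paid"))             # -> "done"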

PS: The same philosophy could be applied to Java Loom style async programming where threads could be marked as eligible for hibernation, in which case they would be restricted to using data types that can be safely round-tripped by the serialiser.


> I always thought it was a bit silly that the async mechanism in modern languages is so myopic and single-purpose. These kinds of high level language transformations ought to be extensible and pluggable so that we could write workflows in a proper programming language and have it look like normal code except for the occasional “await” keyword.

This is how Temporal works. For example, in Python the async event loop is replaced by a durable event loop [0], the JS promises become durable promises, .NET tasks become durable via a custom scheduler, etc. Granted, it doesn't serialize the stack; it uses event sourcing almost exactly like the article describes, and therefore requires deterministic code for replaying. From the dev POV, it looks like any code can just be frozen in the middle of a function and magically resumed elsewhere.

0 - https://temporal.io/blog/durable-distributed-asyncio-event-l...
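To make that concrete, here's roughly what workflow code looks like with the Python SDK (names like OrderWorkflow and charge_card are made up for the example; you'd also need a worker and a Temporal server to actually run it):

    from datetime import timedelta
    from temporalio import activity, workflow

    @activity.defn
    async def charge_card(order_id: str) -> str:
        return f"charged {order_id}"

    @workflow.defn
    class OrderWorkflow:
        @workflow.run
        async def run(self, order_id: str) -> str:
            # Looks like a normal await; the durable event loop records
            # the result so a replay on another worker resumes here.
            return await workflow.execute_activity(
                charge_card,
                order_id,
                start_to_close_timeout=timedelta(seconds=30),
            )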

(disclaimer, I work at Temporal and have written some of these distributed coroutine impls)


Workflow engines are not just about deferring an HTTP call, nor are they tied to a specific language.

They are a generic way to run some background process that doesn't need too much human interaction and can be coded in any language. If you need to orchestrate tens of thousands of tasks, you will beg for a DSL and some kind of interface to monitor things.


I feel like awaiting some human interaction is one of my most common requests for workflows: await some approval, or some manual task that can't be automated, etc.

Honestly, when working with some new archaic system, the ability to insert manual tasks is invaluable for an early deployment.


The missing feature you're describing is monads. They can essentially be a first-class abstraction for computation (among other things). They can represent synchronous computations, computations that can be suspended/resumed, computations that might fail with a particular type of exception, or anything in between.
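For example, a suspended computation can just be an ordinary value that a driver loop resumes. A rough Python gesture at the idea (purely illustrative, nothing language-specific):

    from dataclasses import dataclass
    from typing import Any, Callable

    # A computation is either finished (Done) or paused waiting for
    # input (Suspend). Because the suspension is a value, the runtime
    # decides when and where to resume it.
    @dataclass
    class Done:
        value: Any

    @dataclass
    class Suspend:
        prompt: str
        resume: Callable[[Any], Any]   # returns the next Done/Suspend

    def double_the_answer():
        return Suspend("amount?", lambda amt: Done(amt * 2))

    step = double_the_answer()
    while isinstance(step, Suspend):
        print("computation asks:", step.prompt)
        step = step.resume(21)         # the driver supplies the answer
    print(step.value)                  # -> 42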


This is one way to structure it, but not the only one. Scheme's call-with-current-continuation comes to mind here.


One really nifty use case of continuations is the Seaside Smalltalk web framework [1]. It implements stateful request/response conversations with Smalltalk's ability to capture and serialize continuations.

Continuation-based web frameworks are really nifty, but have potential scaling bottlenecks. IIRC HN originally used some type of continuation web framework with Arc, but ran into scaling and caching headaches. Dunno how it works these days.

[1] https://en.wikipedia.org/wiki/Seaside_(software)


I didn't know it could serialize continuations! I always thought that with Seaside you have to run a single server, and any reboot would just kill the state.


I didn't think call/cc could do multiple resumptions? DDBinding certainly can, and can do partial environment closures, so you can migrate a process to a different machine as well: https://okmij.org/ftp/papers/DDBinding.pdf

The above is succinctly implemented in wat: https://github.com/manuel/wat-js (also see forks for more documented versions).


And there are always long-running transactions, where the workflow/state machine can take days or weeks to finish. Somehow we all still live in the hope that certain things will happen fast enough to live in threads/processes/memory (with no persistent backing, except maybe retrying the whole thing), while those that can take "longer" (how long is "longer"?) should be completely different beasts. From a business-level view there is not much difference, yet they are programmed in completely different ways.

For example, in some businesses a payment is just another communication. Not that sending a single message across is a very simple thing, but it is definitely a lot less complicated than a payment process.

Maybe the split — handled internally by the language and forgotten if the power goes off, versus externalized and persisted — should depend on whether a certain boundary is being crossed? Defining "boundary" per project.



Rather than focus on the DSL/code-writing aspect of workflow pattern tools, I feel the crucial difference is that most workflow systems are intended for a broad audience of users, many of whom are not devs.

I have similar doubts about how important the difference is between programming as devs do it and these tools. But to me, it's less a language concern and more a question of how tasks/jobs/processes/whatever-you-call-them are managed, if they're managed at all. In programming languages today, devs spawn countless async sub-processes (promises, async tasks, whatever you call them) with abandon. And they're implicit and invisible in almost all systems. An operator or user isn't there introspecting what jobs are outstanding. Or managing the workflow of these subprocesses.

These workflow tools are just a tiny leap over what programmers do: they make computing real. They reify computing into a managed element, making the work being done a kind of data that is tracked through the system. Whereas in most programming, we have lots of tools for authoring and finagling subprocesses, but we are sorely lacking in tools to manage & expose & let the user control what's happening in the process. The process is mostly closed, except for what is afforded, and what is afforded is largely custom-crafted APIs (with folks like OTel starting to define some more common ways for processes to at least explain themselves, if not manage & interconnect with other processes).

I don't think the languages have to change. I think the languages can stay right where they are, for the most part. But we can and should be building better runtimes that let us interact with the objects & subprocesses, that let the app be commanded, turned into a task runner rather than also being the command-palette system. It's not super related, but there was a post on Audacity in the browser where @solardev suggested "headless" WASM apps with separate UIs, and I can easily see that idea applying here: a workflow pattern tool being the UI that orchestrates/drives WASM apps (which are practically just libraries).


The async/await stuff in JavaScript, at least, is built on the "generators" feature, which does let you effectively "pause" execution of a function, retaining its state. You can then resume from that state to continue executing.
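For illustration, the same dance in Python, whose generators work the same way (the JS version with function* and gen.next(value) is analogous):

    def steps():
        print("step 1")
        value = yield "paused"       # execution freezes here, state retained
        print("step 2 got", value)

    gen = steps()
    print(next(gen))                 # runs to the first yield -> "paused"
    try:
        gen.send("resume-data")      # resumes exactly where it paused
    except StopIteration:
        pass                         # generator ran to completion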


Sure, but would you still be able to resume that function 2 weeks later, after 5 process restarts due to code deploys and on a different server?


CRIU can apparently do that with whole processes: https://criu.org/Main_Page

Found a link to it from telefork: https://thume.ca/2020/04/18/telefork-forking-a-process-onto-...


One of my favorite workflow engines, which has a really simple way of doing things, was not listed here, so I'll call it out: Netflix Conductor (https://github.com/Netflix/conductor).

Its capabilities come to light when you model really complex workflows, and one real value is how it's all very visual, not just during modeling but also while running. The history remains visible and you can even see how the whole flow evolved.


My company has 200+ DAGs on Airflow, so that ship has sailed for me.

That said, Conductor looks really cool, thanks for sharing.


I recently created some Netflix Conductor workflows with Python workers and found it pretty good. Using a fork/join to kick off sub workflows was pretty easy. It definitely shows runtime variable data much better than Camunda.


Modeling workflows as state machines is incredibly powerful.

In addition to everything the author mentioned, the constraints of state machines allow a workflow platform to provide a ton of additional guarantees and capabilities around consistency, state propagation, reliable timers, inter-instance messaging, etc.

We built our workflow execution platform [1] around state machines and we've seen great results. We find our workflow code is incredibly simple and easy to understand.

[1] https://www.statebacked.dev
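A minimal illustration of the modeling style (not our actual API, just the shape of it): the whole workflow reduces to a transition table plus one pure function, which is exactly what lets a platform persist, replay, and reason about it.

    # States and events for a toy order workflow (hypothetical names).
    TRANSITIONS = {
        ("pending",  "approve"): "approved",
        ("pending",  "reject"):  "rejected",
        ("approved", "ship"):    "shipped",
    }

    def step(state, event):
        next_state = TRANSITIONS.get((state, event))
        if next_state is None:
            raise ValueError(f"illegal event {event!r} in state {state!r}")
        return next_state

    state = "pending"
    state = step(state, "approve")   # -> "approved"
    state = step(state, "ship")      # -> "shipped"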


This is exactly how I go about re-architecting overgrown workflows: reduce the workflow to a state machine. It takes forever to untangle what they want to do from what they are actually doing, but the result is always a more robust solution.

The other benefit of a state machine is the ability to accurately determine which parts can be collapsed into subworkflows, which allows for reuse, replacement, or general modification.


One of the biggest issues I've observed with workflow systems is an inability to get return values from each subprocess and then decide based on those which subprocesses to instantiate next. It always seems like the workflow execution graph is static in my experience. Is that correct?


At a prior company we just wrote our own custom workflow engine which was basically a DAG processor where each node called some code. Of course, the code called could modify the DAG in place and accomplish exactly what you're describing. It worked quite well but we didn't polish it as much as we could have. I've been wanting to do a clean-room reimplementation as of late just because I miss certain things about it.
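A toy version of that idea, with hypothetical node names (the real engine isn't public): each node returns the nodes to run next, so the graph unfolds from return values at runtime.

    from collections import deque

    def fetch(ctx):
        ctx["n"] = 3
        return [crunch] if ctx["n"] > 0 else [report]

    def crunch(ctx):
        ctx["result"] = ctx["n"] ** 2
        return [report]

    def report(ctx):
        print("result:", ctx.get("result"))
        return []

    def run(start):
        ctx, queue = {}, deque([start])
        while queue:
            node = queue.popleft()
            queue.extend(node(ctx))   # node output decides what runs next

    run(fetch)                        # prints "result: 9"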


Not sure what you've tried, but Temporal (which I think is the most mature and well-designed open source workflow engine available right now) supports using the result of a "subprocess" to decide the next steps. The only requirement is that your logic is consistent across workflow runs.


That sounds very similar to AWS Step Functions to me. You define a workflow graph in JSON, then run the workflow and hand out subprocesses to lambdas etc.


I've definitely seen workflows written as single-step graphs, where the output is opening a new workflow with a different single step.

It was hard to read and understand.


The couple of workflow engines I've used support that. It sounds crazy not to support something that obvious.


Maybe before, but I feel that is not a problem anymore. Windmill.dev solves this with ease. Very good UI on top of the workflow engine.


Why would you want to rewrite the code of the workflow during execution, vs defining that logic upfront?


That's like asking why anyone would want to use an if statement. "Why would you want to change which instructions get executed during the program? Just decide up front!" Because the decision cannot be made ahead of time.


The article is quite thorough and I enjoyed it, as it reminded me of my forays into the domain. One nook in this domain I never managed to investigate in the depth I would like is the Kinetic Rules Language [0], which makes the connection to complex event processing more pronounced.

CEP is not trendy, probably because of its history and traditional implementations, but I think it deserves a good look.

[0] https://github.com/Picolab/pico-engine is an implementation of the KRL


I built a feature of our platform on top of Apache NiFi (https://nifi.apache.org) and had a great experience. I would have preferred a non-JVM language, as that'd have been easier for us in a containerized environment, but I have no complaints about NiFi otherwise.


Interesting; for me NiFi is tied with Airflow as my most-hated software I've had to use professionally.

This was a few years ago, but NiFi only supported editing the DAG from a GUI; you could store the resulting XML in git, but no human could read or edit it. Python steps were stuck on Python 2.6 or 2.7. Pileups were hard to debug. Is it better now?


NiFi sounds a lot like Camunda, which is awful.


One of the things I never see is workflow integration with logging/tracing. For event sourcing, surely the log is the best port of call? Does anyone have a pattern around this, especially wrt ClickHouse integration?



