Webhook Failure Scenarios (hermanradtke.com)
100 points by luu on Sept 15, 2023 | 45 comments


Webhooks are fun to think about. A couple more issues off the top of my head:

* Ordering. Since network requests can take variable amounts of time, how do you ensure those two "foo.updated" events are processed in order, or that the receiver can tell their order? Especially something to consider if the webhooks will retry a few times on intermittent failures.

* Consistency. Always a concern in distributed microservice land, but maybe more acute when generating webhooks right as things are updated: if the receiver uses the webhook to make an API request right back into the system, will the API have the same view of the data?

* DDoS. How do you make sure the webhook destination URL is owned by the subscriber? If your system can generate high volumes of webhook traffic, that could be a problem.

* Infinite loops. A silly one, but the user could conceivably point the webhook back at a URL of the system that sends it, in such a way that handling the request generates a new webhookable event.


Ordering is something that everyone always forgets. The only way to actually guarantee they are sent and received in order is to use a lock on whatever subset of messages you care about being ordered. Then only process one of those at a time.

At my last company I built an ordered transaction outbox to handle this.

Each message has a partition key. When the worker picks up a new message to process it locks the partition key, so that it’s the only worker processing any messages with that partition key.

Different messages use different things for their partition key. For some messages where we only care about the order within a given order (it’s an order management system) we use the order_id. For other messages where we care about global ordering we use the tenant_id etc…

You could just use tenant_id for every message, but by using the most granular partition that you can, you can get a lot more parallelism.
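
For a concrete picture, a stripped-down sketch of that worker loop (not the real code; it assumes Postgres, an outbox table with id/partition_key/payload/sent_at columns, and a transaction-scoped advisory lock on the partition key):

    package outbox

    import (
        "context"
        "database/sql"
    )

    type message struct {
        id      int64
        payload []byte
    }

    // processOnePartition claims a single partition key with a transaction-scoped
    // advisory lock, so no other worker handles messages for that key at the same
    // time, then drains its unsent messages strictly in insertion order.
    func processOnePartition(ctx context.Context, db *sql.DB, send func([]byte) error) error {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        defer tx.Rollback()

        // Partitions that currently have pending work.
        candidates, err := queryStrings(ctx, tx,
            `SELECT DISTINCT partition_key FROM outbox WHERE sent_at IS NULL`)
        if err != nil {
            return err
        }

        // Claim the first one nobody else holds; the advisory lock is released
        // automatically when the transaction ends.
        var key string
        for _, k := range candidates {
            var locked bool
            if err := tx.QueryRowContext(ctx,
                `SELECT pg_try_advisory_xact_lock(hashtext($1))`, k).Scan(&locked); err != nil {
                return err
            }
            if locked {
                key = k
                break
            }
        }
        if key == "" {
            return nil // nothing claimable right now
        }

        // Load this partition's pending messages in insertion order.
        rows, err := tx.QueryContext(ctx,
            `SELECT id, payload FROM outbox WHERE partition_key = $1 AND sent_at IS NULL ORDER BY id`, key)
        if err != nil {
            return err
        }
        var msgs []message
        for rows.Next() {
            var m message
            if err := rows.Scan(&m.id, &m.payload); err != nil {
                rows.Close()
                return err
            }
            msgs = append(msgs, m)
        }
        rows.Close()

        // Send in order; stop at the first failure so nothing overtakes it.
        for _, m := range msgs {
            if err := send(m.payload); err != nil {
                break
            }
            if _, err := tx.ExecContext(ctx,
                `UPDATE outbox SET sent_at = now() WHERE id = $1`, m.id); err != nil {
                return err
            }
        }
        return tx.Commit()
    }

    func queryStrings(ctx context.Context, tx *sql.Tx, q string) ([]string, error) {
        rows, err := tx.QueryContext(ctx, q)
        if err != nil {
            return nil, err
        }
        defer rows.Close()
        var out []string
        for rows.Next() {
            var s string
            if err := rows.Scan(&s); err != nil {
                return nil, err
            }
            out = append(out, s)
        }
        return out, rows.Err()
    }

Holding the transaction across the network sends keeps the sketch short; in real life you would track attempts separately so a slow receiver doesn't pin a DB connection.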


Well for starters, you should by design not create an interface where individual discrete events are required to be delivered without drops or strictly in order. Once you start down that path, you have to be able to answer what you do when the destination is blocked (for any reason: network congestion or outage, service outage, etc.). This is the path where your internal queue (whether physical or virtual - that is, a cursor) starts requiring potentially days of retention and the ability to stream from any point in that history. Been there, done that. You can implement that retention in all sorts of ways (rows in a database, messages in Kafka, etc.), each of which has interesting cost dynamics and corner cases.

You are almost better off with a state compressed reporter that has the semantics of always reporting a _sensible_ series of events to the external endpoint where _dependent_ events have sensible relative order ("x was added, x was deleted") that converges on current state no matter how slow the outbound endpoint is and without regard to how many of them there are. No key locking and no buffering, O(1) storage per walker - just a walk on conveying state. This requires careful design of both the system _and_ the schema (boolean state needs to be accompanied with a counter, for example, to convey collapsed/consolidated history of the state changes in this model).

That's a lot more work, but it is possible; this kind of thing dates to the '80s or earlier.

The only in-order/no-drops case that is really common is logs (which is a poor way of conveying event/state tracking) and webhooks are a very poor choice for that kind of thing.

Also, in all cases, receivers have to accept that they may receive the same notification more than once. Obviously there are ways around this, but they are expensive at scale.


I would really like to understand the second paragraph, but I don’t understand a single sentence. What exactly are you proposing? What’s a state compressed reporter?


The reporter is the system that sends updates to your external observers; observers register with the reporter and the reporter facilitates their consumption of events.

In the case of a webhook, this is the first tier (the actual event store) that individual webhook contexts are following (on the service side). You need these to be per-outbound-destination to avoid head-of-line blocking.

I am proposing that people use state synchronization paradigms and not simple event replication for their notification systems. Since the remote side is not within your control (typically) the state compression all needs to be service side.
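
A very rough sketch of what I mean, with made-up names (the point is just that per-destination storage is a cursor over consolidated state, never a buffered event list):

    package reporter

    import "sync"

    // Entry is the consolidated ("compressed") state for one key: the latest
    // value, a change counter so a slow consumer can tell intermediate updates
    // were collapsed, and a tombstone so deletes still converge.
    type Entry struct {
        Key     string
        Value   string
        Changes uint64
        Deleted bool
    }

    // Reporter holds only current state, never a per-destination event buffer.
    type Reporter struct {
        mu    sync.Mutex
        keys  []string // stable walk order
        state map[string]*Entry
    }

    func New() *Reporter {
        return &Reporter{state: make(map[string]*Entry)}
    }

    // Apply folds an update into current state; history is not retained.
    func (r *Reporter) Apply(key, value string, deleted bool) {
        r.mu.Lock()
        defer r.mu.Unlock()
        e, ok := r.state[key]
        if !ok {
            e = &Entry{Key: key}
            r.state[key] = e
            r.keys = append(r.keys, key)
        }
        e.Value, e.Deleted = value, deleted
        e.Changes++
    }

    // Walker is one outbound destination. Its only storage is a cursor, so a
    // blocked or slow destination costs O(1) regardless of update volume.
    type Walker struct {
        r   *Reporter
        pos int
    }

    func (r *Reporter) NewWalker() *Walker { return &Walker{r: r} }

    // Next returns the consolidated entry at the cursor; the consumer converges
    // on current state by repeatedly walking, however slowly it drains.
    func (w *Walker) Next() (Entry, bool) {
        w.r.mu.Lock()
        defer w.r.mu.Unlock()
        if len(w.r.keys) == 0 {
            return Entry{}, false
        }
        if w.pos >= len(w.r.keys) {
            w.pos = 0
        }
        e := *w.r.state[w.r.keys[w.pos]]
        w.pos++
        return e, true
    }

A real system also has to prune tombstones once every walker has passed them and signal that history was collapsed (that's what the change counter is for), but the storage story stays O(1) per destination.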

If you want to explore systems like this, you should start with the basics of how routing protocol peering actually works inside routing protocol implementations. In that context, you have dozens to hundreds of fast and slow consumers that are trying to converge with what is, basically a forever changing data set through the exchange of adds, deletes, and updates with peers. Any given downstream peer may block for extended periods of time, they may resume, and so on, and you cannot buffer significant amounts of data per peer (since the stack _must_ consume updates from its own peers).

Alternately, start looking into how databases sync with one another beyond the simple replicated transaction log sense.


Their comment makes no real sense to me either.

I can't tell from looking at their comment history whether or not they're AI, but their profile says 'I am nobody.'

It's a strange world.


In many cases none of that is possible because you aren’t designing the entire system. Either because it’s already been partially implemented, or because you’re modeling a system that does require ordering with no drops.

And whether webhooks are a good choice or not, they are often required for one reason or another.


> At my last company I built an ordered transaction outbox to handle this.

Almost all message queue/event streaming systems support FIFO (SQS, RabbitMQ, ActiveMQ, Kafka, Kinesis).

If you use one of these, it's easy to solve ordering.


If you only use one of those for everything, it's easier. I still wouldn't say it's easy, because it requires thought and in my experience is commonly done incorrectly. I've regularly seen everything from picking the wrong partition key to grabbing a bunch of messages from a single partition and processing them concurrently.

If you have an existing system where some services are fully event sourced, or some depend on complicated database transactions but still want to publish to a Kafka topic etc…, it gets more complicated.


This is just for arrival order though. The GP mentions transaction-based operations; handling this without massive hits to parallel processing is not an easy problem to solve.


? I'm not sure I understand.

You choose an FIFO key, e.g. order ID.

Any message with that key is processed serially. Ordering between keys is unspecified.

It is important to choose a key that guarantees sufficient ordering (for correctness), but not too much (for performance).
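
Conceptually it's nothing more than this (a sketch; numPartitions and the hash are whatever your broker actually uses):

    package fifo

    import "hash/fnv"

    // partitionFor maps a FIFO key (e.g. an order ID) onto one of n partitions.
    // Everything sharing a key lands on the same partition, and each partition
    // is consumed serially, so per-key order holds while different keys still
    // run in parallel.
    func partitionFor(key string, numPartitions int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32() % uint32(numPartitions))
    }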


> You choose an FIFO key, e.g. order ID.

This is the part of the system that is doing most of the heavy lifting. Exhibit A, the struggles of Elasticsearch: https://archive.is/rm1UI


Could you not timestamp your messages and then let the client do the ordering? Granted, they might get messages out of order, but I never promised them their webhooks would be in order.


Timestamps from where? There is not necessarily going to be a central authority or processor that can guarantee these are in sync across distributed systems.


Yes, you can push the problem to your users.


Regular timestamps can only be used reliably for ordering if all the timestamps come from a single system. Different systems have different clocks, so they'll never be precisely in sync. You can also have timestamp collisions, which makes ordering ambiguous.


This is assuming the use of a monotonic clock; the system wall clock may jump backwards in time. It also assumes that only one ID can be generated per millisecond (or whatever the clock resolution is). In the case of multiple threads or processes on multiple cores, it becomes more likely that there will be a collision.

Timestamps don't make for a great ordering key.
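
One common alternative, sketched below with a hypothetical webhook_seq table keyed by partition_key: have the sender hand out a per-key sequence number from its own database, so receivers can order and de-duplicate on (partition_key, seq) without trusting anyone's clock.

    package sender

    import (
        "context"
        "database/sql"
    )

    // nextSeq atomically bumps and returns the per-key counter on the sender
    // side (webhook_seq has partition_key as its primary key). Clock skew,
    // backwards jumps, and same-millisecond collisions no longer matter.
    func nextSeq(ctx context.Context, db *sql.DB, partitionKey string) (int64, error) {
        var seq int64
        err := db.QueryRowContext(ctx, `
            INSERT INTO webhook_seq (partition_key, seq) VALUES ($1, 1)
            ON CONFLICT (partition_key) DO UPDATE SET seq = webhook_seq.seq + 1
            RETURNING seq`, partitionKey).Scan(&seq)
        return seq, err
    }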


Ordering is even more complex than that! I previously wrote a post about it, as it comes up a lot with Svix.

https://www.svix.com/blog/guaranteeing-webhook-ordering/


I've only ever worked with two webhook systems (Eventbrite and PayPal) and both have those issues handled. I'd say you should just look at how others solved it.


Some interesting previous discussions here:

- Best Practices for Using Webhooks (stripe.com): https://news.ycombinator.com/item?id=32521159

- Collection of best practices for providing and consuming webhooks (webhooks.fyi): https://news.ycombinator.com/item?id=32517645

- Give me /events, not webhooks (sequin.io): https://news.ycombinator.com/item?id=27823109


For a server sending webhooks to endpoints entered by users, take care that:

a: The FQDN does not resolve to an RFC1918 address (You don't want to be POSTing payloads to endpoints within your internal network.)

b: If you follow redirect responses (it's easier to just not do so, for other reasons as well), also make sure those don't resolve to internal addresses!
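
A rough Go version of check (a), assuming you resolve the hostname yourself before dispatch (netip has most of the predicates built in):

    package hooks

    import (
        "context"
        "fmt"
        "net"
    )

    // validateDestination resolves the webhook hostname and rejects anything
    // pointing at private, loopback, link-local, multicast, or unspecified
    // address space. By itself this is not enough if the name gets resolved
    // again when the request is actually made (see the rebinding discussion
    // further down).
    func validateDestination(ctx context.Context, host string) error {
        ips, err := net.DefaultResolver.LookupNetIP(ctx, "ip", host)
        if err != nil {
            return err
        }
        for _, ip := range ips {
            ip = ip.Unmap()
            if ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() ||
                ip.IsLinkLocalMulticast() || ip.IsMulticast() || ip.IsUnspecified() {
                return fmt.Errorf("%s resolves to disallowed address %s", host, ip)
            }
        }
        return nil
    }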


The attacker-controlled DNS record pointing to an internal private address or an explicit redirect is a classic, especially if they can control the event template being used and the service relies entirely on edge filtering... Too much template control is a risk.

I mean, there's a lot of things you should do when dealing with this that most people don't pay attention to:

https://datatracker.ietf.org/doc/html/rfc2606 https://datatracker.ietf.org/doc/html/rfc3927 https://datatracker.ietf.org/doc/html/rfc4193 https://datatracker.ietf.org/doc/html/rfc6761

... and so on. At least in Go, some of the handy checks are simplified by IP.IsPrivate, IsLoopback, IsMulticast, IsInterfaceLocalMulticast, IsLinkLocal*, etc.


It's been years since I've looked into this problem but to tackle it properly one shouldn't just resolve the domain and check that the IP is acceptable. The HTTP client library needs to be involved by providing a way to run code just before creating the socket, which very few do.
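
In Go that hook does exist: the Dialer's Control callback runs after name resolution, just before the socket is connected, so the check applies to the address actually being dialed. A sketch (the predicate set is up to you):

    package hooks

    import (
        "fmt"
        "net"
        "net/http"
        "net/netip"
        "syscall"
        "time"
    )

    // newOutboundClient builds an http.Client whose dialer re-checks the
    // resolved address immediately before the socket is connected, closing the
    // gap where a second DNS lookup (or a redirect) lands on an internal IP.
    func newOutboundClient() *http.Client {
        dialer := &net.Dialer{
            Timeout: 10 * time.Second,
            // Control receives the already-resolved "ip:port" being dialed.
            Control: func(network, address string, _ syscall.RawConn) error {
                ap, err := netip.ParseAddrPort(address)
                if err != nil {
                    return err
                }
                ip := ap.Addr().Unmap()
                if ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
                    return fmt.Errorf("refusing to dial %s", ip)
                }
                return nil
            },
        }
        return &http.Client{
            Timeout:   30 * time.Second,
            Transport: &http.Transport{DialContext: dialer.DialContext},
        }
    }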


Oh interesting... would love to hear more about that. I guess what could happen is:

You would:

1. Resolve the FQDN

2. Check the IP

3. Make the request

When the request is actually made, the FQDN is resolved again and a different IP is returned?


Those terms are very confusing - to me origin would be the SENDER ... where the message ORIGINates from ...

But yes, those are common scenarios, and most services that offer calling a webhook have retries baked in for such things. It is the responsibility of the consumer to verify if they have already received a message or not.

You should not be consuming a webhook 'live' and attempting to work on it the moment you receive it - that way lies dragons!

I have dedicated endpoints for services to send to, and they are verified and consumed as quickly as possible and put into a queue to be processed after. That way I can at least see what was sent, and if we missed something.
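
The receiving side of that pattern is small; a sketch (the signature header and the enqueue function are stand-ins for whatever the provider and your queue actually use):

    package receiver

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "io"
        "net/http"
    )

    // webhookHandler verifies the payload, persists it for later processing,
    // and acknowledges immediately; the real work happens in a worker that
    // drains the queue.
    type webhookHandler struct {
        secret  []byte
        enqueue func(payload []byte) error // e.g. insert into a jobs table
    }

    func (h *webhookHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
        if err != nil {
            http.Error(w, "read error", http.StatusBadRequest)
            return
        }

        // Check the sender's HMAC signature (header name varies by provider).
        mac := hmac.New(sha256.New, h.secret)
        mac.Write(body)
        want := hex.EncodeToString(mac.Sum(nil))
        if !hmac.Equal([]byte(want), []byte(r.Header.Get("X-Signature"))) {
            http.Error(w, "bad signature", http.StatusUnauthorized)
            return
        }

        // Persist first, then ack; processing happens out of band.
        if err := h.enqueue(body); err != nil {
            http.Error(w, "queue error", http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusOK)
    }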


> to me origin would be the SENDER

My history with CDNs is that origin refers to the server sitting behind the proxy. I do agree that

> where the message ORIGINates from

makes sense too.


A CDN origin refers to the server where the content ORIGINates from.


There are so many more failure scenarios!

I wrote a post about common TLS errors recently: https://www.svix.com/blog/ssl-tls-incomplete-certificate-cha...


That needs better naming in my opinion.

Just sender and receiver would be enough IMO.

Maybe it is just me, but it is unusual to see origin used in this context; even if it is technically correct, I only ever see it used with CDNs and load balancing.


I was frustrated with my naming too.

> technically correct, I only ever see it used with CDNs and load balancing

That is why I used it!

I responded in https://news.ycombinator.com/item?id=37525971 that maybe I should change it. Sender/receiver is better.


I’ve always referred to those client/server.


I think webhooks are even more misleading because they are server to server.


And the client is actually the sender.


At the most basic level, what I like to do is create a webhook response controller: the controller enqueues a job and immediately returns an HTTP 200.

The job is saved in my postgres database, using Oban (Elixir) so I have observability, accurate bug stacktraces are saved in the job record's error field, and I can see how many times the job has been attempted.


Very cool article.

I'm one of the creators of https://webhooks.fyi/ and will do a PR to add this one. Thanks!


Cool site. Seems like a great learning resource, nice work. (No affiliation, just a sucker for good docs)


The way I solve all these problems is:

Prerequisite: require the receiver to be able to process a duplicate webhook being re-sent.

Algorithm outline:

1. Have webhook body stored by sender

2. Attempt to send body to receiver

3. If a 2xx status is returned, mark the webhook as sent. Done.

4. Otherwise, for any other status or a timeout, increment the attempt count for the webhook and try again later, up to a maximum number of attempts (after which the webhook is marked as failed).

This covers the tricky case where the success response gets dropped in the network: the sender times out and retries later, and the receiver gets a duplicate.
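
In code the sender side is roughly this (field names are illustrative; the scheduler that re-picks pending webhooks with backoff is left out):

    package sender

    import (
        "bytes"
        "context"
        "net/http"
    )

    type webhook struct {
        url      string
        body     []byte // stored by the sender (step 1)
        attempts int
        status   string // "pending", "sent", "failed"
    }

    const maxAttempts = 10

    // deliver makes one attempt (step 2) and updates the bookkeeping. Anything
    // other than a 2xx - including a timeout, where the receiver may well have
    // gotten the request even though we never saw the response - leaves the
    // webhook pending for a later retry, which is why receivers must tolerate
    // duplicates.
    func deliver(ctx context.Context, client *http.Client, wh *webhook) {
        req, err := http.NewRequestWithContext(ctx, http.MethodPost, wh.url, bytes.NewReader(wh.body))
        if err != nil {
            wh.status = "failed"
            return
        }
        req.Header.Set("Content-Type", "application/json")

        resp, err := client.Do(req)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode >= 200 && resp.StatusCode < 300 {
                wh.status = "sent" // step 3
                return
            }
        }
        wh.attempts++ // step 4: retry later with backoff until the cap
        if wh.attempts >= maxAttempts {
            wh.status = "failed"
        }
    }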


You should describe how you are backpressuring internally while this is going on and also how you send an indicator to the receiver that they encountered a gap.


Any suggestions for alternatives to webhooks? I feel a 'pull'-based model with cursors and long polling would be simpler and more reliable than webhooks.


Webhooks tend to make a lot more sense in event-driven applications. They introduce additional complexity when it comes to all sorts of edge cases - agree that a pull system is probably much easier to do error handling for; my only concern would probably be performance.


Simple: cron job, one per minute, check for recently created (and/or updated) entries.
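
One shape of that, sketched with made-up table and column names: expose events with an ID cursor so the consumer can resume where it left off and re-polling the same cursor is harmless.

    package pull

    import (
        "context"
        "database/sql"
    )

    type event struct {
        id      int64
        payload []byte
    }

    // eventsAfter returns up to limit events with an ID greater than the
    // consumer's cursor. The consumer keeps the highest ID it has processed
    // and sends it back on the next poll, so nothing is silently dropped.
    func eventsAfter(ctx context.Context, db *sql.DB, cursor int64, limit int) ([]event, error) {
        rows, err := db.QueryContext(ctx, `
            SELECT id, payload FROM events
            WHERE id > $1
            ORDER BY id
            LIMIT $2`, cursor, limit)
        if err != nil {
            return nil, err
        }
        defer rows.Close()
        var out []event
        for rows.Next() {
            var e event
            if err := rows.Scan(&e.id, &e.payload); err != nil {
                return nil, err
            }
            out = append(out, e)
        }
        return out, rows.Err()
    }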


WebSockets.


right out of the gate: why not use the terms "sender" and "receiver" instead of "client" and "origin"?

the author uses "client" to refer to the application sending the webhook, and "origin" to refer to the application receiving the webhook. this is all backwards to me.

I often say that naming things is hard, but it isn't so hard that this needs to be the result.


Fair call out. I responded to similar feedback here: https://news.ycombinator.com/item?id=37525992


FWIW I wrote a program that used webhooks and RSS feeds and needed to prefix both with their direction; I ended up choosing inbound/outbound.



