Re: 1. There is no "hoping"; the input disruptor guarantees a total ordering of events as written to a journal. This journal is then processed by the BLPs.
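To make that pipeline concrete, here is a minimal plain-Java sketch of the shape being described (my own illustration, not LMAX code; a BlockingQueue stands in for the real ring buffer): the input side assigns a monotonically increasing sequence number and appends each event to a journal before the single business-logic thread ever sees it, so the BLP always consumes one totally ordered stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Plain-Java sketch: the input side assigns a monotonically increasing sequence
// number and journals each event before the single business-logic thread (BLP)
// consumes it, so the BLP sees one totally ordered stream of events.
public class OrderedJournalSketch {

    record SequencedEvent(long sequence, String payload) {}

    static class InputSide {
        private long nextSequence = 0;
        private final List<SequencedEvent> journal = new ArrayList<>();  // durable log stand-in
        private final BlockingQueue<SequencedEvent> buffer;              // ring-buffer stand-in

        InputSide(BlockingQueue<SequencedEvent> buffer) {
            this.buffer = buffer;
        }

        synchronized void publish(String payload) throws InterruptedException {
            SequencedEvent event = new SequencedEvent(nextSequence++, payload);
            journal.add(event);   // journaled first, so the stream can be replayed after a crash
            buffer.put(event);    // then handed to the business logic processor
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<SequencedEvent> buffer = new ArrayBlockingQueue<>(1024);
        InputSide input = new InputSide(buffer);

        // Single business-logic thread: no locks or MVCC needed because only
        // this thread ever touches the in-memory domain state.
        Thread blp = new Thread(() -> {
            try {
                while (true) {
                    SequencedEvent event = buffer.take();
                    System.out.println("BLP processed #" + event.sequence() + ": " + event.payload());
                }
            } catch (InterruptedException ignored) {
                // shut down
            }
        });
        blp.setDaemon(true);
        blp.start();

        input.publish("place order A");
        input.publish("cancel order A");
        Thread.sleep(100);   // let the BLP thread drain the buffer before exit
    }
}
```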
Re: atomicity -- the BLP processes one message at a time, which ensures that you do not have multiple threads clashing, but you do not have MVCC or any form of rollback, which he addresses thusly:
>LMAX's in-memory structures are persistent across input events, so if there is an error it's important to not leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.
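To illustrate the discipline that quote is describing (a sketch under my own assumptions, not LMAX source): since there is no rollback, a handler has to check every precondition against the current in-memory state and throw before any field is touched; mutation only starts once validation is complete.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "fully validate before any mutation": there is no rollback, so no
// field of the in-memory state may change until every precondition has passed.
class AccountService {
    private final Map<String, Long> balances = new HashMap<>();

    void openAccount(String id, long initialBalance) {
        balances.put(id, initialBalance);
    }

    void transfer(String from, String to, long amount) {
        // 1. Validate everything first; throwing here leaves the state untouched.
        if (amount <= 0) throw new IllegalArgumentException("amount must be positive");
        Long fromBalance = balances.get(from);
        if (fromBalance == null) throw new IllegalArgumentException("unknown account: " + from);
        if (!balances.containsKey(to)) throw new IllegalArgumentException("unknown account: " + to);
        if (fromBalance < amount) throw new IllegalStateException("insufficient funds");

        // 2. Only now mutate; no partially applied transfer is possible above this line.
        balances.put(from, fromBalance - amount);
        balances.put(to, balances.get(to) + amount);
    }
}
```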
I am more concerned with the hand-waving around the failure case -- failing over to an alternate BLP does not prevent duplicate instructions. If processing an event would create multiple output events for the output disruptor, but the BLP is terminated before all of them are sent, you must either a) have some kind of multi/exec on output events, b) write code that is able to resume processing a message from an intermediate point, or c) otherwise prevent or accommodate duplicate output events from the same source event.
This is a result of the lack of "transactionality" that you are referring to, and I would love to read more about how they address this particular sticky wicket when a system fails.
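For what it's worth, one way option (c) could look -- purely my sketch, not anything LMAX has documented -- is to lean on the total ordering again: key each output event by its source event's sequence number plus an output index, and have the output side discard anything it has already published when a replacement BLP replays the event.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of option (c): tolerate duplicate output events after failover by keying
// each output on the source event's sequence number plus an output index, and
// discarding keys that have already been published.
class DeduplicatingOutputPublisher {
    private final Set<String> published = new HashSet<>();  // would be bounded/persisted in practice

    /** Returns true if the event was sent, false if it was a replayed duplicate. */
    boolean publish(long sourceSequence, int outputIndex, String payload) {
        String key = sourceSequence + ":" + outputIndex;
        if (!published.add(key)) {
            return false;  // a failed BLP already emitted this one; drop it
        }
        send(payload);
        return true;
    }

    private void send(String payload) {
        System.out.println("out -> " + payload);
    }
}
```

In practice that published set would have to be bounded or persisted alongside the output journal, which is exactly the kind of detail I would like to see them spell out.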
Single nodes can die in the system without issue. They often do! Since we use IP multicast, the network failure is transparent as a replica takes up the primary role.
The one issue to be managed with this type of system is exceptions in the business logic thread. This can be handled via a number of prevention techniques. First, apply very strict validation on all input parameters. Second, take a test-driven approach to development; at LMAX we have tens of thousands of automated tests. Third, write code so that methods are either idempotent, or changes are only applied at the end, using local variables, once the business logic is complete. With this combination of approaches we have not seen a production outage due to business logic thread failure in over a year of live operation.
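A rough sketch of that third technique (my illustration, not LMAX code): do all the intermediate work in local variables and only mutate the long-lived in-memory state in a single apply step at the end, so an exception thrown anywhere earlier leaves that state exactly as it was.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "changes are only applied at the end using local variables": the
// long-lived state is touched only in the final commit step, so any exception
// thrown during the earlier work leaves it exactly as it was.
class MatchingEngine {
    private final Map<String, Long> openQuantityByOrderId = new HashMap<>();

    void acceptOrder(String orderId, long quantity) {
        openQuantityByOrderId.put(orderId, quantity);
    }

    void processFill(String orderId, long fillQuantity) {
        // Work entirely in local variables until the event is fully understood.
        Long open = openQuantityByOrderId.get(orderId);
        if (open == null) throw new IllegalArgumentException("unknown order: " + orderId);
        if (fillQuantity <= 0 || fillQuantity > open)
            throw new IllegalArgumentException("bad fill quantity: " + fillQuantity);
        long remaining = open - fillQuantity;  // computed locally, nothing mutated yet

        // Single apply step at the end: the only line that mutates shared state.
        openQuantityByOrderId.put(orderId, remaining);
    }
}
```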