The LMAX system is more about latency than throughput. When you design for very low latency, the result can be a system that also achieves very high throughput, provided the appropriate techniques are employed. Single-threaded applications are well suited to low latency because of the lock contention they avoid and the predictability they bring.
To make a useful application, some means of reliably delivering messages to and from this single thread is necessary, and those messages must still be delivered in the event of system failures. To address this, the Disruptor is employed to pipeline, and run in parallel, the replication, journalling and business logic for these messages. The whole system is asynchronous and non-blocking.
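To make that pipeline concrete, here is a minimal sketch of how such a topology can be wired with the Disruptor 3.x DSL. The type and handler names (OrderEvent, journaller, replicator, and so on) are illustrative stand-ins, not LMAX's actual classes, and the handler bodies are left as placeholders.

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// Hypothetical event type carrying a raw inbound message.
class OrderEvent {
    byte[] payload;
}

public class PipelineSketch {
    public static void main(String[] args) {
        // Journalling and replication run in parallel; the business logic
        // handler only sees an event once both of them have processed it.
        EventHandler<OrderEvent> journaller =
            (event, sequence, endOfBatch) -> { /* append to local journal */ };
        EventHandler<OrderEvent> replicator =
            (event, sequence, endOfBatch) -> { /* send to replica nodes */ };
        EventHandler<OrderEvent> businessLogic =
            (event, sequence, endOfBatch) -> { /* single-threaded processing */ };

        Disruptor<OrderEvent> disruptor = new Disruptor<>(
            OrderEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE);

        // Dependency graph: {journaller, replicator} -> businessLogic.
        disruptor.handleEventsWith(journaller, replicator)
                 .then(businessLogic);

        RingBuffer<OrderEvent> ringBuffer = disruptor.start();

        // A gateway thread publishes into the ring buffer without blocking
        // on any of the consumers.
        ringBuffer.publishEvent((event, sequence) -> event.payload = new byte[0]);
    }
}
```

Publishers and consumers only ever coordinate through the ring buffer's sequences, which is what keeps the whole path non-blocking.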
In our architecture we have multiple gateway nodes that handle border security and protocol translation to and from our internal IP multi-cast binary protocol for delivery to highly redundant nodes. Lots of external connections can be multiplexed down from the outside world this way.
The 6 million TPS refers to our matching engine business logic event processor. We have other business logic processors for things like risk management and account functions. These can all communicate via our guaranteed message delivery system that can survive node failures and restarts, even across data centres.
Modern financial exchanges can process over 100K TPS and have to respond with latency in the 100s of microseconds firewall to firewall, thus including the entire internal infrastructure. Those tracking the latest developments will know it is possible to have single-digit-microsecond network hops using IP multicast with user-space network stacks and RDMA over 10GigE. Even a well tuned 1GigE stack can achieve sub-40us for a network hop. For reference, single-digit microseconds is in the same space as a context switch on a lock with the kernel arbitrating. Most financial exchanges rely on having data on multiple nodes before the transaction is secure. A number of these nodes can be asynchronously journalling the data down to disk. At LMAX we tend to have data on 3 or more nodes at any given time.
In my experience of profiling many business applications, when the application domain is well modelled the vast majority of the time is spent either in protocol translation, such as XML or JSON to business objects, or within the JDBC driver doing buffer copying and waiting on the database to respond.
Often applications are not well modelled for their domain. This can result in algorithms that, rather than being O(1) for most transactions, have horrible scale-up characteristics because inappropriate collections are used to represent relationships. If you have the luxury of developing an in-memory application requiring high performance, it quickly becomes apparent that the cost of a CPU cache miss is the biggest limitation on latency and throughput. To deal with this one needs to employ data structures that exhibit good mechanical sympathy for the CPU and memory subsystem. At LMAX we have replaced most of the JDK collections with our own that are cache friendly and garbage free.
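As an illustration of what "cache friendly and garbage free" means in practice (this is a minimal sketch of the general technique, not LMAX's actual implementation), an open-addressed map keyed on primitive longs avoids both boxing and per-entry node objects, so a lookup walks a flat array rather than chasing pointers:

```java
// Open-addressed long-keyed map with linear probing. One flat key array and
// one flat value array, no boxing, no per-entry allocation. Assumes the
// initial capacity is a power of two and that 0 is never used as a key
// (it is reserved as the "empty slot" marker).
public final class LongObjectMap<V> {
    private long[] keys;      // 0 marks an empty slot
    private Object[] values;
    private int size;
    private int mask;

    public LongObjectMap(int capacityPow2) {
        keys = new long[capacityPow2];
        values = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    public void put(long key, V value) {
        if (size > (keys.length >> 1)) {
            resize();                       // keep load factor <= 50%
        }
        int index = indexOf(key);
        if (keys[index] == 0) {
            keys[index] = key;
            size++;
        }
        values[index] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(long key) {
        int index = indexOf(key);
        return keys[index] == key ? (V) values[index] : null;
    }

    // Linear probing keeps the probe sequence in adjacent slots, so a lookup
    // typically stays within one or two cache lines.
    private int indexOf(long key) {
        int index = hash(key) & mask;
        while (keys[index] != 0 && keys[index] != key) {
            index = (index + 1) & mask;
        }
        return index;
    }

    private static int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // Fibonacci hashing to spread bits
        return (int) (h ^ (h >>> 32));
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        keys = new long[oldKeys.length << 1];
        values = new Object[keys.length];
        mask = keys.length - 1;
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != 0) {
                put(oldKeys[i], (V) oldValues[i]);
            }
        }
    }
}
```

Compared with HashMap<Long, V>, this does no autoboxing on get/put and allocates nothing on the hot path, which is what keeps both the garbage collector and the caches quiet.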
So far we have had no issue processing all transactions for a given purpose on a single thread, or holding all the live state in memory on a single node. If we ever cannot process all the necessary transactions on a single thread, then we simply shard the model across threads/execution contexts. We only hold the live mutating data in memory; completed transactions are archived out to a database, as they are then read only.
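A minimal sketch of what sharding the model across execution contexts could look like, assuming events can be partitioned by some key such as an instrument id (the names here are hypothetical): each shard remains single-threaded, so no per-shard state ever needs a lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Route all work for a given instrument to the same single-threaded
// executor, preserving the single-writer property within each shard.
public final class ShardedProcessor {
    private final ExecutorService[] shards;

    public ShardedProcessor(int shardCount) {
        shards = new ExecutorService[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void submit(long instrumentId, Runnable work) {
        // Every event for a given instrument lands on the same thread,
        // so per-instrument state is only ever touched by one thread.
        int shard = (int) Math.floorMod(instrumentId, (long) shards.length);
        shards[shard].execute(work);
    }
}
```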
Martin (LMAX CTO)