I built a PoC of a 5us trading system (guaranteed 5us response in every situation) for a brokerage house a long time ago, around the time of LMAX Disruptor. It was a one-man job and I had to start from nothing (they had no in-house knowledge at all). Fun project and I learned a lot.
* full kernel bypass (I even implemented a driver for the networking hardware)
* everything that could disrupt the application disabled (like SMI interrupts, etc.); memory mapped with huge pages to prevent TLB lookup failures, etc.
* the application consists of threads pinned to specific cores
* each thread on the path from market data to sending the order never calls the operating system for anything; when it has nothing to process it busy-spins (a rough C sketch of this follows the list)
* all memory preallocated carefully so that it stays local to the core that uses it
* data flows from the networking hardware into one core and then passes through the cores via the Disruptor, each core doing further processing and publishing signals to the next core
* the main insight was that rather than wait for market signals to then decide what to do, you can precalculate your responses up to and including the actual message to be sent to the exchange.
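To make the pinning/busy-spinning part concrete, here is a rough C sketch of the pattern. This is not the original code; the core number, ring size and types are made up for illustration, and the real system sat on top of the kernel-bypass driver rather than plain pthreads.

    /* Sketch: a consumer thread pinned to one core, busy-spinning on a
       Disruptor-style published-sequence counter. Illustrative only. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 4096                       /* power of two */

    static uint64_t ring[RING_SIZE];
    static _Atomic uint64_t published;           /* last sequence written upstream */

    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *consumer(void *arg) {
        (void)arg;
        pin_to_core(3);                          /* arbitrary core for the example */
        uint64_t next = 1;
        for (;;) {
            /* No syscalls, no sleeping: just spin until the producer publishes. */
            while (atomic_load_explicit(&published, memory_order_acquire) < next)
                ;                                /* could insert a pause hint here */
            uint64_t value = ring[next & (RING_SIZE - 1)];
            (void)value;                         /* ...stage-specific processing... */
            next++;
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, consumer, NULL);
        /* A producer on another pinned core writes ring[seq & mask], then
           atomically stores `published = seq` with release ordering. */
        pthread_join(t, NULL);
        return 0;
    }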
> the main insight was that rather than wait for market signals to then decide what to do, you can precalculate your responses up to and including the actual message to be sent to the exchange.
I saw a talk about this dialed up to eleven: the entire processing occurred in a "smart NIC" instead of the CPU. The response would start getting sent even as the inbound packet was still being received. The go/no-go decision was effectively just sending the final CRC bytes correctly or deliberately incorrectly, thus invalidating the outbound packet that was already 99% sent.
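For the curious, the trick is easier to see in software terms, even though the real thing happens in FPGA logic on the NIC. A hedged C sketch using zlib's crc32 (the same CRC-32 used for the Ethernet FCS); the function name and framing details are mine, not from the talk:

    /* Sketch of "kill the packet via its CRC": stream the frame, then append
       either the correct FCS or a deliberately corrupted one so the receiving
       side drops it. Concept only; real implementations live in NIC/FPGA
       hardware and handle the FCS there. */
    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>                 /* crc32(): same polynomial as Ethernet FCS */

    /* Copies frame into out and appends a 4-byte FCS; returns total length. */
    static size_t finish_frame(const uint8_t *frame, size_t len,
                               int go, uint8_t *out) {
        memcpy(out, frame, len);
        uint32_t fcs = (uint32_t)crc32(0L, frame, (uInt)len);
        if (!go)
            fcs = ~fcs;               /* wrong FCS: the frame gets discarded */
        out[len + 0] = (uint8_t)(fcs);        /* FCS appended least-significant */
        out[len + 1] = (uint8_t)(fcs >> 8);   /* byte first                     */
        out[len + 2] = (uint8_t)(fcs >> 16);
        out[len + 3] = (uint8_t)(fcs >> 24);
        return len + 4;
    }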
Before that talk I couldn't figure out why there was a market for NICs with embedded FPGAs, CPUs and memory.
Day traders basically subsidised these things, and now they do efficient packet switching for large cloud providers.
Reminds me of how crypto-mining subsidised a lot of GPU development, and now we have 4K ray tracing and AIs thanks to that.
> The go/no-go decision was effectively just sending the final CRC bytes correctly or deliberately incorrectly, thus invalidating the outbound packet that was already 99% sent.
This trick will get you banned on some exchanges now :)
Another one, which has been public knowledge for years and is also often penalized, is to send a TCP fragment with the header of the message well in advance, "booking" a place in the queue, and then send the finishing fragment with the real order after doing all the calculations.
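Mechanically it is nothing more exotic than two sends on the same connection. A sketch with hypothetical sizes and layout (and, as said, doing this will get you a phone call):

    /* Sketch: push the fixed header of an order message early to "book" a spot
       in the TCP stream, then complete the message once the decision is made.
       Purely illustrative; exchanges detect and penalize this. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stddef.h>
    #include <sys/socket.h>

    void early_header_then_body(int fd,
                                const void *header, size_t header_len,
                                const void *body, size_t body_len) {
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));  /* no Nagle */

        /* 1. Send the message header well in advance. */
        send(fd, header, header_len, 0);

        /* ...do the real pricing/decision work here... */

        /* 2. Finish the message with the actual order once decided. */
        send(fd, body, body_len, 0);
    }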
This raises the question of how an exchange efficiently detects, logs and takes action against these kinds of behaviours without increasing its own latency too much and (perhaps?) affecting the market.
Does it even matter if a centralised exchange increases its own latency when all market participants have to go through it? I can only think of the case when a security is listed on multiple exchanges, where the latency could mean a small arbitrage opportunity.
Exchanges rarely care about their absolute latency. The latency race is for the best place in the order entry queue. As soon as the order is queued for sequential processing by the risk checker or the matching engine, the race is over. I've seen places where you needed sub-microsecond tick-to-order latency to win the race for the front of the queue, but the actual risk check and matching took tens of milliseconds.
They do care about throughput and about providing fair conditions to all of the participants, though. On the busiest derivatives exchanges this means resorting to FPGAs for the initial checks.
Then, every message sent to the exchange is clearly traceable. In some cases participants have dedicated physical lines. When the exchange sees an increased rate of malformed packets from a single line or from a certain participant, they just cut it off and call the contact person on the participant (trader/broker) side to demand an explanation.
Most exchanges have switched to order gateways that are either FPGA- or ASIC-based.
Also every packet you send to an exchange is trivially attributed. They just kick you off if your shenanigans cause a problem. And then they tell all the other exchanges about you.
Cool! I actually wasn't aware of NICs with FPGAs on them. You learn something new every day on HN.
My solution wasn't as fast and it could never do what you describe (start sending bytes before the packet was fully received). The market signal messages were actually batched together (usually one to five), compressed with zlib and sent as a single multicast packet.
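In its most pedestrian form (ordinary sockets instead of the kernel-bypass path, placeholder group, port and sizes), receiving such a feed looks roughly like this:

    /* Sketch: join the multicast group, receive one packet, inflate the batch
       of 1..5 market messages with zlib. Group, port and sizes are placeholders. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <zlib.h>

    int receive_batch(unsigned char *out, unsigned long out_cap) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(12345);                        /* placeholder port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        struct ip_mreq mreq = {0};
        mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");  /* placeholder group */
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        unsigned char pkt[65536];
        ssize_t n = recvfrom(fd, pkt, sizeof(pkt), 0, NULL, NULL);
        if (n <= 0) return -1;

        uLongf out_len = out_cap;                /* one packet = one zlib batch */
        if (uncompress(out, &out_len, pkt, (uLong)n) != Z_OK) return -1;
        return (int)out_len;                     /* decompressed bytes */
    }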
You could in principle accelerate the most CPU-intensive parts of web servers with smart NICs: gzip, TLS, JSON serdes, HTML templates. There are also accelerators for databases, leaving just the business logic to be executed on the CPU.
> the main insight was that rather than wait for market signals to then decide what to do, you can precalculate your responses up to and including the actual message to be sent to the exchange
Ah those nasty market opens in the morning and trying to get a good spot in the queue
Why do you need an OS for that kind of project? If you're implementing network drivers and avoiding the scheduler, you could just run your code on bare metal.
Because it is convenient to be able to SSH to your trading machine and run standard Unix applications for deployments, normal start-of-day/end-of-day processing, diagnostics, profiling, etc. The overhead during normal operation can be significantly reduced.
Appreciate you describing this as a PoC because in reality it's impossible to do <5us tick to trade in software including risk, deal booking, drop copy and a tonne of other components that go into a real-life trading system.
In fact, you can't guarantee 5us for anything, at least not on common operating systems. You would have to run your code with the interrupt flag cleared to prevent any IPIs or hrticks from getting in the way. But that would be opening a scary can of worms.
> Appreciate you describing this as a PoC because in reality it's impossible to do <5us tick to trade in software including risk, deal booking, (...)
Absolutely not true. As I explained, risk is calculated concurrently, compiled into a decision tree and then inserted into the path in the form of "if X happens do Y".
The fact that it was a PoC has nothing to do with it; in fact, it showed that yes, it is possible. That's the entire point of a PoC.
Yes. You really don't want your real-time threads to have anything to do with the OS after the startup sequence. A significant part of the project was learning about the various ways a piece of code can be preempted.
Once you get it done, it really is all nice and predictable. And also a fire hazard, because you have disabled things like SMIs, which are used by the CPU to figure out if it is overheating...
Every respectable HFT shop does that. What you can't do is achieve <5us execution latency at the 100th percentile. You'd have to disable LAPIC interrupts for that, which I doubt they did.
not gonna lie, i have to literally look up the meanings of half the words in your post lol, but you must have been in my position sometime. What did you have to learn in order to build this
Each to their own, but if you read and understand the comment above, they're describing a dedicated OS for the task .. so think about what you'd choose to write a small, task-dedicated OS with.
Simple C is most likely, ASM is possible, a language such as OCaml generating C to hook into the low-level buffers would be intriguing ... the list is long and largely determined by the experience and preferences of whoever tackles it.
The major features for performance are to allocate and manage all memory from the start .. determine your performance thresholds and put everything required together at cold start so as to avoid any thrashing at runtime.
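On Linux the "allocate and manage all memory from the start" part boils down to something like this (arena size and flags are arbitrary here, just a sketch):

    /* Sketch: grab, lock and pre-fault all memory at cold start so that no
       page fault or swap activity can happen once the hot path is running. */
    #include <string.h>
    #include <sys/mman.h>

    #define ARENA_BYTES (1UL << 30)              /* 1 GiB, arbitrary */

    static unsigned char *arena;

    int cold_start_memory(void) {
        /* Lock current and future mappings into RAM. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            return -1;

        /* One big arena, carved up later by the application's own allocator.
           Add MAP_HUGETLB here if huge pages have been reserved. */
        arena = mmap(NULL, ARENA_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED)
            return -1;

        /* Touch every page now so all faults happen before trading starts. */
        memset(arena, 0, ARENA_BYTES);
        return 0;
    }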
I used a combination of Common Lisp (SBCL), ANSI C and assembly (not much assembly, though, only very small pieces that I had trouble emitting in other ways).
The main application was in Lisp; it would start up and set up the environment (using some low-level code written in C/assembly).
But everything on the path from market signal to order would be highly optimised native machine code; some of that code would be written in C and some of it would be compiled at runtime with Lisp.
Parsing the incoming multicast feed from the exchange was implemented with Lisp emitting super efficient machine code (almost no function calls) based on a bunch of XML files describing the messages. I shamelessly stole the idea from the book Practical Common Lisp (really good if you want to get into Lisp).
Things like business rules would be compiled into decision trees using the current market situation, and the decision trees would be reorganised, optimised, compiled into machine code and inserted into the processing path. Most of the time a very large decision tree with hundreds of complex decisions (for example, taking into account market volatility) could be distilled down to just a couple of branch instructions. This decision tree compilation would happen after every market signal, up to ten thousand times a second (the exchange had a basic tick of 1/10000 of a second, giving me a guaranteed 100us before the next market signal).
Same with the actual algorithms -- I wrote a small DSL for the traders and this DSL would be compiled with Lisp to machine code and inserted into the processing path.
Some parts of the framework would be built with Lisp, too. It was easier for me to write a DSL and then compile it to machine code than to write it in C.
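I can't reproduce the Lisp side here, but the "inserted into the processing path" part is, at its core, just an atomic swap of a freshly generated function between ticks. A hedged C sketch (the names and the signal struct are made up):

    /* Sketch: the hot path always calls whatever decision function is currently
       installed; a slower path rebuilds/compiles a new one after each tick and
       publishes it with a single atomic store. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct { uint64_t instrument; int64_t price, qty; } signal_t;
    typedef void (*decision_fn)(const signal_t *);

    static void do_nothing(const signal_t *s) { (void)s; }

    static _Atomic(decision_fn) current_decision = do_nothing;

    /* Hot path: runs on the pinned core for every market signal. */
    static inline void on_market_signal(const signal_t *s) {
        decision_fn fn = atomic_load_explicit(&current_decision,
                                              memory_order_acquire);
        fn(s);                  /* ideally just a couple of branches + send order */
    }

    /* Slow path: after each tick, compile the new decision tree (generated
       machine code, a generated C function, whatever) and swap it in so it is
       live for the next market signal. */
    static void install_decision(decision_fn freshly_compiled) {
        atomic_store_explicit(&current_decision, freshly_compiled,
                              memory_order_release);
    }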
If you want to take away one principle from all this, it is to look at every instruction, and especially every branch, between receiving the signal and emitting the order, and try to figure out whether you can eliminate it; if you can't, try to find a way to do it ahead of time, even if doing it ahead of time requires a lot more effort.
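As a toy illustration of "do it ahead of time": the outgoing order can already exist as a finished wire image, so the trigger path only patches a couple of fields before handing the buffer to the NIC. Offsets and sizes here are invented.

    /* Sketch: the order message is fully built during the quiet window between
       ticks; when the trigger fires, only a few fields are patched and the
       buffer is handed to the NIC as-is. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint8_t bytes[128];       /* pre-built wire image of the order          */
        size_t  len;
        size_t  price_off;        /* offsets of the fields patched at fire time */
        size_t  qty_off;
        size_t  seq_off;
    } prebuilt_order_t;

    /* Hot path: a handful of stores, then send. */
    static inline const uint8_t *fire(prebuilt_order_t *o, int64_t px,
                                      int64_t qty, uint32_t seq, size_t *len) {
        memcpy(o->bytes + o->price_off, &px,  sizeof px);
        memcpy(o->bytes + o->qty_off,   &qty, sizeof qty);
        memcpy(o->bytes + o->seq_off,   &seq, sizeof seq);
        *len = o->len;
        return o->bytes;          /* caller DMAs/sends this buffer unchanged */
    }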
I too made a real-time, hardware-level trading system back in the day, on the back of building a multi-channel seismic acquisition system with a custom real-time OS talking to a bunch of DSP cards, each of which sampled a trailing flotation cable with multiple microphones, with the entire grid going toward building up a profile of the seafloor and the layers underneath, along with bobbing boat/cable cancellation, etc.
Very similar architectures in many ways - with rolling DSP filters in place of trading response algorithms .. etc.
Lisp -> ASM or Lisp -> C; either way, code generation from a higher-level language was a good way to get the heavy lifting done.
what would you do if you wanna run all this on a web server instead? sorry if that's the wrong question. how does your system interact with, say, android or ios clients, or a webapp with a ui?
If you have a project that you want to see done, you will be more likely to succeed using some more traditional architecture.
What I described is chasing latency at all costs. The costs are hardware costs, maintainability costs, development costs and inefficiency (yes, a lot more CPU is used than what is needed just to run the critical path fast). This is a very extreme situation and it would be very unlikely to be a good tradeoff for your application.
If you have a lot of clients connecting there are different tradeoffs to think about and different possible architectures to evaluate, but I can't help you without knowing what your problem is.
The low-latency core communicates with a more normal app via messages, over a socket or through shared memory or whatever. So you have normal UI code where a user can enter an instruction, and then instead of writing to a database or making an HTTP request, it sends a message to the low-latency process. That should respond to acknowledge it, and it will send more messages later if it makes any trades etc.
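A minimal sketch of that boundary (struct layout, socket path and ack convention are all made up):

    /* Sketch: the UI/gateway process sends a small fixed-size command to the
       low-latency process over a Unix domain socket and waits for an ack.
       The low-latency side services this from a non-critical thread. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    typedef struct {
        uint32_t type;            /* e.g. 1 = enable strategy, 2 = cancel all */
        uint32_t strategy_id;
        int64_t  param;
    } command_t;

    int send_command(const command_t *cmd) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        struct sockaddr_un addr = {0};
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "/tmp/trader.sock", sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            close(fd);
            return -1;
        }

        write(fd, cmd, sizeof *cmd);

        uint32_t ack = 0;
        read(fd, &ack, sizeof ack);          /* 0 = accepted, in this sketch */
        close(fd);
        return ack == 0 ? 0 : -1;
    }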
> implemented with Lisp emitting super efficient machine code
I'm really intrigued by your usage of Lisp. It's on my bucket list to learn and your post is very inspiring.
When you say Lisp was emitting machine code; are you referring to the machine code the Lisp compiler emitted for your Lisp application or was your Lisp application acting as a sort of JIT compiler and actually emitting machine code?
Maybe you can reference the section in Practical Common Lisp that introduces the idea?
Machine code isn't some kind of magical idea. It is just bytes and you can craft the bytes any way you want.
In my case I wrote my own compiler that generated machine code bytes based on my needs.
The compiler wasn't at all complicated. It was not a general compiler that would have to be able to deal with everything you could throw at it. It only supported one model of CPU. It did not require any optimisation mechanisms because it was assumed that the instructions it received were already optimised by the high-level logic that generated them. It only supported a very small number of instructions and mechanisms; it was mostly manipulating registers and doing branches and jumps. It mostly produced small pieces of code to be inserted into a larger C program, so that this C program did not have to make decisions at runtime about what to do based on something else.
You can also use SBCL to emit any VOPs you want to execute as part of your Lisp program. So if you want a piece of crafted code to run as part of your Lisp program (for example to do some complicated operation that would be inefficient in Lisp), you can do that too. I actually used VOPs to do some low-level stuff, but that was another matter; most of the time I was just outputting a buffer of crafted native machine code, which would then be sent to the receiving thread (core) to insert and execute for the next market signal.
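To make the "just bytes" point concrete, here is a minimal stand-alone C version of the mechanism (not the original code); on x86-64 the bytes below encode mov rax, 42; ret:

    /* Sketch: craft machine code bytes at runtime, put them in an executable
       buffer and call them. Production systems avoid W+X pages (map writable,
       then mprotect() to executable); this is just the bare idea. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        static const uint8_t code[] = {
            0x48, 0xC7, 0xC0, 0x2A, 0x00, 0x00, 0x00,   /* mov rax, 0x2a */
            0xC3                                         /* ret           */
        };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        memcpy(buf, code, sizeof code);
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

        /* Casting a data pointer to a function pointer is not strictly ISO C,
           but it is standard practice on POSIX platforms. */
        uint64_t (*fn)(void) = (uint64_t (*)(void))buf;
        printf("%llu\n", (unsigned long long)fn());      /* prints 42 */
        return 0;
    }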
How did you inject the machine code into a running program on anything near a modern architecture? Unless you had your own OS, wouldn't code segments be RO and data segments not executable?
I don't know enough about any processor created in the last 30 years to know if running bare metal without a commercial OS would allow you to not have those constraints.
My team did it in Rust. Others in our company have done it in C++, making heavy use of available wizardry. It's widespread in the industry to do it in Java, albeit a strange subset of Java where you avoid most of the features.
I'd be amazed if anyone did it entirely in C. The productivity is just too low.
The question really is "why?" Most systems have relatively few hotspots or can be reorganised to have relatively few hotspots. You don't need to write everything in a low level language to just have those few things run fast, unless you are also facing other constraints like low memory.
I have better uses of my time than write megabytes of pointer manipulation in C. That's why I prefer high level languages that can easily coexist with C for those few parts that really benefit.
A trading system doesn't have a single hot spot that can be easily optimized. The critical paths can be hundreds of thousands of lines of code.
There are a lot of separate processes and subsystems that do not need to be latency-optimal (or latency-sensitive at all), and that are definitely amenable to being written in other languages. Of course, when you already have a bunch of C++ programs, things tend to end up being written in C++ even if it is not strictly necessary.
I wouldn't bundle C with Golang and Java with Node. Golang and Java are roughly a tier on their own. Java is definitely a thing in this space[1], although it's a fairly unidiomatic style of Java that reduces allocations and puts a lot of emphasis on consistent low latency.
[1] e.g. https://marketswiki.com/wiki/TRADExpress ; though they've been absorbed into Nasdaq now and my visibility into CFT ended with that, no idea how much of their software is still running
thank you for sharing, it has a few code snippets here and there. Would you be aware of a course or something (Google didn't help) that teaches how to build one from the ground up?