LMAX Disruptor – High Performance Inter-Thread Messaging Library (lmax-exchange.github.io)
126 points by dgudkov on Nov 18, 2023 | 85 comments


Every time a new generation plays with the LMAX disruptor, it's time to remind them that the modes with multiple producers/consumers can have really bad tail latency if your application's threading is not designed in the intended way.

Disruptor and most other data structures that come from trading are designed to run with thread-per-core systems. This means systems where there will be no preemption during a critical section. They can get away with a lot of shenanigans on the concurrency model due to this. If you are using these data structures and have a thread-per-request model, you're probably going to have a bad time.


In general, what are the advantages of a thread-per-request model? Better load balancing between cores?


Thread per request can never have better load balancing between cores than a well designed, custom solution.

You are essentially asking the operating system to do the scheduling for you. But the OS will never be able to do it perfectly as it has no knowledge of what your application is doing.

The main advantage of OS scheduling is that you get pretty good results without having to think about it at all. Pretty good, but never perfect.


> You are essentially asking the operating system to do the scheduling for you. But the OS will never be able to do it perfectly as it has no knowledge of what your application is doing.

From LMAX presentations, it looks like they want you to split your application into tasks [1], define a graph of task dependencies, have each core process a particular kind of task, and have task processors communicate with their producers via a ring buffer.

In particular, the allocation of tasks is static. The use of a ring buffer means that there is very little contention, and task processing is very efficient, but some cores might end up underutilized.
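For a concrete picture, this is roughly what that static wiring looks like with the Disruptor's Java DSL. A minimal sketch: OrderEvent and the three handlers are invented placeholders, and any actual core pinning would have to be layered in via the supplied ThreadFactory or native calls.

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class PipelineSketch {
        // Hypothetical event carried through the ring buffer.
        static final class OrderEvent {
            long orderId;
        }

        public static void main(String[] args) {
            Disruptor<OrderEvent> disruptor = new Disruptor<>(
                    OrderEvent::new,               // entries pre-allocated once, no GC on the hot path
                    1024,                          // ring buffer size (power of two)
                    DaemonThreadFactory.INSTANCE); // one thread per handler; pinning, if any, happens here

            EventHandler<OrderEvent> journal   = (event, seq, endOfBatch) -> { /* write to journal */ };
            EventHandler<OrderEvent> replicate = (event, seq, endOfBatch) -> { /* send to replica */ };
            EventHandler<OrderEvent> business  = (event, seq, endOfBatch) -> { /* run business logic */ };

            // Static dependency graph: journal and replicate run in parallel,
            // business logic only sees events both of them have finished with.
            disruptor.handleEventsWith(journal, replicate).then(business);
            disruptor.start();

            disruptor.getRingBuffer().publishEvent((event, seq) -> event.orderId = seq);
        }
    }

The journal/replicate/business split mirrors the pipeline described in the LMAX write-ups; the point is that the graph is fixed at startup rather than decided by the OS scheduler.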

On the other hand, if you have a thread per request, and allow them to migrate between cores, idle cores can steal tasks from busy ones. So in theory you could get better utilization, but task processing is less efficient since you need to share more data between cores.

("threads" don't need to be OS threads, they can be green threads)

That said, I am not sure if GP meant this by thread-per-request, or "legacy" applications that use a thread pool, or something else.

[1]: https://www.slideshare.net/trishagee/introduction-to-the-dis...


I know all about LMAX architecture, at least all that has been published (see my other comments for this submission).

Static allocation is a special case of scheduling. You decide which parts of the process run on which core -- the scheduling in this case is done at design or configuration time.

> On the other hand, if you have a thread per request, and allow them to migrate between cores, idle cores can steal tasks from busy ones. So in theory you could get better utilization,

Migrating your tasks between cores is nothing you can't design into your application. For example, in a typical event-driven architecture where you have worker threads each running on a separate core, there would be something to decide where a task is queued, and usually that logic takes into account how busy a particular worker thread is. The operating system does nothing in this case; what it sees is a number of threads, each running on its own core, that (hopefully) never need to be preempted.
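A minimal sketch of that kind of dispatcher in plain Java (queue sizes and names are made up, and the actual core pinning isn't shown, since plain Java would need taskset, JNI or a library for that):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One queue per pinned worker thread, and a dispatcher that picks the
    // least-loaded queue: the application, not the OS, does the scheduling.
    public class LeastBusyDispatcher {
        private final BlockingQueue<Runnable>[] queues;

        @SuppressWarnings("unchecked")
        public LeastBusyDispatcher(int workers) {
            queues = new BlockingQueue[workers];
            for (int i = 0; i < workers; i++) {
                queues[i] = new ArrayBlockingQueue<>(1024);
                BlockingQueue<Runnable> q = queues[i];
                Thread worker = new Thread(() -> {
                    try {
                        while (true) q.take().run();   // each worker drains only its own queue
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, "worker-" + i);
                worker.start();
            }
        }

        // Application-level scheduling decision: route to the shortest queue.
        public void submit(Runnable task) {
            BlockingQueue<Runnable> best = queues[0];
            for (BlockingQueue<Runnable> q : queues) {
                if (q.size() < best.size()) best = q;
            }
            if (!best.offer(task)) throw new IllegalStateException("queue full");
        }
    }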


> For example, in a typical event-driven architecture where you have worker threads each running on a separate core, there would be something to decide where a task is queued, and usually that logic takes into account how busy a particular worker thread is.

Wouldn't that work well only if the time taken by each task is predictable? For example, you mention working on a trading system. But in a trading system you want to run the same branchless code path regardless of the kind of incoming event and whether or not you end up sending an order after running the trading logic. So the individual "task" is very predictable.

On the other hand, think of a task like "return all the comments for a certain page". The time taken by an individual task is unpredictable, proportional to the number of comments. So you'll regularly get one core enqueued with a bunch of tasks for pages with no comments; it finishes quickly and then sits idle.

With work stealing, after finishing, that core would get a chance at "stealing" tasks from other threads' queues.

(of course, the architecture I am describing would be awful for a trading system)
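Roughly the stealing half of that idea, sketched in Java (this is the concept behind java.util.concurrent.ForkJoinPool, not its actual implementation; all names here are invented):

    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.ThreadLocalRandom;

    // Toy worker loop for a work-stealing scheduler: prefer my own deque,
    // otherwise steal from the tail of a random victim's deque.
    public class StealingWorker implements Runnable {
        private final int me;
        private final ConcurrentLinkedDeque<Runnable>[] deques; // one deque per worker

        public StealingWorker(int me, ConcurrentLinkedDeque<Runnable>[] deques) {
            this.me = me;
            this.deques = deques;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                Runnable task = deques[me].pollFirst();      // LIFO from my own deque
                if (task == null) {
                    int victim = ThreadLocalRandom.current().nextInt(deques.length);
                    task = deques[victim].pollLast();        // FIFO steal from a random victim
                }
                if (task != null) task.run();
                else Thread.onSpinWait();                    // nothing to run or steal right now
            }
        }
    }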

> The operating system does nothing in this case; what it sees is a number of threads, each running on its own core, that (hopefully) never need to be preempted.

Btw, I agree that pinning OS threads to each core and then layering something of your own on top of it is going to be faster. It is just that you can layer on top a green thread system (like Go), and get something thread-for-request -like.


I feel like you are stuck on the idea of “this should be simple for me”, which, of course, favors the OS threads solution. If your point is that LMAX isn't just a drop-in substitute for OS threads, then we agree. If your point is that OS threads produce better results than a thought-out LMAX solution, then we do not. Most people and organizations probably don’t have the need or the skill for LMAX anyway.


Easy to implement.


But does anyone run an OS thread per request unironically? I thought that nearly every request-response server implementation would use a thread pool. The best, like Erlang, can give you the feeling of arbitrarily many extremely cheap threads, while also running on a thread pool.


As far as comparisons to thread-per-core go, thread per request applies whether it's an OS thread or a green thread or a Rust async function compiled into a state machine. Anything that multiplexes per-request contexts onto a smaller number of cores (or OS threads) has the same trade-offs; the difference is more about where it sits on the easy-vs-optimized spectrum. Thread-per-core with fixed workloads behaves differently than all of those.

Here's an example difference: in thread-per-request, any global state can be accessed from "anywhere", and thus you end up with locks, reference counts, GC, and what not. In thread-per-core, global state is sharded across cores and never accessed "from the outside", and thus needs no locks/atomics (beyond the messaging primitive).
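A tiny sketch of what "never accessed from the outside" can look like in Java (all names invented): the shard's map is plain and unsynchronized, and the only shared structure is the inbox that other threads post commands to.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.function.Consumer;

    // Per-shard ownership: the map is touched only by its owning thread,
    // so it needs no locks; other threads hand it work via the inbox instead.
    public class Shard implements Runnable {
        private final Map<String, Long> state = new HashMap<>();           // owned, unsynchronized
        private final ConcurrentLinkedQueue<Consumer<Map<String, Long>>> inbox =
                new ConcurrentLinkedQueue<>();                              // the only shared structure

        public void send(Consumer<Map<String, Long>> command) {
            inbox.add(command);                                             // called from other threads
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                Consumer<Map<String, Long>> cmd = inbox.poll();
                if (cmd != null) cmd.accept(state);                         // executed only on this thread
                else Thread.onSpinWait();
            }
        }
    }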


> In thread-per-core, global state is sharded across cores and never accessed "from the outside"

...or possibly it's accessed from everywhere, and locks are needed again. What you're describing is kind of an ideal application architecture, not a hard rule about thread pinning.


What you need to share, you may adorn with locks, queues, etc. But ideally you share very little.

"Possibly accessed from anywhere" is a bad design in general, and unacceptable in realtime processing, where you need to know access patterns exactly.


As a devil's advocate argument, if you're doing serverside rendering, and basically getting 1 request per visit, sure there's overhead, but even a landslide HN death hug is only a handful of requests per second. A Raspberry Pi could feasibly serve that traffic spawning one thread per request.

... not that I think anyone is doing this outside of maybe some hobbyist building their own HTTP server for fun.


> does anyone run an OS thread per request unironically?

Of course they do. There are loads of appropriate applications. Heck, people still run CGI programs unironically.


And if your use case allows it, it is a great model. Easy to set up, easy to debug, easy to run.

Also, it's why I love the new virtual threads in Java. Will they work as well as hoped? No idea. Probably not for all use cases. But the direction of saying "you know what, threads are great to program and debug compared to the alternatives, so let's find a way to make their performance better instead of putting up with async" is so refreshing.
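For instance, with Java 21's virtual threads the thread-per-request style looks like this (handle() is just a stand-in for real request processing):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualThreadServerSketch {
        public static void main(String[] args) {
            // One cheap virtual thread per request; blocking code stays simple to write and debug.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int request = i;
                    executor.submit(() -> handle(request));
                }
            } // close() waits for the submitted tasks to finish
        }

        static void handle(int request) {
            // blocking I/O would go here; the carrier thread is released while it waits
        }
    }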


And also time to remind that generation, raised on other meandering, unfruitful roads, that it is coded in industrial-strength Java.


I am working on a C version of the disruptor ring buffer. It is very simple and I still need to verify it, so it's probably not ready for others, but it might be interesting. Aligning by 128 bytes has dropped latency and stopped false sharing.
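(For readers following along in Java rather than C: this is essentially the padding trick the Disruptor's own Sequence class uses, sketched below. Exact field layout is up to the JVM, so treat it as best-effort; the JDK-internal @Contended annotation is the sturdier route, but it needs a JVM flag for user code.)

    // Surround the hot counter with unused longs so that two frequently-updated
    // counters never land on the same cache line.
    class PaddedCounter {
        long p1, p2, p3, p4, p5, p6, p7;        // padding before
        volatile long value;                    // the hot, frequently-written field
        long p9, p10, p11, p12, p13, p14, p15;  // padding after
    }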

I have gotten latencies to 50 nanoseconds and up.

disruptor-multi.c (SPMC) and disruptor-multi-producer.c (MPSC): https://GitHub.com/samsquire/assembly

I am trying to work out how to support multiple producers and multiple readers (MPMC) at low latency; that's what I'm literally working on today.

The MPSC and SPMC seem to be working at low latencies.

I am hoping to apply the actor model to the ring buffer for communication.

I'm also working on a nonblocking, lock-free barrier. This has latencies from 42 nanoseconds and up.


Do you think it’s possible to obtain this performance with Rust?

I’ve been down the path you’re on a few times and I love the pursuit. Have built my own over the years about 4 times.

Hardware was much slower in those days, so my lower barrier was 650ns. Things got appreciably worse as a function of the number of producers, I found.

Some of my most sleepless nights. The funnest nights.


How many producers and how many consumers was that 650 nanoseconds with?

I have pinned threads to even-numbered cores with pthread_setaffinity_np and that seems to have evened out the MPMC ring buffer; with 2 producers and 2 consumers it gets to under 400 nanoseconds, and usually stays under 1000 nanoseconds. I think hyperthreading causes problems.

EDIT: Would you like to chat about this? I would like to! My email is in my profile.


std::collections::VecDeque is implemented as a growable ring buffer, so you might like to start there

https://doc.rust-lang.org/std/collections/vec_deque/index.ht...


I had implemented more-or-less this same concurrency scheme for an IPS/DDoS prevention box ~10 years ago, running on Tilera architecture. It was fast (batching + separating read & write heads really does help a ton)... but not as fast as Tilera's built-in intercore fabric. It had some limitations but was basically a register store/load to access and only like 1 or 2 cycles intercore latency.

(Aside, generic atomic operation pro-tip: don't if you can help it. Load + local modify + store is always faster than atomic modify, if you can make the memory ordering work out. And if you can't do away with an atomic modify, batch your updates locally to issue fewer of them at least.)
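A sketch of that batching tip in Java, assuming a shared AtomicLong counter and an arbitrary batch size of 1024:

    import java.util.concurrent.atomic.AtomicLong;

    public class BatchedCounter {
        static final AtomicLong shared = new AtomicLong(); // contended global counter

        // Accumulate locally, pay for one atomic RMW per batch instead of per event.
        static void countEvents(int events) {
            long local = 0;
            for (int i = 0; i < events; i++) {
                local++;                          // plain increment, no coherence traffic
                if (local == 1024) {
                    shared.addAndGet(local);      // one atomic op per 1024 events
                    local = 0;
                }
            }
            if (local != 0) shared.addAndGet(local);
        }
    }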


Tilera! There's a name I'd not heard in years. They leaned very heavily on providing a very large number of cores with small local storage and having to do all the intercore work yourself.


> batch your updates locally to issue fewer of them at least

I don’t know why I never thought of this, brilliant!


Related. Others?

Disruptor: High performance alternative to bounded queues - https://news.ycombinator.com/item?id=36073710 - May 2023 (1 comment)

LMAX Disruptor: High performance method for exchanging data between threads - https://news.ycombinator.com/item?id=30778042 - March 2022 (1 comment)

The LMAX Architecture - https://news.ycombinator.com/item?id=22369438 - Feb 2020 (1 comment)

You could have invented the LMAX Disruptor, if only you were limited enough - https://news.ycombinator.com/item?id=17817254 - Aug 2018 (29 comments)

Disruptor: High performance alternative to bounded queues (2011) [pdf] - https://news.ycombinator.com/item?id=12054503 - July 2016 (27 comments)

The LMAX Architecture (2011) - https://news.ycombinator.com/item?id=9753044 - June 2015 (4 comments)

LMAX Disruptor: High Performance Inter-Thread Messaging Library - https://news.ycombinator.com/item?id=8064846 - July 2014 (2 comments)

Serious high-performance and lock-free algorithms (by LMAX devs) - https://news.ycombinator.com/item?id=4022977 - May 2012 (17 comments)

The LMAX Architecture - 100K TPS at Less than 1ms Latency - https://news.ycombinator.com/item?id=3173993 - Oct 2011 (53 comments)


Semi-related is the Aeron project: https://github.com/real-logic/aeron


I've actually seen this particular library used (and misused and abused). People tried to offload I/O-heavy and data-heavy tasks onto it and it was a spectacular failure, with multiple threads getting blocked and people having to frequently adjust its buffer size and batch size.

One of those things to remember is that Java I/O layering (stuff like JPA) is really terrible. People in the Java world I know tend to prefer the abstractions, while people in the trading world try to write GC-less code (unboxed primitives and byte arrays).

Unless you have verified your end-to-end I/O to be really fast (possibly off-heap), you're only pushing a few bytes here and there, and your latencies are all in check, this library is not for you. Do all that work first, then use this library.


LMAX - How to Do 100K TPS at Less than 1ms Latency: Video

https://www.infoq.com/presentations/LMAX/


There's a whole new generation of engineers for whom this is new news. Enjoy!


I built trading systems for LMAX exchanges. Their technology seems quite far from the state of the art to me.

I didn't know they even claimed to attempt being the fastest exchange in the world. They're very far from being so and it's quite clear that there are architectural decisions in that platform that would prevent that.


What kinds of issues did you see? Do you think there are better alternatives to the disruptor?


I don't particularly know anything about this disruptor, and it being in Java kinda biases me towards dismissing it out of hand (no one does serious systems programming in Java).

From a quick reading it's just a standard spmc queue with some mpmc capabilities. Queues (preferably lock-free and bounded) are a basic component of any low-latency distributed software system. They seem to get the basics right; nothing too outstanding, and some decisions are quite suboptimal.

Myself I use spsc task queues for inter-thread communication (because in that scenario you know who you're sending tasks to, and you can easily just attach multiple spsc queues for pseudo-mpsc capabilities), and mpmc message queues for inter-process communication (because that scenario is more of a message bus, and you don't know who's talking to who).

I have built these kinds of things many times together with bespoke threading and scheduling models, as have others at all of the trading shops I've seen, so I'd say it's a pretty standard thing in the industry.

Open-source frameworks of interest would be Seastar or DPDK.

However, while a good threading model helps, it's far from sufficient to be the highest performance trading exchange. You also need to think hard about networking, be it the protocols, the software, the hardware and the topology. For example key factors in trading are deterministically publishing data to all participants at the same time, ensuring private information is not published before its public equivalent, making sure that whoever sent their packet first gets processed first. Even something as simple as the kind of switches you use has a huge impact.


It is a broadcast queue. Consumers all receive the same data, so they don't contend to pop elements. And producers never wait for consumers. Slow consumers have to deal with missing packets.

At the end of the day, it is a specialised ring buffer that happens to be useful for many use cases.


That just sounds like a normal spmc queue as I said above.

Of course they receive the same elements, or it wouldn't be multi-consumer.

Of course producers don't wait for consumers; they don't even need to be aware of how many there are or where they are. But even in a system where you'd know (e.g. publishing data to a bunch of TCP connections), it would be a very bad idea to stall production -- handling back pressure should be application-specific.

You have the same elements in UDP multicast which is the network equivalent, and incidentally the preferred technology for communication in the trading industry, particularly for market data disseminated by exchanges.


Typically in an MC queue, a consumer "consumes" an element and it won't be available to other consumers. For example, a job queue. Also, typically queues either grow unbounded or pushes fail.

So no, I wouldn't say that the disruptor is a normal SPMC or MPMC queue, as its semantics are different.

But yes, it has pretty much the same characteristics as UDP.


I see what you mean. I would never use a pattern where you don't broadcast to all consumers myself.

And while you can fail to produce when a consumer is too slow, I'd argue that sort of pattern only makes sense for a single consumer as well (the network equivalent would be TCP, which by definition can only be unicast).


Well, TCP is stream-oriented and unicast, but it doesn't take much to conceive of a message-oriented protocol that does anycasting.

As a real-world example of an anycasting queue that I'm sure you have used, consider the queue behind accept(2) or pthread_cond_wait.


Where should one look/read about for state of the art?


In terms of trading exchanges, I'd say the ones with the best deterministic low-latency appear to be the Deutsche Boerse T7 ones, in particular Eurex.

This has led participants wishing to compete there on speed to use some pretty advanced custom ASICs.


Martin Fowler has a lovely deep-dive blog post on this architecture:

https://martinfowler.com/articles/lmax.html

It includes lots of diagrams and citations.

One term I always loved re: LMAX is “mechanical sympathy.” Covered in this section:

https://martinfowler.com/articles/lmax.html#QueuesAndTheirLa...


I came across this a few years back when numbly watching the dependencies scroll by during some Java install.

“Disruptor is a fairly presumptuous name for a package” I thought. So I looked into it. It fed musings and thought experiments for many walks to and from the T. I love the balance between simplicity and subtlety in the design.

If I recall, it was a dependency of log4j, which makes sense for high-volume logging.


I love this pattern. There are many problems that fit it quite well once you start thinking in these terms - Intentionally delaying execution over (brief amounts of) time in order to create batching opportunities which leverage the physical hardware's unique quirks.

Any domain with synchronous/serializable semantics can be modeled as a single writer with an MPSC queue in front of it. Things like game worlds, business decision systems, database engines, etc. fit the mold fairly well.

The busy-spin strategy can be viewed as a downside, but I can't ignore the latency advantages. In my experiments where I have some live "analog" input like a mouse, the busy wait strat feels 100% transparent. I've tested it for hours without it breaking into the millisecond range (on windows 10!). For gaming/real-time UI cases, you either want this or yield. Sleep strategies are fine if you can tolerate jitter in the millisecond range.
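For reference, this is where that trade-off gets picked in the Java Disruptor. A minimal sketch, with InputEvent standing in for whatever the application actually passes:

    import com.lmax.disruptor.BusySpinWaitStrategy;
    import com.lmax.disruptor.SleepingWaitStrategy;
    import com.lmax.disruptor.YieldingWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class WaitStrategySketch {
        static final class InputEvent { long payload; }   // hypothetical event type

        public static void main(String[] args) {
            Disruptor<InputEvent> disruptor = new Disruptor<>(
                    InputEvent::new,
                    4096,
                    DaemonThreadFactory.INSTANCE,
                    ProducerType.MULTI,              // the MPSC front described above
                    new BusySpinWaitStrategy());     // lowest latency, pins a core at 100%
            // Alternatives, trading CPU for jitter:
            //   new YieldingWaitStrategy()  -> near busy-spin latency, yields to the scheduler
            //   new SleepingWaitStrategy()  -> low CPU use, but millisecond-level wake-up jitter
            disruptor.handleEventsWith((event, seq, endOfBatch) -> { /* apply to the single writer */ });
            disruptor.start();
        }
    }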


Beware: the advertised latency is probably measured with the busy-spin wait strategy, which uses a lot of CPU.

Great library which makes processing concurrent streams incredibly easy.


I never understand the reason for open-sourcing a trading system, if it works.


This is a matching engine. It's the platform where traders trade.


stupid question: how to build a trading system? anyone got a starter guide, resources?


I built a PoC of a 5us trading system (guaranteed 5us response in every situation) for a brokerage house a long time ago, around the time of the LMAX Disruptor. It was a one-man job and I had to start with nothing (they had no knowledge at all). Fun project and I learned a lot.

* full kernel bypass (I even implemented a driver for the networking hardware)

* everything that could disrupt the application disabled (like SMI interrupts, etc.), memory mapped with huge pages to prevent TLB lookup failures, etc.

* application consists of threads pinned to specified cores

* each thread on the path from market data to sending the order never calls the operating system for anything; when not processing anything it busy-spins

* all memory preallocated carefully to have it pinned to the local core

* data flows from the networking hardware into one core and then passes through cores using the disruptor, each core doing further processing and publishing signals to the next core

* the main insight was that rather than wait for market signals to then decide what to do, you can precalculate your responses up to and including the actual message to be sent to the exchange.
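A heavily simplified sketch of that last point (every name here is invented): rebuild a table of fully serialized order messages between signals, so the hot path is reduced to one lookup and one send.

    import java.util.concurrent.atomic.AtomicReference;

    // Off the critical path, rebuild a table of ready-to-send order messages;
    // on the critical path, a market signal only picks one and hands it to the wire.
    public class PrecomputedResponses {
        static final int PRICE_LEVELS = 1024;
        static final AtomicReference<byte[][]> table =
                new AtomicReference<>(new byte[PRICE_LEVELS][]);

        // Slow path: runs between market signals (e.g. once per tick).
        static void rebuild(double volatility, double position) {
            byte[][] next = new byte[PRICE_LEVELS][];
            for (int level = 0; level < PRICE_LEVELS; level++) {
                next[level] = shouldTrade(level, volatility, position)
                        ? encodeOrder(level)   // the exchange message, fully serialized in advance
                        : null;                // "do nothing" for this price level
            }
            table.set(next);
        }

        // Hot path: no decisions left beyond one lookup and one branch.
        static void onMarketSignal(int priceLevel) {
            byte[] wireMessage = table.get()[priceLevel];
            if (wireMessage != null) send(wireMessage);
        }

        static boolean shouldTrade(int level, double vol, double pos) { return level % 2 == 0; } // placeholder
        static byte[] encodeOrder(int level) { return new byte[] {(byte) level}; }               // placeholder
        static void send(byte[] msg) { /* hand to the NIC / kernel-bypass stack */ }
    }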


> the main insight was that rather than wait for market signals to then decide what to do, you can precalculate your responses up to and including the actual message to be sent to the exchange.

I saw a talk about this dialed up to eleven: the entire processing occurred in a "smart NIC" instead of the CPU. The response would start getting sent even as the inbound packet was still being received. The go/no-go decision was effectively just sending the final CRC bytes correctly or deliberately incorrectly, thus invalidating the outbound packet that was already 99% sent.

Before that talk I couldn't figure out why there was a market for NICs with embedded FPGAs, CPUs and memory.

Day traders basically subsidised these things, and now they do efficient packet switching for large cloud providers.

Reminds me of how crypto-mining subsidised a lot of GPU development, and now we have 4K ray tracing and AIs thanks to that.


> The go/no-go decision was effectively just sending the final CRC bytes correctly or deliberately incorrectly, thus invalidating the outbound packet that was already 99% sent.

This trick will get you banned on some exchanges now :)

Another one, which has been public knowledge for years and is also often penalized, is to send a TCP fragment with the header of the message well in advance, "booking" a place in the queue, and then send the finishing fragment with the real order after doing all the calculations.


This raises the question of how an exchange efficiently detects, logs and takes action against these kinds of behaviours without increasing its own latency too much and (perhaps?) affecting the market.

Does it even matter if a centralised exchange increases its own latency when all market participants have to go through it? I can only think of the case when a security is listed on multiple exchanges, where the latency could mean a small arbitrage opportunity.


Exchanges rarely care about their absolute latency. The latency race is for the best place in the order entry queue. As soon as the order is queued for sequential processing by the risk checker or the matching engine, the race is over. I've seen places where you needed sub-microsecond tick-to-order latency to win the race for the front of the queue, but the actual risk check and matching took tens of milliseconds.

They do care about throughput and providing fair conditions to all of the participants, though. On busiest derivatives exchanges this means resorting to FPGAs for initial checks.

Then, every message sent to the exchange is clearly traceable. In some cases participants have dedicated physical lines. When the exchange sees an increased rate of malformed packets from a single line or from a certain participant, they just cut it off and call the contact person on the participant (trader/broker) side to demand an explanation.


Most exchanges have switched to order gateways that are either fpga or asic based.

Also every packet you send to an exchange is trivially attributed. They just kick you off if your shenanigans cause a problem. And then they tell all the other exchanges about you.


Cool! I actually wasn't aware about NICs with FPGAs on them. You learn something new every day on HN.

My solution wasn't as fast and it could never do what you describe (start sending bytes before the packet was fully received). The market signal messages were actually batched together (usually one to 5), compressed with zlib and sent as a single multicast packet.


You could in principle accelerate the most CPU-intensive parts of web servers with smart NICs: gzip, TLS, JSON serdes, HTML templates. There are also accelerators for databases, leaving just the business logic to be executed on the CPU.


> the main insight was that rather than wait for market signals to then decide what to do, you can precalculate your responses up to and including the actual message to be sent to the exchange

Ah those nasty market opens in the morning and trying to get a good spot in the queue


Why do you need an OS for that kind of project? If you're implementing network drivers and avoiding the scheduler, you could just run your code on bare metal.


Because it is convenient being able to SSH to your trading machine and run standard Unix applications for deployments, normal start-of-day/end-of-day processing, diagnostics, profiling, etc. The overhead during normal operations can be significantly reduced.


Appreciate you describing this as a PoC because in reality it's impossible to do <5us tick to trade in software including risk, deal booking, drop copy and a tonne of other components that go into a real-life trading system.

In fact, you can't guarantee 5us for anything, at least not on common operating systems. You would have to run your code with the interrupt flag cleared to prevent any IPIs or hrticks getting in the way. But that would be opening a scary can of worms.


> Appreciate you describing this as a PoC because in reality it's impossible to do <5us tick to trade in software including risk, deal booking, (...)

Absolutely not true. As I explained, risk is calculated concurrently, compiled into a decision tree and then inserted into the path in the form of "if X happens do Y".

The fact it was a PoC has nothing to do with it; in fact it showed that yes, it is possible to do so. That's the entire point of a PoC.


It is absolutely possible, I've seen multiple such systems.

> In fact, you can't guarantee 5us for anything, at least not on common operating systems.

Oh yes you can, with a screen-sized kernel cmdline, some proper configuration, both hardware and software, and a bit of luck.


> you can't guarantee 5us for anything, at least not on common operating systems

I don't doubt it, but from my reading of OP's post, it sounds like they were skipping most of the kernel and OS.


Yes. You really don't want your real-time threads to have anything to do with the OS after the startup sequence. A significant part of the project was learning about the various ways a piece of code can be preempted.

Once you get it done, it really is all really nice and predictable. And also a fire hazard because you have disabled things like SMI interrupts that are used by the CPU to figure out if it is overheating...


Every respectable HFT shop does that. What you can't do is achieve <5us execution latency at the 100th percentile. You'd have to disable LAPIC interrupts for that, which I doubt they did.


You don't do risk, deal booking, etc in between the tick and the trade. You do them afterwards.

You do drop copy completely separately, that's the whole point of it.


Not gonna lie, I literally have to look up the meanings of half the words in your post, lol, but you must have been in my position at some point. What did you have to learn in order to build this?


what language do you think would be the backbone for such a system? C/C++/Golang or something high level like node.js/Java


Each to their own, but if you read and understand the comment above, they're describing a dedicated OS for the task .. so think about what you'd choose to write a small task-dedicated OS with.

Simple C is most likely, ASM is possible, a language such as OCaml generating C to hook into the low-level buffers would be intriguing ... the list is long and largely determined by the experience and preference of whoever tackles it.

The major features for performance are to allocate and manage all memory from the start .. determine your thresholds of performance and put everything required together on cold start so as to avoid any thrashing at runtime.


You are surprisingly accurate.

I used a combination of Common Lisp (SBCL), ANSI C and assembly (not much assembly, though, only very small pieces that I had trouble emitting other ways).

The main application was in Lisp; it would start up and set up the environment (using some low-level code written in C/assembly).

Everything on the path from market signal to order would be highly optimised native machine code; some of that code was written in C and some was compiled at runtime with Lisp.

Parsing the incoming multicast feed from the exchange was implemented with Lisp emitting super efficient machine code (almost no function calls) based on a bunch of XML files describing the messages. I shamelessly stole the idea from the book Practical Common Lisp (really good if you want to get into Lisp).

Things like business rules would be compiled into decision trees using the current market situation, and the decision trees would be reorganised, optimised, compiled into machine code and inserted into the processing path. Most of the time a very large decision tree with hundreds of complex decisions (for example, taking into account market volatility) could be distilled to just a couple of branch instructions. This decision tree compilation would happen after every market signal, up to 10 thousand times a second (the exchange had a basic tick of 1/10000 of a second, giving me a guaranteed 100us before the next market signal).

Same with the actual algorithms -- I wrote a small DSL for the traders and this DSL would be compiled with Lisp to machine code and inserted into the processing path.

Some parts of the framework would be built with Lisp, too. It was easier for me to write a DSL and then compile it to machine code than to write it in C.

If there is one principle to take from all this, it is to look at every instruction, and especially every branch, between receiving the signal and emitting the order, and try to figure out whether you can eliminate it; if you can't, try to find a way to do it ahead of time, even if doing it ahead of time requires a lot more effort.


Not my first rodeo :-)

I too made a real-time hardware-level trading system back in the day, on the back of building a multi-channel seismic acquisition system with a custom real-time OS talking to a bunch of DSP cards that each sampled a trailing flotation cable with multiple microphones, the entire grid going toward building up a profile of the seafloor and the layers underneath, along with bobbing boat/cable cancellation, etc.

Very similar architectures in many ways - with rolling DSP filters in place of trading response algorithms .. etc.

Lisp -> ASM or Lisp -> C: either way, code generation from a higher-level language was a good way to get the heavy lifting done.


Cool project!

I guess when different people try to achieve extreme low latency or efficiency, the solutions start to converge on a small number of ideas.


what would you do if you wanna run all this on a web server instead? sorry if thats the wrong question. how does your system interact with say android or ios clients or a webapp with ui


If you have a project that you want to see done, you will be more likely to succeed using some more traditional architecture.

What I described is chasing latency at all costs. The costs are hardware costs, maintainability costs, development costs, and inefficiency (yes, a lot more CPU is used than is needed just to run the critical path fast). This is a very extreme situation and it would be very unlikely to be a good tradeoff for your application.

If you have a lot of clients connecting there are different tradeoffs to think about and different possible architectures to evaluate, but I can't help you without knowing what your problem is.


The low-latency core communicates with a more normal app via messages, over a socket or through shared memory or whatever. So you have normal UI code where a user can enter an instruction, and then, instead of writing to a database or making an HTTP request, it sends a message to the low-latency process. That should respond to acknowledge it, and it will send more messages later if it makes any trades etc.


> implemented with Lisp emitting super efficient machine code

I'm really intrigued by your usage of Lisp. It's on my bucket list to learn and your post is very inspiring.

When you say Lisp was emitting machine code; are you referring to the machine code the Lisp compiler emitted for your Lisp application or was your Lisp application acting as a sort of JIT compiler and actually emitting machine code?

Maybe you can reference the section in Practical Common Lisp that introduces the idea?


Machine code isn't some kind of magical idea. It is just bytes and you can craft the bytes any way you want.

In my case I wrote my own compiler that generated machine code bytes based on my needs.

The compiler wasn't at all complicated. It was not a general compiler that would have to deal with everything you could throw at it. It only supported one model of CPU. It did not require any optimisation mechanisms, because it was assumed the instructions it received were already optimised by the high-level logic that generated them. It supported only a very small number of instructions and mechanisms; it was mostly manipulating registers and doing branches and jumps. It produced small pieces of code to be inserted into a larger C program, mostly so that the C program did not have to make decisions at runtime about what to do.

You can also use SBCL to emit any vops you want to execute as part of your Lisp program. So if you want a piece of crafted code to run as part of your Lisp program (for example to do some complicated operation that would be inefficient in Lisp), you can do that too. I actually used vops to do some low-level stuff, but that was another matter; most of the time I was just outputting a buffer of crafted native machine code which would then be sent to the receiving thread (core) to insert and execute for the next market signal.


How did you inject the machine code into a running program on anything near a modern architecture? Unless you had your own OS, wouldn't code segments be RO and data segments not executable?

I don't know enough about any processor created in the last 30 years to know if running bare metal without a commercial OS would allow you to not have those constraints.


If you can modify your operating system you can do pretty much anything you want (Thanks, Linus!)


Can't you use `mprotect` to allow you to write to memory and execute?

https://github.com/Frodox/execute-machine-code-from-memory/t... has several examples of how to do it.

EDIT: https://www.kvakil.me/posts/2022-10-13-optimizing-mprotect-i... is another good link that explains this.


My team did it in Rust. Others in our company have done it in C++, making heavy use of available wizardry. It's widespread in the industry to do it in Java, albeit a strange subset of Java where you avoid most of the features.

I'd be amazed if anyone did it entirely in C. The productivity is just too low.


The question really is "why?" Most systems have relatively few hotspots or can be reorganised to have relatively few hotspots. You don't need to write everything in a low level language to just have those few things run fast, unless you are also facing other constraints like low memory.

I have better uses of my time than write megabytes of pointer manipulation in C. That's why I prefer high level languages that can easily coexist with C for those few parts that really benefit.


A trading system doesn't have a single hot spot that can be easily optimized. The critical paths can be hundreds of thousands of lines of code.

There are a lot of separate processes and subsystems that do not need to be latency-optimal (or latency-sensitive at all), and those are definitely amenable to being written in other languages. Of course, when you already have a bunch of C++ programs, things tend to end up being written in C++ even if it is not strictly necessary.


Such are the wonders of polyglot programming; too many devs are too focused on one language for everything.

Since 2006, the projects I work on have followed a similar approach: Java, C# and nowadays Node as well, coupled with C++ when required.


I wouldn't bundle C with Golang, or Java with Node. Golang and Java are roughly a tier of their own. Java is definitely a thing in this space [1], although it's a fairly unidiomatic style of Java that reduces allocations and puts a lot of emphasis on consistent low latency.

[1] e.g. https://marketswiki.com/wiki/TRADExpress ; though they've been absorbed into Nasdaq now and my visibility into CFT ended with that, no idea how much of their software is still running


Hopefully Panama will help to make it better, and there is still hope for Valhalla.


Not a real system, obviously, but a high level overview of what it does: https://www.kalzumeus.com/2015/10/30/developing-in-stockfigh...


Thank you for sharing; it has a few code snippets here and there. Would you be aware of a course or something (Google didn't help) that teaches how to build one from the ground up?



