When I think about serving large-scale LLM inference (like ChatGPT), I see it a lot like high-speed web serving — there are layers to it, much like in the OSI model.
1. Physical/Hardware Layer
At the very bottom is the GPU silicon and its associated high-bandwidth VRAM. The model weights are partitioned, compiled, and efficiently placed so that each GPU chip and its VRAM are used to the fullest (ideally). This is where low-level kernel optimizations, fused operations, and memory access patterns matter so that everything above the chip level tries to play nice with the lowest level.
2. Intra-Node Coordination Layer
Inside a single server, multiple GPUs are connected via NVLink (or an equivalent high-speed interconnect). Here you use tensor parallelism (splitting matrices across GPUs), pipeline parallelism (splitting model layers across GPUs), or expert parallelism (only activating parts of the model per request) to make the model fit and run faster. The key is minimizing cross-GPU communication latency while keeping all GPUs running at full load; there are many low-level software tricks here (a toy tensor-parallel sketch appears right after this list).
3. Inter-Node Coordination Layer
When the model spans multiple servers, high-speed networking like InfiniBand comes into play. Techniques like data parallelism (replicating the model and splitting requests), hybrid parallelism (mixing tensor/pipeline/data/expert parallelism), and careful orchestration of collectives (all-reduce, all-to-all) keep throughput high while hiding model communication (slow) behind model computation (fast).
4. Request Processing Layer
Above the hardware/multi-GPU layers is the serving logic: batching incoming prompts together to maximize GPU efficiency and molding them into ideal shapes to max out compute, offloading less urgent work to background processes, caching key/value attention states (KV cache) to avoid recomputing past tokens, and using paged caches to handle variable-length sequences (a toy decode-loop sketch appears at the end of this comment).
5. User-Facing Serving Layer
At the top are optimizations users see indirectly — multi-layer caching for common or repeated queries, fast serialization protocols like gRPC or WebSockets for minimal overhead, and geo-distributed load balancing to route users to the lowest-latency cluster.
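To make layer 2 a little more concrete, here is a minimal, hypothetical sketch of column-wise tensor parallelism, simulated with NumPy on one machine instead of real GPUs: the weight matrix is split column-wise across "devices", each shard computes its own slice of the output, and concatenating the slices (the all-gather step in a real system) reproduces the full matmul.

    # Toy simulation of column-wise tensor parallelism (no real GPUs involved).
    # In production the shards live on different GPUs and the final
    # concatenation is an all-gather collective over NVLink.
    import numpy as np

    def column_parallel_matmul(x, w, num_shards):
        """Split w column-wise, compute each shard's partial output,
        then 'all-gather' by concatenating along the last axis."""
        shards = np.array_split(w, num_shards, axis=1)      # one shard per "GPU"
        partial_outputs = [x @ shard for shard in shards]   # computed in parallel in reality
        return np.concatenate(partial_outputs, axis=-1)     # all-gather step

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 512))     # a small batch of activations
    w = rng.standard_normal((512, 2048))  # a "weight matrix" to be sharded

    assert np.allclose(column_parallel_matmul(x, w, num_shards=4), x @ w)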
Like the OSI model, each “layer” solves its own set of problems but works together to make the whole system scale. That’s how you get from “this model barely runs on a single high-end GPU” to “this service handles hundreds of millions of users per week with low latency.”
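And for layer 4, an equally hypothetical sketch of why the KV cache matters: a toy single-head attention decode loop that appends each new token's key/value to a cache instead of recomputing attention over the whole prefix from scratch. Real servers add paged allocation and continuous batching on top of this idea (vLLM's PagedAttention is the well-known example), but the core bookkeeping looks roughly like this.

    # Toy single-head attention decode step with a KV cache (NumPy, illustrative only).
    import numpy as np

    def decode_step(x_t, w_q, w_k, w_v, k_cache, v_cache):
        """Process one new token: project it, append its K/V to the cache,
        and attend over everything seen so far without recomputing old K/V."""
        q = x_t @ w_q                      # query for the new token only
        k_cache.append(x_t @ w_k)          # cache the new key
        v_cache.append(x_t @ w_v)          # cache the new value
        keys = np.stack(k_cache)           # (seq_len, d)
        values = np.stack(v_cache)
        scores = keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ values            # attention output for this step

    d = 64
    rng = np.random.default_rng(0)
    w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
    k_cache, v_cache = [], []
    for _ in range(8):                     # generate 8 tokens
        out = decode_step(rng.standard_normal(d), w_q, w_k, w_v, k_cache, v_cache)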
Hi Author - thank you very much for the clear and relatively easy-to-understand MPK overview. Could you please also comment on the similarity of your project to Hidet (https://pytorch.org/blog/introducing-hidet/)?
This story is dear to my heart. Let me tell you why - this is the tale of how my wife of 15 years, bless her heart, an occasional unstable genius, proposed a startlingly effective method for eradicating these invasive pythons.
She slammed her coffee cup down one morning with the conviction of an Old Testament prophet and declared: “Exploding rabbits.”
“Excuse me?” I said, wiping marmalade off my chin.
“Exploding. Rabbits. Stuff ‘em with a quarter pound of C4, or maybe just enough tannerite to surprise the neighbors but not call down the FAA, and set them loose in the Everglades. Pythons love rabbits. Boom. Problem solved. You’re welcome, America.”
Now I’ve heard my share of madcap schemes. Once she tried to compost credit card offers. But this time she looked me square in the eye with the righteous glow of a woman who had just solved two ecological crises and accidentally founded a billion-dollar startup in the process.
“We’ll call it Hare Trigger™,” she added, deadpan. “It’s got product-market fit and explosive growth potential.”
She even sketched out a logo involving a jackrabbit with aviator goggles and a plunger.
I asked if this might attract some sort of federal attention.
“Good,” she said. “That’s called buzz. Besides, the pythons started it.”
And just like that, I found myself wondering just how true it is that behind every successful man stands an even more genius woman. Waiting for Elon to offer a Series A.
For decades, Guam has had an invasive brown tree snake problem that completely decimated its native bird population and then some. Your wife's proposed solution reminded me of something the military actually tried: parachute thousands of dead mice "pumped full of painkillers" across the island...no joke[1].
I suppose the key difference in liability here is: see a dead mouse on Guam, hope your pet doesn't eat it...see one of your wife's rabbits (live or dead) in Florida, immediately cordon off and notify local UXO disposal team.
Tannerite, when crushed, initiates a small explosion, which triggers the C4. When a python swallows a rabbit, it then crushes the rabbit's body with its powerful muscles to aid digestion. This is when the explosion would happen.
This is my best understanding. I have no idea where inside a rabbit there would be room enough for the C4 and tannerite, and how to put it inside enough rabbits.
From what I’ve read a .22 caliber bullet or heavier is necessary to ignite tannerite. I’m not a herpetologist, nor a firearms expert, but I have my doubts that a python ingesting prey produces similar forces as a .22 striking a target, at least in terms of force per area.
GPU sharing is a concern for sensitive data. It is more appropriate to increase the utilization rate of GPU chip internals via a variety of low-level (CUDA and below) optimizations.
Optimizing AI performance is like peeling an onion — every time you remove one bottleneck, another layer appears underneath. What looks like a compute problem turns out to be a memory bottleneck, which then turns out to be a scheduling issue, which reveals a parallelism mismatch… and so on.
It’s a process of continuous uncovering, and unless you have visibility across the whole stack — from kernel to cluster — you’ll spend all your time slicing through surface layers with lots of tears being shed.
Fortunately, there are software automation solutions to this.
Canadian advisories (presumably from NAV CANADA) are fed into the ATCSCC system. You can see that more explicitly on the less slick version at https://www.fly.faa.gov/adv/advAdvisoryForm.jsp
Plenty of US domestic flights leave YYC every day. You clear US customs and immigration in Canada before getting on the plane so you can land at any airport in the entire country, not just international ones.
Model training observations from both Llama 3 and 4 papers:
Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, which translates to a solid 38–43% hardware efficiency (MFU) [Meta, Llama 3].
For Llama 4 training, Meta doubled the compute to ~32K H100s and switched to FP8 precision. Despite FP8's higher theoretical throughput, observed efficiency dropped to about 19.7%, with GPUs delivering ~390 TFLOPS out of a theoretical 1,979 FP8 TFLOPS [Meta, Llama 4].
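The efficiency figures above follow directly from the observed-vs-peak ratio. A quick sanity check, assuming the commonly cited H100 dense peaks of ~989 BF16 TFLOPS and ~1,979 FP8 TFLOPS (exact figures depend on clocks and whether sparsity is counted):

    # Rough MFU sanity check from the numbers quoted above.
    H100_BF16_PEAK_TFLOPS = 989    # dense BF16 peak (no sparsity)
    H100_FP8_PEAK_TFLOPS = 1979    # dense FP8 peak (no sparsity)

    def mfu(observed_tflops, peak_tflops):
        return observed_tflops / peak_tflops

    print(f"Llama 3 (BF16): {mfu(380, H100_BF16_PEAK_TFLOPS):.0%} - {mfu(430, H100_BF16_PEAK_TFLOPS):.0%}")
    print(f"Llama 4 (FP8):  {mfu(390, H100_FP8_PEAK_TFLOPS):.1%}")
    # -> roughly 38%-43% for Llama 3, ~19.7% for Llama 4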
I am not one to critique; rather, this is a recognition of the enormous complexity of operating GPUs at this scale. Training massive models across tens of thousands of GPUs stretches today’s AI infrastructure to its limit.
Besides accelerating inference workloads, advanced GPU optimizations can be integrated into training and fine-tuning pipelines. From a broad set of kernel optimization techniques (over 90), to more efficient memory access, to cluster-wide resource coordination, efficiency can be maximized with some fairly complex software.
That's about the same number as for DeepSeek-V3. If you count in fp8, MFU is about 20%. MoEs are hard.
That could also be why they did fp8. If we use the theoretical bf16 performance as the baseline (I know this makes little sense, but it's convenient for comparing with previous training runs), it's about 40% MFU, not too bad.
IOW, MoE kills training MFU, and they had to do fp8 to keep the numbers from looking bad. Both DeepSeek and Meta GenAI.
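For the bf16-baseline framing above, a hypothetical back-of-the-envelope using the same ~989 TFLOPS dense bf16 peak for an H100:

    # Back-of-the-envelope for the "bf16-equivalent MFU" framing above.
    observed_fp8_tflops = 390     # what the GPUs actually delivered
    bf16_peak_tflops = 989        # H100 dense bf16 peak, used as the baseline
    print(f"{observed_fp8_tflops / bf16_peak_tflops:.0%}")  # ~39%, i.e. "about 40% MFU"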
It's not just scale. Even on a single GPU, it is hard to achieve the 2x speed improvement the GPU spec states. Even NVIDIA's own Tensor Engine achieves only 28% extra FLOP/s [1].
And the practical FLOPS always end up lower. As an example, a V100 is rated at 125 TFLOPS on the spec sheet, but the ideal case is more like 100 and the non-ideal case more like 60.
Never trained a model, but the precision discussion confused me, as I've never considered how many bits should be reserved for the exponent/mantissa.
Has anyone architected a model (somehow) such that it has a free hand in using the given bits / choosing the type, or changed types from layer to layer? I mean, surely when training, say, vision models, the first layers deal with the "big (yet simpler) picture" (light/dark, lines, etc.), whereas the last layers deal with the fine details.
Even though it may not be suitable for (existing) hardware implementations, it may be advantageous elsewhere, for example in learning speed.
You can't choose arbitrary bits of mantissa, because what types are allowed is defined by the underlying hardware and instruction set (PTX for Nvidia). People have done some exploration of which layers can be quantized more vs. which need to be kept in higher precision, but this is usually done post-training (at inference time) and is largely empirical.
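As a hypothetical illustration of that kind of post-training, empirical exploration: you can "fake-quantize" one layer at a time by round-tripping its weights through a lower-precision dtype and measuring how much the output degrades. This is a toy sketch, assuming a PyTorch build recent enough to expose the float8 dtypes; real quantization toolkits add scaling factors, calibration data, and so on.

    # Toy per-layer sensitivity probe: fake-quantize one linear layer at a time
    # and see how much the model's output moves. Purely illustrative.
    import copy
    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    )
    x = torch.randn(64, 128)
    reference = model(x)

    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue
        probe = copy.deepcopy(model)
        layer = dict(probe.named_modules())[name]
        with torch.no_grad():
            # Round-trip the weights through float8 (E4M3) and back to float32.
            layer.weight.copy_(layer.weight.to(torch.float8_e4m3fn).to(torch.float32))
        error = (probe(x) - reference).abs().mean().item()
        print(f"layer {name}: mean output error after fake fp8 quantization = {error:.4f}")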
While the other commentator is correct -- you can't just choose arbitrary floating-point formats if you want to run performantly on existing hardware -- there is some variety to choose from once you get down to the lower precisions. At 16 bits you can take either the standard IEEE fp16 format (1/5/10) or the exponent-heavy bf16 (1/8/7); for 8 bits, there technically is no IEEE specification, but in practice the E5M2 format (1/5/2) serves as "IEEE-equivalent" while E4M3 (1/4/3) takes some liberties with NaNs and drops infinities altogether -- and both are supported on recent Nvidia GPUs.
So between these four you honestly cover _most_ of the desired solution space: e.g. it's hard to imagine wanting to give up more of the mantissa than you already do on E5M2, while E4M3 is already at the lower bound of dynamic range before you need to start giving up IEEE compatibility (which can definitely be a pain). There's some room left at the fp16 level, but bf16 was designed with neural networks in mind, so in practice people are happy using it for training and then leaving inference to fp16 (which has higher precision).
The only thing that's missing is support for more esoteric formats, e.g. fp4 (E2M1, E3M0) and maybe packed ternary.
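For reference, a small sketch of what those trade-offs look like numerically, assuming a PyTorch build recent enough to expose the float8 dtypes (torch.finfo reports the largest representable value and the relative precision of each format):

    # Compare dynamic range vs. precision of the common training/inference dtypes.
    import torch

    formats = {
        "fp16 (1/5/10)": torch.float16,
        "bf16 (1/8/7)":  torch.bfloat16,
        "E5M2 (1/5/2)":  torch.float8_e5m2,
        "E4M3 (1/4/3)":  torch.float8_e4m3fn,  # the "fn" variant: no infs, nonstandard NaNs
    }
    for name, dtype in formats.items():
        info = torch.finfo(dtype)
        print(f"{name:>14}: max={info.max:>12.4g}  eps={info.eps:.3g}")
    # fp16 tops out around 6.55e4, bf16 around 3.39e38 (roughly fp32's range),
    # E5M2 around 5.73e4, and E4M3 at 448 -- range vs. precision in miniature.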