Why are ethernet jumbo frames 9000 bytes? (2018) (dave.tf)
86 points by zdw on April 21, 2023 | 72 comments



> Perhaps understandable given how hilariously little adoption there’s been of jumbo frames on the wider internet.

Jumbo frames are not intended¹ to be adopted on the internet, and likely never will be. You need end-to-end control of devices to make sure they work correctly (as the article points out, even PPPoE ⇒ 1492 has PMTU issues on the open internet; it's best practice to run PPPoE on 1508 so you get back to 1500…) But they are very much used inside administrative domains, e.g. for SANs and cloud interconnects. It's visible when you are a cloud customer, but absolutely not on the open internet from e.g. your home or mobile connection.

[¹] by the current understanding; I have no idea if Internet2 was ever as bold as thinking it might be possible to roll out 9k on the Internet "1".
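
For reference, the arithmetic behind those PPPoE numbers, as a minimal Python sketch (the 8 bytes are the 6-byte PPPoE header plus the 2-byte PPP protocol ID; 1508 is the "baby jumbo" frame size from RFC 4638):

    # PPPoE puts a 6-byte PPPoE header plus a 2-byte PPP protocol ID
    # inside the Ethernet payload.
    ETH_PAYLOAD_STANDARD = 1500    # standard Ethernet MTU
    ETH_PAYLOAD_BABY_JUMBO = 1508  # "baby jumbo" frames (RFC 4638)
    PPPOE_OVERHEAD = 6 + 2

    print(ETH_PAYLOAD_STANDARD - PPPOE_OVERHEAD)    # 1492: IP MTU over plain PPPoE
    print(ETH_PAYLOAD_BABY_JUMBO - PPPOE_OVERHEAD)  # 1500: full IP MTU restored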


MTU is ridiculously difficult in telco networks. Because most transport is moving packets via Ethernet at the endpoints, you’ve got to make sure your customers can use whatever MTU they want, including 9000+. This means a lot of variation in frame size support, even within individual vendor product lines. Now, let’s see here…is a jumbo frame 9000, 9192, 9216, 9500, 10000, 16000 bytes in size? Who knows. That’s just Juniper. Extrapolate this uncertainty to dozens of different vendors and models of networking gear, AND how they each count bytes, interframe gaps, 802.1q and 802.1ad…


IPv6 got rid of the need to have end-to-end control of devices to get an arbitrary MTU working along a path. On the other hand, IPv6 also raised the minimum MTU to 1280 and got rid of in-network fragmentation, so a lot of big places actually just set 1280 instead.


I think it's only that in v4 the node has the option of requesting MTU discovery, as practically everyone does, or having the on-path routers do fragmentation transparently. In v6 it always works like requesting path MTU discovery in v4 (the "don't fragment", aka DF, header bit turned on).

So the simplification is that in v6 the legacy mode was eliminated.


The change with IPv6 was more than a simplification of removing fragmentation (though that was welcomed, since most devices didn't do it anyway). In v4 you had the option to probe, you had the option to play it safe with the minimum 576 bytes, or you had the option to assume it was a larger value. In v6, spec-compliant clients must either use the minimum 1280 or probe; the ability to assume anything is no more. The only reason things didn't fall apart in IPv4 despite the rampant blocking of ICMP for "security" reasons is that everyone started using TCP, which has MSS negotiation and MSS clamping.
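
To make the "always probe" behavior concrete, here is a minimal Linux-only Python sketch of what it looks like at the socket level: the DF bit is forced on, the kernel refuses to fragment locally, and the application can read back the path MTU the kernel currently knows. The peer address is a placeholder, and the numeric fallbacks are the <linux/in.h> values for cases where the socket module doesn't expose the constants by name.

    import errno
    import socket

    # Linux-only sketch: force the DF bit and let the kernel track the path MTU.
    IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
    IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
    IP_MTU = getattr(socket, "IP_MTU", 14)

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("192.0.2.1", 9))  # placeholder peer (TEST-NET-1, discard port)

    try:
        # Larger than a typical 1500-byte interface MTU, so with DF forced on
        # the kernel refuses to fragment and fails the send locally.
        s.send(b"\x00" * 8000)
    except OSError as e:
        if e.errno == errno.EMSGSIZE:
            mtu = s.getsockopt(socket.IPPROTO_IP, IP_MTU)
            print("path MTU as currently known by the kernel:", mtu)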


The article misses the influence of the checksum.

I saw a graph at the time showing how many single-bit errors the checksum misses, and the graph was flat-ish up to around 8-10k and rose after that.


Isn't Ethernet's CRC-check symmetric? Like a 1-bit parity, except 32-bits?

IE: A 1-bit Parity bit is symmetric, and will *NEVER* miss a single-bit error. 1-bit Parity can only miss 2-bit, 4-bit, 6-bit, 8-bit... errors.

I have an expectation for Ethernet's 32-bit CRC to have a similar property.

EDIT: I looked it up at the CRC Zoo. Ethernet is __NOT__ symmetric. https://users.ece.cmu.edu/~koopman/crc/c32/0x82608edb.txt

That means 1-bit, 3-bit, 5-bit (etc. etc.) errors can slip through.


I feel so virtuous now.

I remember being curious about why the graph didn't look like a simple function of the packet length (linear or some other simple function), but I did not procrastinate by investigating the function and learning about why.


Ehhh, ish?

The page I linked above clearly demonstrates that Ethernet-CRC32 is Hamming-distance 2 at 524288 bits (aka 65536 bytes, or 64kB) of input. IE: It's still *impossible* to have a 1-bit error escape undetected on 64kB frames (let alone 9000-byte jumbo frames).

I don't know where the 1-bit errors start to occur, but it's well beyond 64kB (which is all the CRC Zoo tested for).

--------

Maybe you're misremembering your graph / data? There's also CRC-16, or even CRC-5 (aka: USB uses CRC5). Based on various analyses, it's clear that CRC32 is sufficient for any typical Ethernet frame of any size under 64kB (and probably for many frames of larger size).
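
The single-bit claim is also easy to sanity-check by brute force. Python's binascii.crc32 uses the same generator polynomial as Ethernet's FCS (bit-ordering details aside), so flipping every bit of a random 9000-byte frame one at a time and recomputing should never reproduce the original CRC:

    import binascii
    import os

    # Flip every bit of a random 9000-byte frame, one at a time, and check that
    # the CRC-32 always changes, i.e. no single-bit error goes undetected.
    frame = bytearray(os.urandom(9000))
    good_crc = binascii.crc32(frame)

    undetected = 0
    for bit in range(len(frame) * 8):
        frame[bit // 8] ^= 1 << (bit % 8)  # flip one bit
        if binascii.crc32(frame) == good_crc:
            undetected += 1
        frame[bit // 8] ^= 1 << (bit % 8)  # flip it back

    print("undetected single-bit errors:", undetected)  # prints 0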


Or what kind of error they were looking at. Bit insertions or removals perhaps.

I can't find the thing I looked at. Another paper I found now writes that 64kB is far beyond the limit, though: "With Ethernet, the FCS computation uses a 32-bit cyclic redundancy check (CRC-32). CRC-32 error checking detects bit errors with a very high probability. But as frame size increases, the probability of undetected errors per frame may increase. Due to the nature of the CRC-32 algorithm, the probability of undetected errors is the same for frame sizes between 3007 and 91639 data bits (approximately 376 to 11455 bytes). Thus to maintain the same bit error rate accuracy as standard Ethernet, extended frame sizes should not exceed 11455 bytes." https://web.archive.org/web/20110807131142/staff.psc.edu/mat...

The thing I read at the time didn't agree that 11k and 9k have the same accuracy, though.


"Jumbo frames" aren't any size. There is no standard. The standard is IEEE 802.3 1518 byte frames. Anything larger is a non-standardized agreement between link partners on a broadcast domain.

There are NICs which only support 4K frames. Those are also "jumbo frames".

It's convention to call anything from 1501 to 9216 "jumbo" and anything up to 64k "superjumbo".


> There is no standard.

There are standards, they just aren't relevant. 802.11(n) specifies 7935 bytes (A-MSDU), 802.3(as) specifies "frame envelope" at 2000 bytes. (The latter is to have room for encapsulation headers to be able to deliver 1500 regardless of encap.)


802.11 isn't Ethernet (though it does have a larger frame size), and A-MSDU in particular standardizes aggregation of multiple MAC frames, not an extension of the MAC frame. However, larger frame envelopes are definitely a standardized thing for Ethernet, but the IEEE folks will fight to the death that it's a separate concept, as you say.


Oh, yeah, don't get me started... It is SO damn freaking annoyingly hard to set up a common packet size for all my devices! Completely ridiculous! For example, I have NAS-to-switch 10G fiber that supports > 9000, switch-to-clients 10G fiber and 1G copper, and every damn machine has a different NIC with not only different frame caps (some 4K, some 8K, some 9K, others unknown), but the numbers mean different things!! For example, Intel vs Mellanox accept byte counts with and without headers (don't remember which is which) - try guessing! That's in Windows, of course, where vendors design their own UIs/configs.


Also, these problems are hard to debug. Oh, and don't ask me what happens if someone crams a small dumb no-name switch into the net...


Huh, where can I read more about super jumbo? I could use it.


Start with your NIC driver. Most I've seen don't support frame sizes over 9000ish, which I assume reflects a hardware limitation.


> And of course, IPv6 supports 65k packets out of the box, and already has an extension in case you want to send more than 65k per packet.

IPv4 and IPv6 both support 65k packet lengths.


> Meanwhile, the internet will be over here, still sending 1500 bytes at a time.

Is this part true? Aren't the larger data center interconnects using 9000 bytes internally? Also, wouldn't fat long-distance links be using larger packets? I would have thought that a 4% wire overhead (and who knows how much CPU) would have pushed people away from 1500 for anything but the last mile.
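
For reference, the rough numbers behind that ~4% figure, assuming plain TCP/IPv4 over untagged Ethernet and counting preamble, header, FCS and inter-frame gap:

    # Bytes per frame that carry no application data:
    ETH_WIRE = 7 + 1 + 14 + 4 + 12  # preamble + SFD + header + FCS + inter-frame gap = 38
    IP_TCP = 20 + 20                # IPv4 + TCP headers, no options

    def efficiency(mtu: int) -> float:
        return (mtu - IP_TCP) / (mtu + ETH_WIRE)

    for mtu in (1500, 9000):
        print(mtu, f"{efficiency(mtu):.1%}")
    # 1500 -> 94.9% (about 5% overhead)
    # 9000 -> 99.1% (under 1% overhead)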


It probably depends on who's running the DC, and how well traffic is separated.

Path MTU discovery is all sorts of fragile, so exposing a TCP maximum segment size implying an MTU above 1500 is asking for trouble; actually even just 1500 isn't always a good idea [1]. There are many networks where too-big packets are dropped without notification, and if the other end is also on a 9000 MTU LAN, then you can get stuck. sigh

On the host side, packetization is not that big of a deal; nics can help, but it's also just not that many packets. Where hosts tend to run out of cpu is when you're dealing with much smaller packets.

Larger packets could help routers of course. And larger packets would probably mean fewer acks, which would be nice too. But, it's unlikely to happen unless mtu probing is enabled in more places.

[1] The best is to send the lesser of (theirs - 28) or your actual value. Well, best is relative: this will result in the most successful connections, at the expense of more packets on networks where people aren't insane.


The main reason why Path MTU Discovery is fragile is that many people ignorantly filter all ICMP traffic (including type 3). But educating people about ICMP and its role in PMTUD [1] seems like a lost battle. I wish to be proven wrong, but I have the impression that younger developers, devops/SREs and even network engineers know less about network protocols than 10-20 years ago. Nowadays I rarely meet people who know what Path MTU Discovery is. I even stopped asking about it in job interviews (for devs/ops) because no one can answer; it's lucky if a candidate can say anything about ICMP at all.

[1] http://www.znep.com/~marcs/mtu/
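
For readers who haven't met it: PMTUD relies on the router that drops a too-big DF packet sending back an ICMP Destination Unreachable with code 4 ("fragmentation needed"), which carries the next-hop MTU. A minimal sketch of parsing that message per RFC 792 / RFC 1191 (icmp here is assumed to be the raw ICMP portion of the packet):

    import struct

    def parse_frag_needed(icmp: bytes):
        # Layout per RFC 792 / RFC 1191:
        #   type (1) | code (1) | checksum (2) | unused (2) | next-hop MTU (2)
        # followed by the original IP header plus the first 8 bytes of its payload.
        icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack("!BBHHH", icmp[:8])
        if icmp_type == 3 and code == 4:
            return next_hop_mtu  # the value PMTUD feeds back to the sender
        return None

If a firewall silently eats these messages, the sender never learns the smaller MTU and big packets simply vanish, which is exactly the black-hole behavior described above.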


The two things that boil my blood when dealing with SecOps people are:

1) Dropping all ICMP packets on the floor for "security" reasons, which means that basic diagnostics are now impossible, you get random issues with software that uses ping, and Path MTU Discovery is broken forever.

2) Leaving internal firewall ports with the default settings, which are intended for Internet-facing ports. For example, for "denied" incoming traffic from the Internet, the correct response is to silently drop the packet. Internally, the correct thing to do is to respond with a rejection (e.g. a TCP RST) to instantly close the connection. Without this, you spend days and days chasing down random and difficult-to-troubleshoot timeouts and weird 30-second delays all over the place.

As a random example, Windows RDP makes an HTTP call out to the Internet from the server to verify a CRL. This is safe and secure; without it, bad certificates can't be blocked. Unfortunately, this happens in the "SYSTEM" context, which tends not to get proxy settings applied to it, so it is often blocked, with a 30-second timeout. This failure is cached for 24 hours on the server, which causes a maddening delay when you connect to servers. Every day. Every server. But you can't reproduce it, because the second time it won't happen. (It also won't happen if anyone else connects to the server right before you.)

Years later, I still get angry remembering the snarky comments by the firewall guy saying that I'm just imagining things.


That's one reason, but there are several other common issues.

At high-volume MTU bottlenecks, the routers dropping packets are likely to limit how many ICMPs they send; otherwise they'll run out of CPU. This wouldn't be terrible if ICMPs weren't so commonly dropped.

Some bottlenecked routers don't have globally routable IP addresses, and may not be able to send ICMP at all. This can happen inside long-haul networks, where there may be tunneling that reduces the effective MTU. This seems less common today, but I dunno? One 'recent' advance is that some PPPoE networks use 'mini jumbo' frames, with an Ethernet MTU of 1508, so that the PPP MTU is 1500; that may be happening at other layers with tunnels too.


Yes, unfortunately it is not hard to find routers without a globally routable IP on a loopback or interface (whatever is used as the source for ICMP unreachables in a given configuration), but if they have any IP at all, even a private IP, PMTUD may still work as long as ICMP messages with private source IPs are not blocked by a firewall. But when a packet is dropped inside a carrier network (and not on its edges) because of an MTU mismatch, it looks like a gross misconfiguration to me.


Isn’t it baked into IPv6 as something that happens automatically? I’m seeing better traction with ISPs seeming to finally hand out IPv6 addresses.


IPv6 still has PMTUD, but ICMP blocking in v6 is less common because blocking all ICMPv6 messages breaks the network in more obvious ways than blocking all ICMPv4 messages.


Absolutely. JF adoption is very poor and packet dropping is a (frustrating) thing. To be compatible, DCs need to invest in really powerful border/backbone routers (and their configuration) for fast enough packet splitting.

PS: Even at CERN, where cooperation is supposedly tighter, JFs were only an experiment, and, AFAIR, not a very successful one.


No one really peers at above 1500 without some really good reason. There be dragons.


I remember observing my colleagues who had spent like a day troubleshooting an issue. The root cause was one end of a link sending super jumbo frames, which the other one was not configured to accept.

It may sound simple but it was pretty hard for them to reproduce.


Pretty good if they did it in a day. Encountered the same issue and it took me at least three days to resolve. The issue presented itself as a database table not updating.


It was three to four people who were poring over packet analysers, so probably a similar number of man hours.


For me it was the network going down after maximizing an SSH console window :D


Once you've seen it happen you think to check for it pretty quickly.


I have CenturyLink business internet with a static IP. It has 1492 MTU. So technically 1500 isn't always true, sometimes it's smaller :)


I know you’re just making a joke, but I was very careful about distinguishing MTU in the last mile from MTU within the routers powering the world’s interconnectivity.


We have two 100Gbps links with CenturyLink and we use an MTU of 1500.


By nature, interconnect effective MTU is governed by last mile MTU.


With modern network cards, at the endpoints, the CPU overhead is zero because of TCP segmentation offload and generic receive offload. And routers use ASICs, not CPUs.


This is not true at all. All modern routers use a combination of ASICs and CPUs. Some of the data plane even runs through the CPU, depending on the protocol. Also, TCP segmentation offload does not mean that the entire TCP protocol is offloaded onto the NIC; only a small part of it is, and the rest still goes through the CPU.


With TSO and GRO, the CPU deals with jumbo frames and the NIC handles splitting and combining them.

You are correct that intermediate boxes all take a perf hit for the extra packets, because routers and other boxes are usually packets per second bound at some point, rather than bandwidth.


HTTP/3 uses UDP. GSO can help with UDP, but is considerably more limited than TSO for TCP. The TCP interface to the kernel allows arbitrarily large buffers of data to be provided by the app, which the kernel/driver/card can segment. The UDP interface caps this at 64K.
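
Roughly what the UDP side looks like on Linux (4.18+), sketched in Python; UDP_SEGMENT isn't exposed by Python's socket module, so its value is taken from <linux/udp.h>, and the peer address is a placeholder:

    import socket

    UDP_SEGMENT = 103  # from <linux/udp.h>; not exposed by Python's socket module

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask the kernel (or NIC) to slice each send() into 1400-byte UDP datagrams.
    s.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1400)
    s.connect(("192.0.2.1", 4433))  # placeholder peer

    # One syscall turns into many wire datagrams, but the buffer handed to the
    # kernel is still capped around 64K, unlike TSO on a TCP socket.
    s.send(b"\x00" * 60000)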


Jumbo frames are just a guarantee that you'll have to debug a complicated problem in the future


Can confirm, although we used it fairly successfully in our datacenter. We had a few instances where setting the NIC config failed and someone who didn't know anything about jumbo frames ended up trying to debug a strange error. It was normally only an hour before they reached out and we got it fixed.

This normally only happened when servers arrived with new NICs and no one thought to give us a heads up.


Can you share a bit more detail regarding overall network performance before and after deploying jumbo frames in your DC?


Layer 2 (Ethernet) DCI will often be set up with 9000 (or larger) MTUs - the reason for this is to allow end-customers to carry their traffic using whatever encapsulation they want to internally (eg: Q-in-Q, MPLS, VXLAN, MACSEC, GRE or all of the above simultaneously), and the carrier/DC will transparently support it.

But when it comes to external L3 peering between 3rd-parties (eg: the Internet), it's very rare that people will mess with the IP MTU. Unless you can guarantee that the IP PMTU is higher than 1500 bytes (across multiple segments that you may not control, including ISP last-mile) there is simply no benefit.


Some interconnects between separate organizations are jumbo but those tend to be PNI or secondary vlans.

The reason the Internet still runs exclusively on 1500 byte IP MTU is that any lower MTU in the path will effectively make the jumbo segments useless. This means any PMTU-D problems customers experience would have been unnecessary and avoidable by just using 1500 everywhere.

Many public IXs (peering exchanges) have a jumbo VLAN but it's almost always separate from the standard one that only allows 1500-byte MTU.


Yes, most large data centers use bigger frames internally. It's only when you get past the CGNAT router that they go back to the compatible sizes.


What I want to know is why are AWS jumbo frames 9001 bytes?


From an older post by someone (@spaceprison) who seems familiar with the rationale:

> 9001 bytes was the absolute max they could give to customers and still be able to do packet shenanigans.

https://news.ycombinator.com/item?id=30838866

See post for more detailed context.


Because it's over 9000


Really, because of a meme?


ISO compliance?


I wonder whether this was at all influenced by SPARC Solaris's 8K page size. Sun was still a pretty big player in the internet and networking around this time.

Or maybe it's transitive, 8K pages influenced NFS transfer size, or retrospective, NFS transfer sizes influenced page size (I don't know the history of Solaris page sizes or NFS protocols).


More likely a multiple of x86's 4096-byte page size


Why?


Over time, packets have actually been shrinking on the internet. The MTU for LTE is 1428, for example. On top of that, IPv6 has 1280 bytes as the minimum requirement, and this is close enough to 1428/1500 that some places find it easier to just say "and so 1280 it is" to never think about supported frame sizes again.

Unrelated but fun: Of all the enterprise gear in my home network lab a $25 TP-Link consumer switch from Amazon has the highest MTU at 15k jumbo. TL-SG108E


> The MTU for LTE is 1428 for example.

Is it? As far as I can tell, T-Mobile US gives me a 1500 MTU with my LTE modem, and 1416 on my cell phone.


I knew 3G was carrier dependent but had always seen LTE as 1428 https://customer.cradlepoint.com/s/article/how-to-determine-.... Could well not be the case though, I definitely haven't played with every combination of carrier, plan type, and modem. PCO does allow the carrier to push the supported MTU to a client so maybe that's what's happening here just up instead of down.


9000 is the standard "customer" jumbo IP MTU for reasons outlined in the article, but ISP backbone links are necessarily larger than that to accommodate other headers, typically MPLS and/or VXLAN. IPsec also has layer 3/4 overhead, but that is usually used by the customer, not by the ISP in their infrastructure.

In addition to the IP MTU you also have layer2 headers, almost always Ethernet these days. Ethernet can optionally have one or more 802.1Q headers, among other things.

Networks that use only one vendor sometimes just configure their internal links at the maximum supported by that vendor/platform. In multi-vendor networks this is problematic because each vendor has different maximums. In that case it is best to set an internal standard like 9100 IP MTU that is within every vendor's limits and leaves plenty of room for all the overhead (layer 2 and 3) listed above.
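
As a rough illustration of that headroom math, the sketch below wraps a 9000-byte customer IP packet in a couple of common encapsulations and checks it still fits under a 9100-byte backbone IP MTU (illustrative header sizes only; real deployments vary):

    CUSTOMER_IP = 9000
    BACKBONE_IP_MTU = 9100

    encaps = {
        "GRE":   20 + 4,           # outer IPv4 + basic GRE header
        "VXLAN": 20 + 8 + 8 + 14,  # outer IPv4 + UDP + VXLAN + inner Ethernet
    }
    for name, overhead in encaps.items():
        outer = CUSTOMER_IP + overhead
        print(f"{name}: {outer} bytes, fits in {BACKBONE_IP_MTU}: {outer <= BACKBONE_IP_MTU}")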

It's important to note that some protocols like OSPF require all participants on a segment to have the exact same IP MTU configuration.

EDIT: clarified 9000 is the standard customer jumbo IP MTU size, not standard customer IP MTU size which is of course 1500.


I've also seen a lot of places leave IP MTU at the default 1500 and only utilize jumbo for local L2s (e.g. storage) or encapsulated traffic (e.g. EVPN). This avoids the whole IP fragmentation set of pitfalls while still allowing blind transport of larger packets otherwise.


This article is missing one of the main points for not going much larger. The larger you go, the more data you have to resend on a retransmit. Ethernet doesn't have fragmentation, so you can't simply resend a piece of it. In other words, if you had a 1MB frame that was lost, you would have to retransmit the entire thing again.


> The larger you go, the more data you have to resend on a retransmit.

The reverse is true also! From TFA:

> There’s certainly a performance incentive to not fragment [8192-byte] NFS traffic.

Nowadays, most NFS traffic is likely over TCP. But back in the day it was not. So dropping 1 frame out of the 6 that comprise a packet meant that you had to retransmit all 6. BTW this is why DNS has lots of compression and limits on UDP packet size before [compliant] implementations switch to TCP. NFS has no such provisions.
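
To put numbers on that: an 8192-byte NFS request over UDP on a 1500-byte MTU becomes 6 IP fragments, and losing any one of them loses the whole request. A quick sketch, assuming independent per-fragment loss (a simplification):

    import math

    # Each non-final IPv4 fragment carries MTU - 20 bytes of payload.
    payload = 8192
    frag_payload = 1500 - 20
    fragments = math.ceil(payload / frag_payload)  # 6

    for loss in (0.001, 0.01, 0.05):
        p_ok = (1 - loss) ** fragments
        print(f"frame loss {loss:.1%} -> request survives {p_ok:.1%} of the time")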


At 10 Gbps, 1500-byte packets mean over 800,000 packets/second on the wire; even with 9000-byte packets you're still pushing around 140,000 packets/second.

Compare that to, say, a 56K modem that can handle all of 4 1500-byte packets/second.

Also, not only does the time to retransmit matter, but also the amount of power we expend per packet. I think it's ridiculous to have thousands of times more overhead than necessary, and this likely hurts a lot more than the rare retransmission.
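
The packet-rate math behind those figures, counting the full on-wire frame (payload plus 38 bytes of Ethernet header, FCS, preamble and inter-frame gap):

    def packets_per_second(link_bps: float, payload_bytes: int) -> float:
        # 38 bytes of per-frame overhead: preamble+SFD (8), Ethernet header (14),
        # FCS (4), inter-frame gap (12)
        return link_bps / ((payload_bytes + 38) * 8)

    for payload in (1500, 9000):
        print(payload, f"{packets_per_second(10e9, payload):,.0f} pps at 10 Gbps")
    # 1500 -> ~813,000 pps
    # 9000 -> ~138,000 pps
    print(f"{packets_per_second(56e3, 1500):.1f} pps on a 56K modem")  # ~4.6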


That and the checksum (as it was originally designed) becoming relatively useless at larger sizes.


I always thought it was one of those legacy things, because the max frame size for Frame Relay was 9000 bytes. But in theory an Ethernet frame can be 64K, I believe, though that is not really practical, of course, for the retransmission reasons mentioned in other threads here.


Probably a bigger reason to use an MTU less than 64k (9k-16k) is that 64k frames can quickly fill up small buffers in switches and routers, and memory allocators in OS kernels may also be optimized for lower values.
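
To put numbers on the buffer concern: the time a single frame ties up an output port (and hence buffer space) grows linearly with frame size. A quick sketch:

    # Time to serialize one frame onto the wire, a proxy for how long a single
    # frame occupies a switch buffer / output port.
    def serialization_us(frame_bytes: int, link_bps: float) -> float:
        return frame_bytes * 8 / link_bps * 1e6

    for frame in (1500, 9000, 65536):
        print(f"{frame:>6} bytes: {serialization_us(frame, 10e9):6.1f} us at 10 Gbps, "
              f"{serialization_us(frame, 1e9):7.1f} us at 1 Gbps")
    # 1500  ->  1.2 us / 12.0 us;  9000 -> 7.2 us / 72.0 us;  65536 -> 52.4 us / 524.3 us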


Historically, like 20, 22 years ago, it was a LOT more costly to put some high-speed RAM in 24/48-port gigabit Ethernet switches for port buffers. Now we live in an era of high-speed cheap memory, so even throwing gobs of RAM at the RIB and FIB on routers meant to take multiple full BGP tables isn't so costly.


Jumbo frames are used only in LANs (mostly DC LANs) where packet loss is low and retransmits are rare.


Are jumbo frames needed any more?

Most NICs have various types of offloading now, especially if you get into 10+ GigE, so how much of a bottleneck is frame processing nowadays? (Either on the send or receive side.)


Can confirm there's still a large performance difference with classical network methods (sockets) when you're trying to reach 10Gb. All the nice things implemented down there have an interesting effect if you're using lots of parallel streams and applications, but for those of us streaming lots of data in real time, jumbo frames are still very much useful. I noticed a 10 to 20% CPU load difference between the two. I've even forgone IP fragmentation for these kinds of streams, doing application-level packet fragmentation/reassembly with an acceptable loss rate.

Now we're moving to 100Gb and soon 400-800Gb and anything helps up there. I'm waiting for someone to standardise some way to reassemble jumbo frames into large 1MB+ 'application packets' using custom headers and reassembly algorithms directly in NICs (hopefully something standard like P4) before DMA-dispatching them to user memory.


I'm not sure they were ever truly needed (just as 1500 isn't truly needed, 576 bytes works "almost the same" as well) but there is some nicety in certain payloads, such as the aforementioned NFS, either completely making it or completely not.

At high enough speeds even offloads prefer frames not be tiny, just because they can. Even most data center 100G+ switching is not line rate below 256-byte packets, and the only thing it's designed to do is transport packets between ports as quickly as it can without really inspecting them. Also, encapsulations/overlays make the header overhead ratio even greater across network-to-network links in the path.

In the end though most places can do without jumbo without a noticeable impact. There can be buffer scheduling problems with ridiculously large frames as well.


Disregarding everything else, you still lose a minor amount of performance to the IFG (inter-frame gap) and headers that need to be repeated. The offloads on the NIC — and switches/routers in between — also consume some small but non-zero amount of power per frame to do their work, which, if you multiply by the number of servers in a Cloud DC…



