Oh, I was hoping this would be something built more directly over Ethernet, rather than on top of UDP/IP (if I'm understanding the layer diagram correctly).
I've been working with Ethernet devices a lot lately, using the network as a communication bus, essentially. I find that there's a lot of complexity that we simply don't need: ARP, DHCP, DNS... So many points of failure. We know all the devices on our LAN and their unique MAC addresses, and could do everything we need addressing-wise at Layer 2. But everything's built on Layer 3 and up, so we're effectively working backward to map devices to IP addresses and vice versa. It's unsatisfying.
1. Your forwarding table would have to be larger because Ethernet uses exact match instead of longest prefix match. For example, you might be limited to 128K servers total while Google has millions.
2. The Ethernet header has less entropy for ECMP than a UDP/IP header. Maybe you could add entropy somewhere but ASICs may not support it.
3. You're breaking compatibility with... everything. Maybe Google could afford this but no one else could.
1. Especially in the datacenter. When you add VMs to the mix you get LOADS of devices to address. Add on top that a single device has multiple connections (management, internet, storage, etc.), and you'd run out of capacity almost instantly.
2. The point is to find some bytes that are constant for a given logical stream of related packets. Taking bytes from outside the header means taking bytes from the payload, which by definition isn't constant across a flow. That's why everything identifies flows using the IPs + ports + protocol.
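To make that concrete, here's a minimal sketch of 5-tuple flow hashing in Python; hashlib stands in for whatever hash the ASIC actually implements:

    import hashlib

    def ecmp_path(src_ip, dst_ip, proto, sport, dport, num_paths):
        # Hash the 5-tuple so every packet of a flow takes the same path.
        # Payload bytes would break this: they differ packet to packet.
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_paths

    print(ecmp_path("10.0.0.1", "10.0.0.2", 17, 40000, 4791, 8))  # stable per flow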
Okay, but my point there was that saying "Google has millions of servers" isn't relevant; we're not looking at the entire company.
Even with a few addresses per VM, how many racks do you need to put into the same shared-compute mass? One data center is the upper limit, but it doesn't have to be the entire data center.
Back-of-the-envelope math, using modern hypervisors that can fit loads of VMs in a single 1U server:
- Let's put 500 VMs on a single one. With 128c/256t CPUs it's easy.
- Say you can fit 30 of those in a single rack (the common rack is 42U) due to power constraints.
- And place 10 of those racks.
That's 500 x 30 x 10 = 150,000 nodes to address. With 10 racks you already blow past the MAC address scaling limits of the common datacenter switch. Here are the limits for Cisco's Nexus 9000 series, a very common datacenter switch: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/ne...
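For what it's worth, a quick sanity check of that arithmetic against a hypothetical MAC-table capacity (the real limits vary by Nexus model; see the linked doc):

    vms_per_server = 500
    servers_per_rack = 30
    racks = 10
    nodes = vms_per_server * servers_per_rack * racks
    print(nodes)  # 150000
    # Hypothetical table size for illustration; real figures are per-model.
    mac_table_capacity = 90_000
    print(nodes > mac_table_capacity)  # True: the MAC table overflows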
Plus, layer 2 switches flood when they don't know the destination port. With so many hosts that would be absolutely horrific.
I've seen computers at moderately sized LAN parties (talking 100 nodes, far from the large or even massive events) that were literally crippled by the broadcast traffic. At some point the flooding and layer 2 discovery (ARP) would do the same as well.
Limiting broadcast domains with layer 3 is really what makes the Internet possible. Sure, you can have less overhead and do layer 2 only, and it really is completely possible. It's just such a rare use case that in practice it isn't important enough to actually do.
I think in High-Performance Computing a lot of the networking NICs behave in that way, because you are 100% sure that you know the fabric layout.
Stuff like InfiniBand, HPE Slingshot, Atos BXI, ...
There is a consortium that's building a specification for those kinds of things: https://ultraethernet.org/
There are systems like this: Fibre Channel has its own data link layer; they actually do reliability at the data link layer! I think InfiniBand is similar in this respect.
Actually, it's interesting that Google didn't choose any of these for their high-bandwidth storage needs. They have the money to do their own thing, but why should they?
Infiniband and most other specialized protocols have data-link-layer reliable messaging. The downside of this is that a congested switch cannot drop packets, so you end up backpressuring your network to death unless the people writing software really know what they're doing. Google was not able to make this work at scale.
Google migrated away from IB years ago. IIUC, the failure modes (e.g., a fully locked-up fabric) were too painful at the time, and they preferred to work with a mostly vanilla Linux kernel for userspace networking.
Ah yes, instead of going to google.com or 192.168.1.1 or adding a printer connected to my Wi-Fi, let me open my big yellow pages of globally unique MAC addresses…
How do networks manage the larger number of IPv6 addresses?
My cursory digging indicates that the secret sauce is to grant large IPv6 prefixes and delegate routing to the prefix. An informative-looking Reddit comment says there are 100k IPv6 prefixes (as of Oct. 2020), and each active route takes 1 KiB. [1]
So, IPv6 differs significantly from MAC addresses because you only need to track prefixes.
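If those numbers are roughly right, the total routing state is modest; a quick check:

    prefixes = 100_000       # active IPv6 prefixes (Oct. 2020, per the comment)
    bytes_per_route = 1024   # ~1 KiB per active route
    print(prefixes * bytes_per_route / 2**20)  # ~97.7 MiB of routing state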
Prefixes are routed. ARP, in IPv6, is replaced by a function of the Neighbor Discovery Protocol (NDP), Neighbor Advertisement (NA) and Neighbor Solicitation (NS), whose messages carry the L2 MAC. NS messaging leverages multicast to reach the relevant set of hosts for discovery. Basically, a host sends a message to the solicited-node multicast address derived from the IPv6 address so that it can discover the corresponding MAC. So the flooding/broadcasting for MACs in v4 is replaced by a much more efficient L3-to-L2 lookup in v6.
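For the curious, that solicited-node multicast address is derived mechanically from the target IPv6 address (ff02::1:ff00:0/104 plus the low 24 bits, per RFC 4291); a minimal Python sketch:

    import ipaddress

    def solicited_node(addr: str) -> ipaddress.IPv6Address:
        low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
        base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
        return ipaddress.IPv6Address(base | low24)

    print(solicited_node("fe80::1234:56ff:fe78:9abc"))  # ff02::1:ff78:9abc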
If you're flexible enough to forgo UDP/IP, why not use infiniband instead of ethernet? That gets rid of all the complexity you mentioned but still gives you ordered streams
Infiniband is basically an Nvidia monopoly since they bought Mellanox, and the hyperscalers, who already chafe at Nvidia's GPU pricing power, don't like it one bit, which is why they are working so hard on getting rid of it.
I don't understand enough about niche high-performance interconnects to know if CXL is a viable alternative for Infiniband where Ethernet-based solutions have too much latency.
I agree, it's hard to see what real world use case would benefit from dumping UDP/IP while holding on to Ethernet, and not moving over to an Infiniband or similar solution.
That also can't be satisfied by one of the existing specialized solutions that another user mentioned, such as EtherCAT.
Why though? You can't route MAC because... ?
Because ipv4 provides a higher entropy address?
Because MAC is self-assigned and resolving duplicates would require a higher-level system?
or just because we don't use MAC addresses that way?
I'm certain there are reasons IP came to live alongside/on top of MAC, but saying you can't do multi-hop routing with it just isn't true. If all the technologies of the Internet were reset tomorrow, how might you design the perfect layer 2 addressing and routing system?
MACs are random. Given a MAC and a connection to a LAN, you can easily answer the question, "is there a station with that MAC here?". If it's not here, and you have a single gateway to another network, you can figure out that to talk to that MAC, you need to go over the gateway. And then things eventually go funny. We hit a network that talks to four others. It has no idea where to send the packet destined for that MAC. It could send it to all four (flooding). Then when a reply comes from one of them, remember that destination for next time. Remember for how long? Sending a packet to every destination will cause an exponential explosion of that packet throughout the network.
It works on small scales. We can stitch together a few LANs with ethernet switches. The switches initially forward everything to all ports, but learn where the MACs are so as to send frames only to ports where the destination MAC is known to be.
Ethernet switching won't scale to anywhere near the complexity of the Internet.
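That learn-or-flood behavior fits in a few lines; a minimal sketch (the frame fields and port numbers are hypothetical):

    # Minimal learning switch: learn source MACs, flood unknown destinations.
    table = {}

    def handle_frame(src_mac, dst_mac, in_port, all_ports):
        table[src_mac] = in_port  # learn which port the sender lives on
        out = table.get(dst_mac)
        if out is not None:
            return [out]                               # known: one port
        return [p for p in all_ports if p != in_port]  # unknown: flood

    print(handle_frame("aa:aa", "bb:bb", 1, [1, 2, 3]))  # [2, 3] -- flood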
You can't route MAC because there is no prefix matching - only exact matching. That's exactly why you need to "switch" them... and incidentally this is what your proposal accomplishes – it's equivalent to a fully-switched network. Switches (especially L3 switches) maintain port-MAC association tables to switch packets between ports and they're available off the shelf.
IP addresses have structure because a single ISP buys a contiguous block, like 123.234.*.*. A simple routing table sends that whole block to a single network port.
The table required for the whole Internet is large, but not gigabytes.
You can't route by MAC-address because it's effectively random. You'd have to store the port number for every device separately. This works fine at LAN scale, but not for the whole Internet.
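A tiny sketch of the difference, using Python's ipaddress module for the longest-prefix match (table contents made up):

    import ipaddress

    # IP: a couple of prefixes cover whole blocks of addresses.
    routes = {
        ipaddress.ip_network("123.234.0.0/16"): "port1",
        ipaddress.ip_network("0.0.0.0/0"): "port0",  # default route
    }

    def lookup(dst: str) -> str:
        addr = ipaddress.ip_address(dst)
        # Longest prefix match: the most specific matching network wins.
        best = max((n for n in routes if addr in n), key=lambda n: n.prefixlen)
        return routes[best]

    print(lookup("123.234.56.78"))  # port1, via a single /16 entry

    # MAC: no structure to exploit, so it's one exact-match entry per host.
    mac_table = {"0c:f9:31:d2:db:51": "port7"}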
MAC addresses being random is a historical accident (because of hardware limitations). Today we can define them in software, and just like we have link-local IP addresses, we could self-assign link-local MAC addresses.
And I think the self-assigning protocol in link-local could even go a step further. Instead of hard-coding a subnet, it could detect the subnet by copying the one from its nearest neighbor: start with a random address, talk to a neighbor to learn the subnet (and netmask) in use, and switch to a new address within that subnet. Then possibly run DHCP and update the address again. For static addresses, DHCP could identify hosts by a cryptographic host key (like the one for SSH).
When two subnets join, one of them may have to adjust its prefix. More complex, but still possible.
Subnet prefixes could still be assigned to organizations to avoid overlap on a global level.
I'm sure I'm missing some details, but I think in general this could work.
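The self-assignment part at least is straightforward today; a sketch of generating a random locally-administered MAC (the two flag bits are per IEEE 802, the rest is arbitrary):

    import random

    def random_local_mac() -> str:
        octets = [random.randint(0, 255) for _ in range(6)]
        # First octet: set the locally-administered bit (0x02) and
        # clear the multicast bit (0x01).
        octets[0] = (octets[0] | 0x02) & 0xFE
        return ":".join(f"{o:02x}" for o in octets)

    print(random_local_mac())  # e.g. 06:3f:a2:11:90:7c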
Well, it's merging MAC and IP into one address. There is no need for two if the MAC address can be assigned dynamically. And it's extending the auto-discovery of the address to work over larger networks. So it's not reinventing but simplifying things. (Or not; I'm not familiar enough with the details to be aware of other problems that could complicate things again.)
> You can't route by MAC-address because it's effectively random. You'd have to store the port number for every device separately. This works fine at LAN scale, but not for the whole Internet.
Not that I see any advantages to the approach but it's almost workable(?), if a little silly, at internet scale:
If every device had a 64-byte ID, guesstimating 10 billion people * 100 devices/head gets us a 'measly' 64 TB of storage. Doubling that to include routing info gets us to ~128 TB. A bit much to be practical, but not entirely insane either.
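Sanity-checking that guesstimate:

    devices = 10_000_000_000 * 100        # 10 billion people * 100 devices each
    id_bytes = 64
    print(devices * id_bytes / 1e12)      # 64.0 TB for the IDs alone
    print(devices * id_bytes * 2 / 1e12)  # 128.0 TB with routing info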
The router needs to remember where each address goes; with MAC addresses being random, there is no shortcut. DNS is distributed and you look it up one subdomain level at a time, and that can be cached. Same for IP: the router only needs to store the subnet for each destination, not all IP addresses.
A central lookup database for MAC addresses (which could be distributed by having separate servers for segments of the address space) doesn't make much sense, because the distance from a server to the location of the device is too great and would make updates expensive.
So the router has to remember each address used. But at least it would not have to store all addresses in existence. Actually, I think the storage needs are similar to those for NAT. Well, except backbone routers, which have to store a lot more.
The actual problem is the initial discovery of a MAC address. Where does the routing information for a MAC address come from?
You need some peer-finding protocol like a DHT, and those are slower.
Because aggregation, summarization and continents are a thing. Also... there are things which speak IP and don't use Ethernet for underlying communications, specifically in the network carrier and high performance optical space.
I used a MAC address generator to get those two, but I think two is enough to frame the discussion. Current reality aside, would you be able to identify, with binary math, whether those are on the same network device, on different network devices, or across the world? MAC addresses on physical NICs are provided by the manufacturer; sure, you can adjust them, but I think that leaves the good-faith portion of this discussion.
So if you wanted those two to communicate no matter what, you would have to have one network device state "I'm network device A, I have this device 0C:F9:31:D2:DB:51" and another state "I'm network device B, I have this device AB:33:C6:C6:19:74". Then whenever 0C:F9:31:D2:DB:51 wants to talk with AB:33:C6:C6:19:74, its network device will have to just send it to the next upstream network device. Or, if there are multiple network devices that could be upstream, you could send it to them all (which is just not great for security whatsoever), or you now have to do a recursive lookup for whatever n devices might yet be upstream and wait for a response to see if one of those has it. Overall, trying to send Ethernet frames globally without an IP network sounds like not a great idea.
So it seems like the primary use of IP, as you describe, is to define a way to narrow the search to sub address groups so as to not require enumerating every address in the scheme.
Still, there doesn't seem to be any reason you couldn't just say "device 1 gets MAC 00:00:00:00:00:01" and "device 2 gets 00:00:00:00:00:02", the gateway controller gets :::00, and there's a special address on :::FF that can be used to talk to everyone...
Is that it? Is that all there is to IP? A loose pattern for reducing search scope, a couple reserved addresses for special cases, and a balance between address bitsize and total number of unique addresses (without requiring additional routing complexity)?
You could. Assuming all your equipment supports setting the MAC, and you make sure to operate on prefixes so you can route by prefix. There's nothing stopping you from doing so.
The reason we don't is because at the time IP was introduced, there were many alternative physical layers in active use. And while Ethernet is near ubiquitous now, what we learnt from that was that it is unreasonable to assume that all your data will go over the same physical layer. And so you need a standard addressing format that will work elsewhere too.
Nothing stops you from stripping it back locally and using MAC addresses for everything internal to you, and ditching IP, and "just" gateway to/from IP. Lots of people did gateway between different protocols before IP became the dominant choice.
But you won't get everyone else to change because it'd require new firewall and new routers, and all kinds of software rewrites, and you can see how long the IPv6 transition has taken, so you'd still need to wrap and unwrap TCP/IP and find a way to address IP for everything that isn't 100% local, and even for lots of local-only stuff unless you want to rewrite everything.
There would be potential ways. E.g. you could certainly use a few bits to say "this is external" and then have some convention to pack an IPv4 address into the MAC or let an IPv6 address overflow into the data, and use that to make gatewaying and routing to external networks easier, while everything else just relies on the MAC. But you'd still need a protocol header for other things too, and then the question is how much benefit you would gain from ditching pretty much just ARP, which isn't exactly complex, a lookup table, and replacing the IPs in the header with just a destination MAC. Because the rest of the complexity is still there.
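As a toy illustration of that packing idea (the 02:01 prefix convention here is entirely made up):

    import ipaddress

    def ipv4_in_mac(ip: str) -> str:
        # Hypothetical convention: 0x02 = locally administered, 0x01 = a
        # made-up "this is external" flag, then the four IPv4 octets.
        octets = ipaddress.IPv4Address(ip).packed
        return "02:01:" + ":".join(f"{b:02x}" for b in octets)

    print(ipv4_in_mac("192.0.2.1"))  # 02:01:c0:00:02:01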
And you can gain most of the benefit of that by getting an IPv6 EUI64 address [1]. They'll work with "normal" IP equipment, and you can optimize in your own software by having the IP stack ditch ARP lookups when they see a local EUI64 address. Whether that optimisation actually makes a difference is another question.
Then you realize doing some action ends up being O(n^2) so you add some workaround in your switch and cache some things. And you know what they say about cache invalidation. And vendor A implemented it wrong in 1993 so you have a special case for their systems. And then you want to handle abuse cases. And authentication. And you're competing against the whole rest of the world and your thing isn't enough better.
Then how do you send traffic to device1 on another network? You need globally unique addresses and hierarchy. Go back to the drawing board and come back when you’ve ended up inventing a worse IP protocol.
> It all seems so... simple
Because you haven’t even thought through basic use cases.
MAC is just one way to identify ("address") directly connected/visible nodes on a network. Not all L2 technologies use MAC addresses.
- "Directly connected/visibile" means node X can contact node Y simply by throwing something on the medium (wire, radio, etc.) and doesn't have to knowingly send to a middleman (router).
When Ethernet was invented in the early 80's there were a lot more L2 technologies. Most are uncommon now (Frame Link DLCIs I think fall in this category, and PPP/dialup was common at one time - no MACs there) except for one: I don't think the cellular network uses MAC addresses at all. I could be wrong with newer 4G/5G stuff which overlaps with Wi-Fi in various places.
> I'm certain there are reasons IP came to live alongside/on top of MAC
There were different teams/universities working on what today we would call LAN and WAN. I forget the details and history (I'm sure someone here, who was involved, could chime in, hah) and might have this wrong, but the result is LAN networking is MAC based while WAN networking is IP based.
It's one of those accidents of history that things are just the way they are, and many don't question it. I run into it a lot when describing basic networking concepts or early Cisco material, when people ask _why_ both MACs and IP addresses exist, and it's just... not always the correct time to explain those details to them.
The fact that there used to be a lot of alternative lower level network layers is incidentally also the best argument for IP: We needed a common shared layer because it was shit trying to gateway between multiple different protocols that user-level software had to know about. And as much as ethernet is dominant now, it's still not the only thing.
It's been done. There's XNS, and there's QNX networking over raw Ethernet. Both worked well. Look them up. You probably don't want to go that route, but it is technically possible on a LAN.
There's also Audio-over-Ethernet, which is still used for professional audio where latency must be held low.
I believe there are other broadcast protocols that run on Ethernet for the same reason.
Fibre Channel-over-Ethernet used to be a thing too, but I haven't seen it in a while. Perhaps latency wasn't as much of an issue as people thought, and it lost to iSCSI.
I don't think it's necessarily a bad idea to run protocols directly on the data link layer; the fewer parts the better. It's just that somewhere, someone probably wants to route it, and the more general usage tends to win. So it's always going to be a niche market where latency is really important.
eCPRI, essentially radio signal over Ethernet (i.e. from the base station indoors to the radio module at the top of the tower/roof), is another interesting use case (with even stricter latency requirements than audio).
Move to IPv6 and drop ARP and DHCP, which eliminates a good chunk of the older cruft. IPv6 builds everything on top of multicast support, which is required for IPv6 switches/routers. It's so much cleaner. You could even avoid DNS if you really want.
Hmmm, so much of this looks like an attempt to solve the problems that were solved with Fibre Channel a couple decades back. Which I guess is standard NIH, with the advantage of not having to pay the FC consortium's 95% HW margins.
But still, you would think that some of those lessons could be learned before replacing it. FC routes IP as one of its many protocols, on top of lower levels that provide far more service guarantees than one normally gets with Ethernet. Much of the QoS/latency/etc. machinery was designed into FC from the beginning for use on storage area networks (SANs). It just never took off as an IP transport because it cost 10x as much as Ethernet, including a decade ago when these same groups tried to dump it on an Ethernet MAC, only to discover that it requires special switches which were $$$$ because of "enterprise markup", defeating the whole point of cheap Ethernet PHYs. See FCoE.
And yet today there is NVMe-oF on FC, which is what one runs when it's important that someone scp'ing a file on your network doesn't cause your database queries to slow down.
What I don't get is why OCP doesn't just actually build some of these adapters/etc with a "we won't be greedy" take and sell them not only to the hyperscalers but on the open market. That way someone could actually build say, a FC adapter that has a price similar to an ethernet adapter.
Maybe a better example would be Infiniband which is a simple and efficient protocol... but it's basically owned by Nvidia. For whatever reason Broadcom won't make Infiniband ASICs and Google doesn't want to be locked in to Nvidia so they have to use Ethernet.
"Hardware transport" is kind of a misnomer, because this is a networking protocol. It just happens to be a networking protocol that requires hardware acceleration on the NIC. We already had that in things like Infiniband and Omnipath, but Nvidia bought one and Intel rugpulled the other. Meanwhile ethernet has been approaching parity in terms of throughput, but TCP introduces unpleasant latency, so this is Google's NIH-flavored DIY on the topic. It's something of a rite of passage for a company to convince itself that building this sort of thing in-house is necessary, and that it will revolutionize high-performance computing in all the ways that previous, nearly-identical projects have not.
"The ecosystem" is The Open Compute Project [1], a trade association which mostly puts together quasi-standards that provide targets so computer manufacturers can produce bleeding-edge gear with some hope that it will be interoperable. An example OCP production is the newer 21" racks that are starting to appear in datacenters.
> It's something of a rite of passage for a company to convince itself that building this sort of thing in-house is necessary, and that it will revolutionize high-performance computing in all the ways that previous, nearly-identical projects have not.
The same happens with:
- databases
- encryption
- operating systems
- frameworks
- programming languages
I've seen this so many times by now it stopped being funny.
Are you suggesting that it would have been better for Google to use off-the-shelf databases etc? Because at their scale it seems clearly necessary to bring that in-house.
At Google's scale a lot of the ordinary limitations do not apply. Unfortunately many companies believe that because Google does it it must be good. It's cargo cult reasoning and the result is endless NIH projects.
As for Google's 'Falcon' project: it smacks of NIH to me, but maybe their use cases are specific enough that none of the off-the-shelf bits were usable.
There seems to be quite a lot of overlap between Falcon and what the Ultra Ethernet Consortium is ostensibly working on. As well as Amazon's Scalable Reliable Datagram (SRD) thing. All of them, in a way, are about addressing deficiencies in RoCEv2 for large scale latency sensitive networking that you see in HPC and DL training.
But none of these are things you can buy today. Well, there's InfiniBand, but if you're wed to Ethernet..
I think it's a mistake to assume that an organization this large and sophisticated simply failed to try RoCE. They probably gave it a go but it didn't work out for some technical or economic reason.
When you have enough scale you can claim that one particular way of doing things is better than the others, when in most cases it's just one way of doing things. This is what we see here.
Microsoft, Facebook and Twitter all use a monorepo too. It's not just Google.
Granted, Facebook have written their own VCS and Microsoft heavily modified Git to make it usable with monorepos (but only on Windows).
Unfortunately, stock Git is bad at both multirepos and monorepos. At "hundreds of people working full time on a project" scale, stock Git doesn't have a good answer.
To this day I still haven't seen a more sensible API for low-latency Ethernet than Exablaze (was the market leader in low-latency trading, then got bought by Cisco).
The only thing blocking these from becoming standard is that it means userland has direct control of hardware.
I'm confused by this because we've been using Falcon at work for over a year now, perhaps longer, as I just started a year ago. What are they making available that wasn't already?
I don’t understand networking all that well. Is it interesting that the telcos and non-tech companies are moving away from specialized hardware toward software defined networks while the hyperscalers are using hardware acceleration?
SDN just means reconfiguring things that used to be manually configured on-the-fly.
E.g. instead of your little server setting firewall rules locally, it tells the router what traffic to allow. That router, in turn, tells upstream about its needs, and so on. Or a server reports its load, and the routers do active load balancing. The hardware wires are still there, as always.
Re. hardware acceleration, I think the earliest form of this was moving the checksum computation [1] from the CPU to the network device, even though the networking device didn't really know about the protocol it was doing the checksum for.
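For reference, that checksum is the RFC 1071 ones'-complement sum; this is the whole computation the NIC took over from the CPU:

    def inet_checksum(data: bytes) -> int:
        # RFC 1071: ones'-complement sum of 16-bit big-endian words.
        if len(data) % 2:
            data += b"\x00"  # pad odd-length input
        total = sum(int.from_bytes(data[i:i + 2], "big")
                    for i in range(0, len(data), 2))
        while total >> 16:  # fold carries back into the low 16 bits
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    # IPv4 header with its checksum field zeroed; yields the classic 0xb861.
    hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    print(hex(inet_checksum(hdr)))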
In both, it's just about parallelizing workloads, the same we do with microservices at the upper levels of the stack. Natural progression of distributed systems, with fancy names attached.
I don't think it's that hard to rely on hardware development; it's more a problem of rolling out a fleet of new hardware.
It's just not realistic to take all the switches, routers, and other garbage you've got in between points in the network off the rack/ceiling/wall/pole because the hardware can't support some protocol.
Good evidence for this is the rollout of fiber, which has been happening neighborhood by neighborhood and house by house for a decade.
I was mainly talking about your own DC, but yes, if you need to traverse public infra it's a complete non-starter. But also, it's not like middleboxes offer you any sort of SDN API; you still need to overlay.
> Is it interesting that the telcos and non-tech companies are moving away from specialized hardware toward software defined networks while the hyperscalers are using hardware acceleration?
Their SDN implementations are also hardware accelerated.
These things are partially related - telcos running things in the cloud, in software, is actually running on top of these hardware innovations, it is just abstracted from them.
It sounds like this builds on top of Ethernet to provide a higher performance alternative to UDP/TCP, with some sort of hardware acceleration.
I may be in over my head since I’m not an HPC/datacenter expert, but not sure I understand how you’d use this on the software side. Maybe someone is aware of specific examples? (beyond the vague “HPC/AI”)
edit: as another comment mentioned, the diagram shows it’s on top of UDP/IP, so it’s mostly an alternative to TCP/IP
I normally like Google blog announcements, as they are usually heavy on technical details. But not this one. Quoting, the meat of it is:
> Fine-grained hardware-assisted round-trip time (RTT) measurements with flexible, per-flow hardware-enforced traffic shaping, and fast and accurate packet retransmissions, are combined with multipath-capable and PSP-encrypted Falcon connections ... flexible ordering semantics and graceful error handling ... hardware and software are co-designed to work together to help achieve the desired attributes of high message rate, low latency, and high bandwidth
So like QUIC, but designed for low latency. Maybe. There is no indication of how they achieve it, if that's what it is, nor is there a link to further details. The bulk of the article is literally name-dropping: protocol names, FAANG company names, standards organisation names. It reads like C-suite bait. "Come join us boys - all the big guys already have. So it's a sure winner."
I was confused by the reference to “lossy” networks in this page. Does this have a different meaning in this context than something like lossy compression where data is actually discarded?
Ethernet, unlike, say, Infiniband, doesn't promise things will get where you sent them just because it didn't error initially, so other protocols handle this at higher levels to notice the failure cases.*
For an example of what this means, try setting your MTU above the limit, and watch the raw traffic.
* - it's been years since I cared about the formal definition, my apologies if I got it wrong.
All networks are “lossy” because any cable can be cut, etc.
A “lossy” protocol is one that doesn’t attempt to compensate for that. In most cases, but not all, that means a higher level protocol will need to ensure that every bit of data has made it through. (An example protocol that might not care is one for watching broadcast TV on the Internet… if you miss a few seconds it’s not a big deal).
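A minimal sketch of that compensation one layer up; the sequence numbers and retransmit request are hypothetical, just to show the shape:

    # Receiver side of a made-up reliable layer over a lossy transport.
    expected_seq = 0

    def on_packet(seq: int) -> None:
        global expected_seq
        if seq > expected_seq:
            # A gap means the layer below dropped packets; ask again.
            print(f"lost {expected_seq}..{seq - 1}, requesting retransmit")
        expected_seq = max(expected_seq, seq + 1)

    for s in [0, 1, 4, 5]:  # packets 2 and 3 were dropped in transit
        on_packet(s)        # prints: lost 2..3, requesting retransmit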
I guarantee that there will eventually be a vaguely similar (but different!) stack published by each of: NetFlix, Microsoft, Amazon, and Apple. Just kidding, Apple won't publish anything.
The IT ecosystem has fragmented into mutually incompatible cliques. You are either in the Google ecosystem, the Amazon ecosystem, or some other one, but there are no more truly open and industry-wide standards.
Look at WebAuthN: it enables a mobile device from "any" vendor to sign on to web pages without a password. Great! Can I transfer secrets from an Apple iPhone to a Google Android phone? Yes? No? Hello? Anyone there?
I just got a new camera. It can take HDR still images, which look astonishingly good. Can I send that to an Apple device? Sure! Can I send it to a Google device? Err... not without transcoding it first... on a Microsoft Windows box. Can I send it to a mailing list of people with mixed-vendor devices? Ha-ha... no.
This is the best argument I've seen for splitting up the FAANGs + Microsoft + NVIDIA. Once they get to this behemoth trillion-dollar scale, they become nations unto themselves and no longer need to cooperate, no longer need to use any open standards at all, and can start dictating and pushing third parties around.
Another random example is HTTP/3, which is basically the "What's best for Google" protocol.
Or gRPC, which is "What Google needs in their data centre".
And now Falcon, which is "The transport Google needs for their workloads".
Does it work for anyone else? I don't know, but it's a certainty that Google doesn't care and never will, because they don't need to.
This is exaggerated to the point that I consider it fiction. BTW Google doesn't substantially use gRPC within their datacenters.
The industry has always been this way. Back in the day there were many, many processor ISAs that have now consolidated. There have been many networking standards that consolidated (IPX/SPX, anyone?). New things often diverge because of new requirements, not out of spite. There is a push and pull between standardization and innovation. That doesn't make it particularly unhealthy unless you can point to specific metrics and compare trends throughout the long arc of time.
> This is exaggerated to the point that I consider it fiction. BTW Google doesn't substantially use gRPC within their datacenters.
For explicitness: Google uses Stubby, which shares a lot of interface-level commonality with gRPC, but there are differences at the runtime level. Nobody is slinging JSON or SOAP around Google data centers.
At least I enjoy taking the plane and I decide when to take it. Planes serve a purpose. Adtech is just parasitic. An Internet without ads wouldn't need QUIC.
I don't understand what the objection is to the methodology. Are you claiming there is another party that had a better sample (even subjectively so), was pushing it, and didn't succeed? If anything, standards committees are often overly annoying and biased the other way, just because some other company representative wants to justify their presence. That is also cherry-picked: for example, standardizing ALPN over NPN, which is largely a downgrade for the average user, was done in the standards process. The examples you use simply indicate a certain company is ahead of the game in solving problems others also have.
Not sure why Netflix would be in your list. AFAIK, they run their cloudy stuff on AWS, which isn't too unusual; Chaos Monkey is neat though? Their CDN boxes are exotic because they run a lot of sessions at relatively pedestrian bandwidths, and a lot of times 5-20Mbps adds up to a huge number. There's real work there and it's impressive, but it doesn't need exotic network protocols. Bulk encryption offloading NICs are super handy for their use case, which is certainly somewhat exotic.
They haven't said much lately about sending content updates to their CDN nodes, but I think the throughput requirements on that aren't as high.
I agree: the corporates have stopped building and making things _for_ their users. They make their services so the users have no option but to get more and more comfortable in their ecosystem, and never get out.
Very soon, we will have providers/companies/champions/fighters that keep building the middleware transports to connect their behemoths.
Btw, someone somewhere came up with a better term: AGAMEMNON (Apple, Google, Amazon, Microsoft, Ebay, Meta, Nvidia, OpenAI, Netflix).
Because the whole point of the linked article is that they're making it part of the Open Compute Project, whose entire existence is devoted to making sure things are compatible with other things.
> The IT ecosystem has fragmented into mutually incompatible cliques. You are either in the Google ecosystem, the Amazon ecosystem, or some other one, but there are no more truly open and industry-wide standards.
This is one excellent example of the reason that increased/renewed anti-trust actions by the FTC are necessary.
There's no interesting distinction between a "native" transport protocol and a transport protocol running on top of a UDP shim. The UDP header is probably necessary for ECMP.
Maybe I'm misunderstanding what you're trying to say, but there are major differences between an Ethernet + IP transport and other transports like Fibre Channel (or even Token Ring, ATM, etc.), which have buffer crediting/flow control, retransmission, prioritization, etc. built into the lowest layers. Sure, you can build much of that higher in the stack, but it requires everything in the network to be playing the same game to assure QoS metrics, and if that's the case you don't really have a normal IP network anymore.
By transport protocol I mean layer 4. FC/TR/ATM/IB are (mostly) layer 2 protocols.
I think the idea is that Falcon assumes the underlying network is semi-crappy and works around that (e.g., Falcon assumes that packets arrive out of order and then puts them back in order).
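A toy version of that reordering step (heap-based; nothing here is Falcon-specific, just the general technique):

    import heapq

    buf, next_seq = [], 0

    def deliver(payload):  # stand-in for handing data up the stack
        print("delivered:", payload)

    def receive(seq, payload):
        global next_seq
        heapq.heappush(buf, (seq, payload))
        # Release the longest in-order run we now have.
        while buf and buf[0][0] == next_seq:
            deliver(heapq.heappop(buf)[1])
            next_seq += 1

    for seq, p in [(1, "b"), (0, "a"), (2, "c")]:  # network reordered them
        receive(seq, p)  # still delivers a, b, c in order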
OSI is an entire networking stack designed by committee that died from disuse [1]. The only thing we now remember of it are the functional layers within the protocol.
For example, people sometimes refer to TCP as a "Layer 4" protocol even though (a) TCP predates the invention of Layer 4 and (b) TCP is a square peg that does not exactly fit into the round hole that is Layer 4.
I wish we could forget the other remnants of OSI like X.509, ASN.1, and LDAP. Those were the things good enough to be used for real systems, and I'd still rather crawl over broken glass than implement any of them.
The diagram shows it layered on top of RDMA and NVM Express, and supporting UDP and IP. Unless you think the reverse makes sense. The diagram is just upside down.
RDMA is a low-level physical transport. You are saying that they are going to emulate RDMA on top of Falcon. Are they going to run Ethernet and then IP on top of that?
The diagram is confusing since it's upside down relative to the layering direction, but the article is clear. Falcon is a hardware transport protocol, replacing Ethernet. Like Ethernet, it runs on top of a physical transport like RDMA. And IP runs on top of Falcon and Ethernet.
RDMA stands for Remote Direct Memory Access. The version that runs on ethernet is known as RDMA over Converged Ethernet, or RoCE. Until recently, the most common hardware interconnect used to support RDMA was Infiniband. Omnipath existed for a while.
The point is: RDMA is just a term for a computer reading memory on another computer without involving the operating system. The interconnect hardware and physical transport must support RDMA but that doesn't mean it is RDMA, as there are several different implementations of RDMA, and each kind of hardware supports a different subset of those implementations.
Maybe, but often in practice the amount of support needed gets high. It needs some very involved people to keep support going, otherwise it will bit rot and fail to function after a time.