
Yeah, network scaling bugs are the most fun. The one I liked the most was when, after we expanded a pool of servers, they started to lose connectivity for a few minutes and then come back as if nothing had happened.

Turns out we accidentally stretched one server VLAN too wide, to roughly 600 devices in a single VLAN on a single switch. The servers had more-or-less all-to-all traffic, and that was enough to generate so many ARP requests and replies that the switch's supervisor policer started dropping them at random, and after ten failed retries for one server the switch just gave up and dropped it from the ARP table.

Of course the control plane policer is global for the switch, so every device connected to the switch was susceptible, not just the ones in the overextended VLAN.
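For a rough feel of why this blows up, here is a back-of-envelope sketch in Python. Only the 600 devices and the ten retries come from the story above. The 60-second refresh interval, the 1000 pps policer limit, the assumption that the policer drops packets uniformly at random, and all function names are made up for illustration, so treat the numbers as a toy model rather than what any particular switch does.

```python
def arp_request_rate(hosts: int, refresh_s: float) -> float:
    """Offered ARP requests per second, assuming every host re-resolves
    every other host once per refresh interval (all-to-all traffic)."""
    return hosts * (hosts - 1) / refresh_s


def drop_probability(offered_pps: float, policer_pps: float) -> float:
    """Fraction of packets dropped by a policer that passes at most
    policer_pps and drops the excess uniformly at random (assumption)."""
    if offered_pps <= policer_pps:
        return 0.0
    return 1.0 - policer_pps / offered_pps


def all_retries_lost(drop_p: float, retries: int = 10) -> float:
    """Chance that every one of `retries` independent attempts is dropped,
    which is when the entry would fall out of the ARP table."""
    return drop_p ** retries


if __name__ == "__main__":
    hosts = 600            # from the story
    refresh_s = 60.0       # hypothetical ARP cache refresh interval
    policer_pps = 1000.0   # hypothetical control plane policer limit

    offered = arp_request_rate(hosts, refresh_s)
    p_drop = drop_probability(offered, policer_pps)
    p_evict = all_retries_lost(p_drop, retries=10)

    print(f"offered ARP rate: {offered:.0f} pps")
    print(f"per-packet drop probability: {p_drop:.3f}")
    print(f"chance all 10 retries are lost: {p_evict:.3f}")
```

The point of the sketch is the n*(n-1) term: the offered rate grows quadratically with the number of hosts in the VLAN, while the policer limit stays fixed, so a pool that was fine at a couple of hundred hosts can tip over well before it reaches 600.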



VLANs are a convenience that is the enemy of performance and understandability.


They're a great alternative to having to go down the datacentre mines and replug a few thousand cables, though.



