There are quite a few significant things you've missed that should have been included, maybe one for part two:
* Network ACLs, which describe the ruleset (think of them as a stateless firewall) for subnets and their respective routes. Whilst they are optional, having a default set straightens out a lot of duplication that may otherwise end up in Security Groups (which are stateful, unlike ACLs) - there's a sketch of a stateless rule pair right after this list.
* Elastic (public) IPs. NAT instances/gateways require their use, and there is a dance to be done around allocating them in the account and attaching them to instance interfaces (also sketched below).
* IPv6 components. Egress-only Internet Gateways operate differently to IGWs: as there is no NAT, they need a route applied across all subnets, both public and private. An IPv6 CIDR association allocates the VPC a /56 (and thus each subnet gets a /64, and each instance's interface gets a /128, which is bananas, but IPv6 is a second-class citizen on AWS). Finally, the subnets need updating so automatic IPv6 address assignment happens. The whole sequence is sketched below.
* VPC endpoints - these are broken into two types: the older gateway type, which supports S3/DynamoDB and effectively allows traffic in a public/private subnet to bypass NAT. Enabling these can bring significant advantages in access and throughput (gateway endpoint sketch below). The newer "PrivateLink" interface endpoints are different and have pricing costs associated with them.
* DNS and DHCP: it's a rule in the VPC that the delegated resolver lives at ".2" of the VPC's CIDR, and it operates split-horizon - EC2 hostnames resolved by instances inside the VPC get the private VPC CIDR address, not any Elastic IP (the ".2" arithmetic is in the last sketch below).
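To make the stateless point concrete, here's a rough boto3 sketch (the ACL ID and CIDRs are made up): because an ACL doesn't track connections, an inbound allow needs a matching outbound allow for the ephemeral return ports, which is exactly the duplication a stateful Security Group spares you.

```python
import boto3

ec2 = boto3.client("ec2")
nacl_id = "acl-0123456789abcdef0"  # hypothetical ACL ID

# Inbound: allow HTTPS into the subnet.
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id, RuleNumber=100, Protocol="6",  # 6 = TCP
    RuleAction="allow", Egress=False,
    CidrBlock="0.0.0.0/0", PortRange={"From": 443, "To": 443},
)

# Outbound: the stateless half. Return traffic on ephemeral ports
# must be allowed explicitly; a stateful Security Group wouldn't need this.
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id, RuleNumber=100, Protocol="6",
    RuleAction="allow", Egress=True,
    CidrBlock="0.0.0.0/0", PortRange={"From": 1024, "To": 65535},
)
```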
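The EIP dance is two steps: allocate the address into the account, then associate it with an interface. A minimal boto3 sketch, with a hypothetical ENI ID:

```python
import boto3

ec2 = boto3.client("ec2")

# Step 1: allocate an Elastic IP into the account, scoped for VPC use.
alloc = ec2.allocate_address(Domain="vpc")

# Step 2: attach it to a specific instance network interface.
ec2.associate_address(
    AllocationId=alloc["AllocationId"],
    NetworkInterfaceId="eni-0123456789abcdef0",  # hypothetical ENI
)
```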
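The IPv6 sequence, in the order described above, as a hedged boto3 sketch (all IDs hypothetical):

```python
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"        # hypothetical
rtb_id = "rtb-0123456789abcdef0"        # hypothetical private route table
subnet_id = "subnet-0123456789abcdef0"  # hypothetical

# 1. Associate an Amazon-provided /56 with the VPC
#    (each subnet then needs a /64 via associate_subnet_cidr_block).
ec2.associate_vpc_cidr_block(VpcId=vpc_id, AmazonProvidedIpv6CidrBlock=True)

# 2. Egress-only IGW: there's no NAT for IPv6, so even private
#    subnets route ::/0 through it.
eigw = ec2.create_egress_only_internet_gateway(VpcId=vpc_id)
ec2.create_route(
    RouteTableId=rtb_id,
    DestinationIpv6CidrBlock="::/0",
    EgressOnlyInternetGatewayId=eigw["EgressOnlyInternetGateway"][
        "EgressOnlyInternetGatewayId"
    ],
)

# 3. Turn on automatic IPv6 address assignment for the subnet.
ec2.modify_subnet_attribute(
    SubnetId=subnet_id, AssignIpv6AddressOnCreation={"Value": True},
)
```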
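A gateway endpoint for S3 is a single call plus route-table associations; traffic from those subnets then reaches S3 without touching the NAT path. Region and IDs below are assumptions:

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint: injects a prefix-list route into the given route
# tables, so S3 traffic bypasses the NAT gateway (and its data charges).
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",             # hypothetical
    VpcEndpointType="Gateway",
    ServiceName="com.amazonaws.eu-west-1.s3",  # region assumed
    RouteTableIds=["rtb-0123456789abcdef0"],   # hypothetical
)
```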
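And the ".2" resolver is literally the VPC CIDR's base address plus two, which the standard library can show (no AWS calls involved, the CIDR is an example):

```python
import ipaddress

vpc_cidr = ipaddress.ip_network("10.0.0.0/16")  # example VPC CIDR

# The VPC-provided resolver sits at the network address + 2.
resolver = vpc_cidr.network_address + 2
print(resolver)  # 10.0.0.2
```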
Network ACLs have been pointed out as missing from this before, but quite a few people said they were right not to be included. I didn't put them in because I've never used them, so they didn't fall under 'need to know' from my perspective.
IPv6 is another point of contention but again it's not something I've ever used and so, apart from any other controversies with it ("...IPv6 which is only marginally better than IPv4 and which offers no tangible benefit...", https://varnish-cache.org/docs/trunk/phk/http20.html), I'm not qualified to write about it.
EIPs and ENIs should probably have been in there, but I don't tend to use those that often either, so they didn't occur to me.
I'm not sure that VPC Gateways, DNS or DHCP are necessarily need-to-know things either. VPC Gateways are for a specific routing optimisation which not everyone is going to need. I didn't know the details of the DNS setup for a VPC, so thank you for that.
Thank you for the feedback - I really appreciate you taking the time.
I would also add VPC PrivateLinks to the list, which let you establish private connections between systems in different VPCs without having to either peer them or connect them in other ways. PrivateLinks allow you to relieve the pressure that you might otherwise feel to build a lot of systems in the same VPC.
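For concreteness, a hedged boto3 sketch of both halves of a PrivateLink connection; every ARN, ID and name here is hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# Provider side: expose a service fronted by a Network Load Balancer.
svc = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=[
        # hypothetical NLB ARN
        "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/net/app/abc123",
    ],
    AcceptanceRequired=False,
)
service_name = svc["ServiceConfiguration"]["ServiceName"]

# Consumer side, in a different VPC: an interface endpoint to that
# service - no peering, no shared route tables.
ec2.create_vpc_endpoint(
    VpcId="vpc-0fedcba9876543210",              # hypothetical consumer VPC
    VpcEndpointType="Interface",
    ServiceName=service_name,
    SubnetIds=["subnet-0123456789abcdef0"],     # hypothetical
    SecurityGroupIds=["sg-0123456789abcdef0"],  # hypothetical
)
```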
Another useful concept (not VPC-specific) is using the Infrastructure-as-Code paradigm (e.g., CloudFormation, Terraform) to capture all of your networking configuration in source control, along with who made any changes and the reasons or design documentation for them.
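As a taste of what that looks like, a minimal AWS CDK v2 sketch in Python; the stack name and CIDR are illustrative, not prescriptive:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class NetworkStack(Stack):
    """All VPC, subnet and routing decisions live here, reviewable in git."""

    def __init__(self, scope: Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)
        # Two-AZ VPC with public and private subnets; CDK synthesises
        # the IGW, NAT gateways and route tables behind the scenes.
        ec2.Vpc(
            self, "AppVpc",
            ip_addresses=ec2.IpAddresses.cidr("10.0.0.0/16"),
            max_azs=2,
        )

app = App()
NetworkStack(app, "network")
app.synth()
```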
> Network ACLs [...] Whilst they are optional, having a default set straightens out a lot of duplication that may otherwise end up in Security Groups (which are stateful, unlike ACLs).
I inherited an infrastructure that had NACLs and security groups with duplicate entries and policies, years of accumulated cruft, because it was poorly designed and the documentation was even worse (read: nonexistent), security groups all the way down. That one threw me for a hard and annoying mental loop for a couple of hours, until picking through it with a fine-toothed comb revealed what was going on.
The fun part is going to be rebuilding our routing in a new VPC such that it doesn't make the next guy want to put his head in a black hole.
I'd be lying if I said it wasn't a fun challenge in a sordid kind of way, though.
I guess it's a matter of preference, but I strongly prefer security groups over ACLs, which I don't use at all. Even if only from a compliance perspective, a security group is equivalent to a host firewall (which personally helps me with PCI - no need for iptables and Windows Firewall), whereas an ACL is a bit harder to make that case for. I also find security groups easier to audit.
I like using ACLs for my coarse-grained "this subnet is allowed to talk to this subnet" rules, and security groups for everything finer-grained. Maybe I'm over-cautious, but I don't want one rogue security group opening up a tunnel to sensitive subnets.
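To illustrate that coarse layer, a boto3 sketch that admits only one trusted subnet's CIDR into a sensitive subnet's ACL; everything else falls through to the ACL's implicit deny regardless of what any security group says (IDs and CIDRs are made up):

```python
import boto3

ec2 = boto3.client("ec2")

# Coarse rule: only the app subnet may talk to the sensitive subnet.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0aaaaaaaaaaaaaaa0",  # hypothetical, sensitive subnet's ACL
    RuleNumber=100,
    Protocol="-1",            # all protocols
    RuleAction="allow",
    Egress=False,
    CidrBlock="10.0.1.0/24",  # the trusted app subnet
)
# Anything else hits the default deny (rule *), even if a rogue
# security group somewhere allows it.
```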
Yes, this is one of the best reasons to use network ACLs. (You can also achieve this with routes)
I think the idea is that separate teams with different responsibilities can manage the two different layers. Your app team may manage the security groups but the security team manages network ACLs which limit what can go into or come out of a subnet.
I'm slightly inclined to agree; it's one of those YMMV scenarios. What happened to me was some unholy combination of both going on, duplicating each other, in some cases weaving in and out of each other, with some bastard Frankenstein topology of route tables to nowhere...
Those were frightening times. Entire services would fall over, dogs and cats living together...
How do folks use Network ACLs? I haven't used them personally, relying more on security groups and segmenting subnets to specific tasks (e.g. attached to a public network via an IGW, or a private network only).
Network ACLs are quite tricky to debug. For one of my connections, network calls were failing because ESP was blocked at the ACL layer; the ACL blocked all non-TCP traffic by default. Funnily, network calls within the same data center were working but failed when calling another data center. I had to look at VPC flow logs to figure out that non-TCP protocols were being blocked.
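If anyone hits the same thing: the fix is an explicit ACL entry for IP protocol 50 (ESP), and flow logs on REJECTs are what make the dropped protocol visible. A boto3 sketch, IDs and names hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# ESP (IP protocol 50) isn't TCP, so it needs its own allow entries,
# in both directions since ACLs are stateless.
for egress in (False, True):
    ec2.create_network_acl_entry(
        NetworkAclId="acl-0123456789abcdef0",  # hypothetical
        RuleNumber=90, Protocol="50",          # 50 = ESP
        RuleAction="allow", Egress=egress,
        CidrBlock="0.0.0.0/0",
    )

# Flow logs of rejected traffic reveal which protocol is being dropped.
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],  # hypothetical
    TrafficType="REJECT",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="vpc-rejects",             # hypothetical log group
    DeliverLogsPermissionArn=(
        "arn:aws:iam::123456789012:role/flow-logs"  # hypothetical role
    ),
)
```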
Let me summarize further.
If you come from 20 years of application development and network design/administration in 'real' LAN and IGRP networks with 'real' hardware, you are going to be learning everything again.
These cloud end user environments are fake eggs and saccharine sweetener.