I think the difference is in how people define "async I/O". When people say "Go doesn't use async I/O", what that really expands to is: "While Go can use epoll[0], it doesn't abstract over epoll optimally." I.e., there is a difference between "zero-overhead async I/O" and goroutines.
I'm not an expert but I'll try to describe why goroutines would have overhead (Someone correct me):
The poster children of async I/O are Nginx and Redis. I'm probably simplifying, but these are the basics of how they work: When using epoll directly, the optimal way to store state per connection is a state machine. The state machine is usually some C-like struct which the compiler can give a fixed size. Each state machine is then constructed with X memory, and is unable to grow. In theory (I don't believe anyone actually does this), if the program was comfortable with some fixed connection limit, you could fit all of these state machines on the program's stack, and require no heap allocations.
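A minimal sketch of that pattern, written in Go for illustration. The state names, buffer size, and transitions are hypothetical, not taken from Nginx or Redis; the point is only that the per-connection struct has a fixed, compiler-known size and can be preallocated in one flat table:

```go
package main

import "fmt"

// connState enumerates where a connection is in its lifecycle.
type connState int

const (
	readingRequest connState = iota
	writingResponse
	closed
)

// conn has a fixed layout: the compiler knows its exact size, so a
// server can preallocate an array of them (one slot per fd) and never
// heap-allocate per connection. The buffer cannot grow.
type conn struct {
	fd    int
	state connState
	buf   [4096]byte // fixed I/O buffer
	used  int
}

// advance is a toy transition: each epoll readiness event would move
// the machine one step forward instead of blocking a thread.
func (c *conn) advance() {
	switch c.state {
	case readingRequest:
		c.state = writingResponse
	case writingResponse:
		c.state = closed
	}
}

func main() {
	// Preallocated table: all connection state lives in one flat block.
	var conns [1024]conn
	c := &conns[42]
	c.advance()
	c.advance()
	fmt.Println(c.state == closed)
}
```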
Meanwhile, Go's goroutines have stacks which can grow. Each goroutine has some max size and some initial size, both of which are pre-set by the Go team (I'm sure they are configurable). Since the stack can grow, it has to be heap allocated, and needs to be either segmented or copied when it needs more space[1]. Additionally, there is "internal fragmentation": a growable stack needs to be consistently overallocated, which is a "waste" of memory.
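You can see the growth happen with a toy program: the recursion below burns roughly a kilobyte of stack per frame, so it quickly exceeds the goroutine's small initial stack and forces the runtime to move the stack to a larger heap block behind the scenes (the frame size and depth here are arbitrary; they just need to exceed the initial allocation):

```go
package main

import "fmt"

// grow recurses with a ~1 KB local per frame, forcing the goroutine's
// stack past its small initial size; the Go runtime transparently
// copies it to a bigger heap-allocated block as it grows.
func grow(n int) int {
	var pad [1024]byte // ~1 KB of stack per frame
	pad[0] = byte(n)
	if n == 0 {
		return int(pad[0])
	}
	return grow(n-1) + int(pad[0])
}

func main() {
	done := make(chan int)
	go func() {
		// ~100 frames => ~100 KB of stack, far past the initial allocation.
		done <- grow(100)
	}()
	fmt.Println(<-done) // sum of 0..100
}
```

The program works identically whether or not the stack grew; that transparency is exactly the convenience (and the overhead) being described.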
Very quick Googling suggests that Lua has growable stacks as well.
[0] FWIW: Go could use M:N scheduling of kernel threads to achieve goroutines, which is another reason why saying goroutines are async I/O could be incorrect. I don't know how it's actually implemented.
[1] https://blog.cloudflare.com/how-stacks-are-handled-in-go/