Hmm… it seemed to run at line speed on our machines. I’m also not sure where you’re getting two SHA-256 calls from. One is to derive the key (which is SHA-256 over a very small amount of data), and the second is?
HMAC requires two calls to the underlying hash function. In this case one is with a block-size input and the other is smaller (key size plus output of the first call). When called per-block this approach is much slower than any modern AEAD (which typically requires simple polynomial math on each block plus a single AES/ChaCha/whatever finalization call).
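To make the two calls concrete, here is a minimal sketch of the standard HMAC construction (RFC 2104) written out by hand with Python's `hashlib`. The function name `hmac_sha256_by_hand` and the sample key and message are only illustrative; the result is checked against the standard library's `hmac` module.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 input block size in bytes


def hmac_sha256_by_hand(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 written out as its two underlying hash calls."""
    # Keys longer than one block are hashed first, then zero-padded to a full block.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)

    # Call 1: inner hash over (key XOR ipad) || message.
    inner = hashlib.sha256(ipad + message).digest()
    # Call 2: outer hash over (key XOR opad) || inner digest.
    return hashlib.sha256(opad + inner).digest()


# Illustrative inputs: a 32-byte key and a single 64-byte block of data.
key = b"k" * 32
message = b"m" * 64
assert hmac_sha256_by_hand(key, message) == hmac.new(key, message, hashlib.sha256).digest()
```

The outer call always hashes a 64-byte padded key plus a 32-byte digest, so its cost is fixed no matter how small the message is. That fixed cost is what adds up when HMAC is invoked once per block.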
It might be “fast enough for line rate” in your situation but even then you could be saving CPU cycles for other work by using a more efficient construction.
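A rough way to see the per-call overhead on a given machine is to time one HMAC call per block against a single HMAC call over the same data. This is only a sketch under assumptions: the 64-byte `CHUNK` size, the helper names, and the 1 MiB test buffer are all hypothetical, and it does not reproduce the scheme discussed above. Python's interpreter overhead also inflates the per-chunk figure, so treat the numbers as a direction to investigate rather than a benchmark.

```python
import hashlib
import hmac
import time

CHUNK = 64  # hypothetical per-block size in bytes (assumption)


def mac_per_chunk(key: bytes, data: bytes) -> int:
    """Run one HMAC-SHA256 per CHUNK-sized block; return the number of calls."""
    calls = 0
    for offset in range(0, len(data), CHUNK):
        hmac.new(key, data[offset:offset + CHUNK], hashlib.sha256).digest()
        calls += 1
    return calls


def mac_whole(key: bytes, data: bytes) -> int:
    """Run a single HMAC-SHA256 over the whole buffer; return the number of calls."""
    hmac.new(key, data, hashlib.sha256).digest()
    return 1


def seconds_for(fn, key: bytes, data: bytes) -> float:
    """Wall-clock seconds for one invocation of fn(key, data)."""
    start = time.perf_counter()
    fn(key, data)
    return time.perf_counter() - start


key = bytes(32)        # all-zero key, for timing only
data = bytes(1 << 20)  # 1 MiB of zeros, for timing only
per_chunk_s = seconds_for(mac_per_chunk, key, data)
whole_s = seconds_for(mac_whole, key, data)
print(f"per-chunk: {per_chunk_s:.4f}s  whole-buffer: {whole_s:.4f}s")
```

Comparing CPU time rather than just whether the link stays saturated is the point here: a construction can keep up with the network and still cost more cycles than an alternative.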
As in, it's processing at the speed that data can be fed into the CPU. This particular use case was files coming in from the network on 10 Gbps hardware, but that was about the speed the AES HW ran at in openssl perf tests. The number of sessions and the message sizes are irrelevant. Hardware was an AMD EPYC 7642.
If there’s no HW that demonstrates a speed difference, then maybe the theoretical CS concerns aren’t properly modeled? Also, a big strength of the approach I outlined is that there’s no nonce to mismanage.