For languages like C, C++, and Rust, the bottleneck is mainly going to be system calls. With a big buffer on an old machine, I get about 1.5 GiB/s with C++; writing one character at a time, I get less than 1 MiB/s.
    #include <cassert>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>

    int main(int argc, char **argv) {
        assert(argc == 3);
        const unsigned int n = std::atoi(argv[1]); // bytes per write()
        const unsigned int k = std::atoi(argv[2]); // number of write() calls
        char *buf = new char[n];
        std::memset(buf, '1', n);
        auto start = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < k; i++) {
            ssize_t rv = write(1, buf, n); // write() returns ssize_t, not int
            assert(rv == ssize_t(n));
        }
        auto stop = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> secs = stop - start;
        std::fprintf(stderr, "buffer size: %u, num syscalls: %u, perf: %f MiB/s\n",
                     n, k, (double(n) * k) / (1024 * 1024) / secs.count());
    }
EDIT: Also note that a big write to a pipe (bigger than PIPE_BUF) may require multiple syscalls on the read side.
EDIT 2: Also, it appears that the kernel is smart enough not to copy anything when it's clear there is no need. When I don't go through cat, I get rates well above memory bandwidth, implying that no actual work is being done.
I suspect (but am not sure) that the shell may be doing something clever with a stream redirection (>) and giving your program a stdout file descriptor that points directly at /dev/null.
I may be wrong, though. Check with lsof or similar.
There's no special "no work" detection needed. a.out is calling the write function for the null device, which just returns without doing anything. No pipes are involved.