TechEmpower is (historically) a JVM shop. No surprise they do not bench the aspe...

chrismorgan · on Oct 12, 2019

It’s rather hard to meaningfully measure any of these figures.

Startup you can measure, but the optimal tuning for faster general operation may start up slower, so now you might want both figures. And then you’re getting perilously close to development considerations, so once you’re doing that surely you should benchmark any required compilation, and so on.

Warmup is even harder to handle well, because it introduces another dimension; even if you were presenting just a single-dimensional figure (which you’re not quite), you’re now wanting to consider how that varies over time while warming up, so now it’s at least two-dimensional. But unless warmup takes ages, you’ll find it hard to get statistically significant figures, because you have so many fewer samples in each time slice.

Total installed size? Problematic because the number isn’t meaningfully comparable, as this is a minimal test, and a number in that scope gives no indication of how much is due to the environment (e.g. JRE, CPython, &c.), and how much further growth there will be as you add more features. People will also start arguing about what should count; if for example you count the delta from a given base OS installation, then you’re providing an advantage to something that uses the system version of, say, Python, and penalising something that requires a different and extra version of Python.

Memory usage? Suppose you pick two figures: peak memory usage, and idle memory usage after the tests. Both are easy to measure, but neither is particularly useful. Idle memory usage, similar problem to disk usage, and so the numbers aren’t usefully comparable. Peak memory usage is perhaps surprisingly the more useless of the two, because the increase in memory usage is strongly correlated with how many requests are being served at once—and so a slower contestant might artificially use less memory than a faster one; and so if you wanted those numbers to be comparable, you’d need to throttle requests to the lowest common denominator, which could then be argued as penalising light memory usage patterns.