I'm not sure any of that matters. As long as the same thing was measured in the same way each time, I think the results are relevant.
For your first point, the measurement wasn't "usability"; it was how long it took for the load event to fire.
Second, having dev tools open should affect all results the same way. The absolute number isn't important; what matters is the relative ordering, which should be the same whether dev tools are open or not.
The rest of your points are just nitpicking. Maybe he didn't use the absolute best way to measure CPU or memory usage, but I'm willing to accept that they're good enough proxies for the results to be relevant.
It's not a good enough measurement - it's a ridiculously poor measurement. You're trying to measure how fast an adblocker is by measuring how long it takes... to load the ads.
How is that in any way helpful? If anything, the proper, comparable load time for the adblocked version is infinite - the ads never finish loading.
If you're going to run this comparison, you need to at least compare the same set of resources being loaded - after all, if you're willing to omit ads entirely, you clearly don't care how long it takes for them to load.
We don't need a perfect measurement, but this measurement is extremely biased. DOMContentLoaded would be better, but still far from good - you want to measure the load time of the resources the adblocker would not have blocked.
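To make that concrete, here's a rough Python sketch of the kind of comparison I mean. It assumes you've exported the page's Resource Timing entries as a list of dicts with "name" (the URL) and "responseEnd" (milliseconds since navigation start); the blocklist hosts and function names are made up for illustration, and this is not what the benchmark actually did.

```python
from urllib.parse import urlparse

# Hypothetical blocklist for illustration; a real comparison would apply
# the same filter list to every run, with or without a blocker installed.
BLOCKED_HOSTS = {"ads.example.com", "tracker.example.net"}


def is_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL's host is a blocked host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def comparable_load_time(entries, blocked_hosts=BLOCKED_HOSTS):
    """Latest responseEnd among resources the blocker would NOT have blocked.

    Entries are assumed to be dicts with "name" and "responseEnd" keys.
    Returns 0.0 if nothing is left after filtering.
    """
    kept = [
        e["responseEnd"]
        for e in entries
        if not is_blocked(e["name"], blocked_hosts)
    ]
    return max(kept, default=0.0)
```

Run it over the entries from the unblocked page and from each blocked page, and you're comparing the same set of resources every time instead of rewarding a blocker for whatever it skipped.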
The time saved by removing the ads is a useful number to know, IMO. I'm willing to give the benchmarker the benefit of the doubt that he used the same block lists for each blocker, although spelling that out would have been a good idea.