Good question. I'm not sure how suitable this would be if you then want to do statistical analysis on what remains. You'd likely want to aggregate at source instead, so that you're considering all of the data and only sending up aggregates to save on space/bandwidth (if you were at the sort of scale that would require that).
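To make "aggregate at source" a bit more concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `(timestamp, value)` event shape, the 60-second window, and the names `Aggregate` and `aggregate_by_window` are all hypothetical, not something from the post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Running summary of every value seen in one time window."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        # Avoid dividing by zero for a window that received no values.
        return self.total / self.count if self.count else 0.0


def aggregate_by_window(events, window_seconds=60):
    """Group (timestamp_seconds, value) pairs into fixed windows.

    Every event contributes to its window's summary, so nothing is
    dropped before the statistics are computed. Only the small
    per-window summaries would need to be sent upstream.
    """
    windows = defaultdict(Aggregate)
    for timestamp, value in events:
        windows[int(timestamp // window_seconds)].add(value)
    return dict(windows)


# Hypothetical events: (seconds since start, request latency in ms)
events = [(0, 1.0), (30, 3.0), (61, 10.0)]
summary = aggregate_by_window(events)
```

The trade-off is that you have to decide up front which aggregates you care about, since the raw events are gone once the window is summarised.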

The use-case I chose in the post was focused more on protecting some centralised service, while making sure that when you do throw things away, you're not doing it in a way that creates blind spots. For example, you pick a rate limit of N per minute, your traffic is inherently bursty around the top of the minute, and you never see logs for anything in the tail end of the minute.
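As a rough illustration of that blind spot, here is a minimal sketch comparing a naive "keep the first N" limit with reservoir sampling (the classic Algorithm R). The log data is made up, and `first_n` and `reservoir_sample` are hypothetical helper names, not code from the post.

```python
import random


def first_n(stream, n):
    """Naive rate limit: keep the first n items, drop everything after."""
    kept = []
    for item in stream:
        if len(kept) < n:
            kept.append(item)
    return kept


def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of k items from a
    stream whose length is not known in advance."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical minute of logs, each entry being the second it arrived in:
# a burst of 500 at the top of the minute, then one log per second.
logs = [0] * 500 + list(range(1, 60))

limited = first_n(logs, 50)
sampled = reservoir_sample(logs, 50, rng=random.Random(42))
```

With this data, `limited` only ever contains logs from second 0, whereas every log in the minute has the same chance of ending up in `sampled`, so the tail end of the minute is no longer systematically invisible.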

A fun recent use-case you might have seen was in https://onemillionchessboards.com. Nolen uses reservoir sampling to maintain a list of boards with recent activity. I believe he is in the process of doing a technical write-up that'll go into more detail.