> If 1% of users can't send messages, then that should count as a full-blown outage and should start counting against whatever SLA they advertise.

Google published a paper last year describing this approach to measuring uptime: https://blog.acolyer.org/2020/02/26/meaningful-availability/

The idea is to define availability as "the probability that the site 'appeared' to be up for a random user, averaged over a time window of size w". You can choose a particular value of w and look at trends over time, or you can plot availability as a function of w to understand patterns of downtime.
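To make that concrete, here is a rough Python sketch of the windowed calculation. It is not the paper's implementation. The input format (per-bucket counts of users who saw the site working vs. failing), the function names, and the choice to summarize each w by its worst window are all my own assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical input: one (users_ok, users_failed) pair per time bucket,
# e.g. per minute. The bucket size and the counting method are assumptions.
Bucket = Tuple[int, int]


def windowed_availability(buckets: Sequence[Bucket], w: int) -> List[float]:
    """Availability for every sliding window of w consecutive buckets.

    Availability of a window = users_ok / (users_ok + users_failed),
    summed over the buckets in that window.
    """
    if w <= 0 or w > len(buckets):
        raise ValueError("w must be between 1 and len(buckets)")

    ok = sum(b[0] for b in buckets[:w])
    failed = sum(b[1] for b in buckets[:w])
    result = []
    for start in range(len(buckets) - w + 1):
        if start > 0:
            # Slide the window: add the newest bucket, drop the oldest.
            ok += buckets[start + w - 1][0] - buckets[start - 1][0]
            failed += buckets[start + w - 1][1] - buckets[start - 1][1]
        total = ok + failed
        # Assumption: a window with no traffic counts as fully available.
        result.append(ok / total if total else 1.0)
    return result


def worst_window(buckets: Sequence[Bucket], w: int) -> float:
    """Lowest availability seen in any window of size w."""
    return min(windowed_availability(buckets, w))


if __name__ == "__main__":
    # Made-up data: three healthy buckets, then half of users failing.
    buckets = [(100, 0), (99, 1), (100, 0), (50, 50)]
    for w in range(1, len(buckets) + 1):
        print(w, worst_window(buckets, w))
```

Fixing w and plotting `windowed_availability` gives the trend over time. Plotting `worst_window` against w shows whether downtime comes as one long outage or many short blips.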