
Wild guess: visiting a private sub requires an extra call to a service/db to check if the user can view it. Normally there are only a small number of these checks because private subs were usually small communities. Now, many large subs having switched to private is causing some poor microservice somewhere to get hammered.
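
Roughly the kind of per-request check I'm picturing (made-up names, obviously not Reddit's actual code):

    # Hypothetical sketch of the speculated check; none of these names are real.
    def can_view(user_id, subreddit_id, db):
        sub = db.get_subreddit(subreddit_id)
        if not sub.is_private:
            return True          # public subs take the cheap path
        # Private subs need an extra lookup: is this user an approved member?
        return db.is_approved_member(subreddit_id, user_id)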


I never worked at this scale, but could it also be that different subs are horizontally scaled, and with so many people falling back to the subs that are still open, the load is unevenly balanced?


Good question! And few people get to work at this scale, so it's not an unreasonable guess. I'll join you in speculating wildly about this, since, hey, it's kind of fun.

IMHO sharding traffic by subreddit doesn't pass the smell-test, though. Different subreddits have very different traffic patterns, so the system would likely end up with hotspots continuously popping up, and it'd probably be toilsome to constantly babysit hot shards and rebalance. (Consider some of the more academic subreddits vs. some of the more meme-driven subreddits — and then consider what happens when e.g. a particular subreddit takes off, or goes cold.)

Sharding on a dimension that has a more random/uniform distribution is usually the way to go. Off the top of my head (still speculating wildly and basically doing a 5-minute version of the system-design question just for fun), I'd be curious to shard by a hash of the post ID, or something like that. The trick is always to have a hashing algorithm that's stable when it's time to grow the number of shards (otherwise you're kicking off a whole re-balancing every time), and of course I'm too lazy to sort that out in this comment. I vaguely remember the Instagram team had a really cool sharding approach that they blogged about in this vein. (This would've been pre-acquisition, so ancient history by Silicon Valley standards.)
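
For illustration only (and definitely not Reddit's actual scheme), the usual trick is something like a consistent-hash ring keyed on post ID, where adding a shard only moves a small fraction of keys:

    # Toy consistent-hash ring. Adding a shard only remaps roughly 1/N of the
    # keys instead of reshuffling everything.
    import bisect, hashlib

    class Ring:
        def __init__(self, shards, vnodes=64):
            self.points = []
            for shard in shards:
                for i in range(vnodes):
                    h = int(hashlib.md5(f"{shard}:{i}".encode()).hexdigest(), 16)
                    self.points.append((h, shard))
            self.points.sort()

        def shard_for(self, post_id):
            h = int(hashlib.md5(str(post_id).encode()).hexdigest(), 16)
            idx = bisect.bisect(self.points, (h,)) % len(self.points)
            return self.points[idx][1]

    ring = Ring(["shard-a", "shard-b", "shard-c"])
    print(ring.shard_for("t3_abc123"))  # the same post always lands on the same shard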

As for subreddit metadata (public, private, whatever), I'd really expect all of that to be in a global cache at this point. It's read-often/write-rarely data, and close-to-realtime cache-invalidation when it does change is a straightforward and solved problem.
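
Something along these lines, say (hypothetical cache/DB interfaces, just to show the read-often/write-rarely shape):

    # Read-through cache with delete-on-write invalidation for subreddit metadata.
    # Writes (a mod flipping a sub private) are rare, so invalidation stays cheap.
    def get_subreddit_meta(sub_id, cache, db):
        meta = cache.get(f"sub_meta:{sub_id}")
        if meta is None:
            meta = db.load_subreddit(sub_id)       # only on a cache miss
            cache.set(f"sub_meta:{sub_id}", meta)
        return meta

    def set_subreddit_private(sub_id, is_private, cache, db):
        db.update_subreddit(sub_id, is_private=is_private)
        cache.delete(f"sub_meta:{sub_id}")         # next read repopulates with fresh data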


For really, really large sets, you'll still eventually want to reduce read compute costs by limiting specific tenants to specific shards in order to reduce fan-out for every single read request. If, say, I have a super quiet forum, does it make sense to query 2 shards or 6000? Clearly there's a loss of performance when every read request has unbounded fan-out.
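
Something like pinning each tenant to a small subset of shards (toy illustration, invented numbers, not Reddit's design):

    # Toy tenant placement: pin each subreddit to k shards so a read touches
    # k shards instead of fanning out to all of them.
    import hashlib

    NUM_SHARDS = 6000
    SHARDS_PER_TENANT = 3

    def shards_for_tenant(subreddit, n=NUM_SHARDS, k=SHARDS_PER_TENANT):
        base = int(hashlib.sha1(subreddit.encode()).hexdigest(), 16)
        return [(base + i) % n for i in range(k)]

    # A quiet forum's reads only ever hit these few shards:
    print(shards_for_tenant("r/somequietforum"))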


A tempting, but wrong, assumption is that Reddit's engineers know what they're doing.

The founding Reddit team was non-technical (even spez: I've been to UVA, it's not an engineering school, and spez had never done any real engineering before joining Reddit). They ran a skeleton crew of some 10ish people for a long time (none from any exceptional backgrounds; one of them was from Sun & Oracle, after their prime).

Same group that started with a Node front-end and Python back-end monolith, with a document-oriented database structure (i.e., they had two un-normalized tables in Postgres to hold everything). Later they switched to Cassandra and kept that monstrosity -- partly because back then no one knew anything about databases except sysadmins.

Back then they were running a cache for listing retrieval. Every "reddit" (subreddit and the various pages, like front-page, hot, top, controversial, etc.) listing is kept in memcache. Inside, you have your "top-level" listing information (title, link, comments, id, etc.). The asinine thing is that cache invalidation has always been a problem. They originally handled it with RabbitMQ queues: votes come in, they're processed, and then the cache is updated. Those queues always got backed up, because no one thought to batch updates on a timer or to go lock-free, and no one knew how to do vertical scaling (when they tried, it made things even harder to reason about). You know what genius plan they had next to solve this? Make more queues. Fix throughput issues by making more queues, instead of fixing the root cause of the back-pressure. Later they did "shard"/partition things more cleanly (and tried lock-free) -- but they never did any real research into fixing the underlying problem: how to handle millions of simple "events" a day... which is laughable thinking back on it now.
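
The timer-batching fix I mean would look roughly like this (invented names; the point is just coalescing thousands of vote events into one cache write per post per interval):

    # Sketch of coalescing vote events instead of touching the listing cache per vote.
    import threading, time
    from collections import defaultdict

    pending = defaultdict(int)   # post_id -> net vote delta since last flush
    lock = threading.Lock()

    def on_vote(post_id, delta):
        # Called once per vote message: just accumulate, don't touch the cache.
        with lock:
            pending[post_id] += delta

    def flush_loop(update_listing_cache, interval=2.0):
        # One cache write per post per interval, no matter how many votes arrived.
        while True:
            time.sleep(interval)
            with lock:
                batch = dict(pending)
                pending.clear()
            for post_id, delta in batch.items():
                update_listing_cache(post_id, delta)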

That's just for listings. The comment trees are another big bad piece. Again, stored un-normalized -- but this time actually batched (not properly, but it's a step up). One "great" thing about un-normalized databases and trees is that there are no constraints on vertices. So a common issue was that the queue for computing the comment trees would back up (again), because messages never got processed properly, and you could slow the entire site to a crawl because your message broker was wasting its time on erroneous messages.

Later, they had the bright idea to move from a data center to AWS -- break everything up into microservices. Autoscaling there has always been bungled.

There was no strong tech talent, and no strong engineering culture -- ever.

-------

My 2 cents: it's the listing caches. The architecture around them was never designed to handle checking so many "isHidden" subreddits (even though those subs are still getting updates to their internal listings) -- and it's coming undone.


I read this as a pretty scathing dressing down of incompetent engineering at reddit. But after having breakfast, what I'm realizing again is that perfect code and engineering are not required to make something hugely successful.

I've been part of 2 different engineering teams that wrote crap I would cuss out but were super successful, and most recently I joined a team that was super anal about every little detail. I think business success only gets hindered at the extremes. If you're somewhere in the middle, it's fine. I'd rather have a buggy product that people use than a perfect codebase that only exists on GitHub.


Agreed. In fact, I believe success stories actually skew the other way. Those that actually build something that gets off the ground and is successful will in many cases not have the time to write perfect code.



Yep, this is definitely just speculation, but I think this is it. Code/queries that worked fine at small load for private subs just don't work at scale when tons of subs are private.


Isn't the source code of Reddit on GitHub? At least it used to be...


They stopped updating/supporting the code about 6 months before the big redesign (last push Oct '17, redesign launched Apr '18). Call me a bit of a conspiracy-theorist, but they just happened to raise $200 million on their Series C less than 3 months prior to abandoning their commitment to open-source.


They have two older versions up but nothing recent.


Wouldn't want the recent stuff anyways (except to pontificate)


It could also be that the cache-hit rate is like 1/10th normal with the would-be front page full of smaller subreddits today.


My personal guess is it's down on purpose so they can say only 5% of the subs that said they would go dark actually did ("we just happened to have a service outage that day") and push their own narrative to investors. Spez doesn't care anymore; he's focused on that IPO payday, and to hell with everyone else. He's a liability to the company now, but the board isn't acting.


Here's a site which tracks how many subs have been made private. It's actually crazy that it's up to 93% of them now: https://reddark.untone.uk/


The site is deceiving: it's 93% of the ones that committed to going dark, not 93% of all subreddits.

If you search for r/news for example, it's nowhere on that page.


And most of the ones that "haven't" have restricted submissions... e.g. no new posts.


I've tried a few listed as public that are actually private. Reddit is darker than the report shows (as they mention).


It's funny that /r/therewasanattempt and /r/whatcouldgowrong are on that list but still public (or public again?)


/r/amitheasshole staying public is funny too


It only lists subs that said they would go dark; it doesn't list all subs. I own a small sub that isn't on there, for example.


Yup, this seems super plausible. Even things like the frontpage feed and user comment history probably work on the assumption that most of the data they're pulling is visible (which leads to the "just filter it in the backend" approach), but also every external link or bookmark into a now-private sub will trigger the same kind of check.

This is likely shifting load in very unpredictable ways... I'm sure a sibling comment is right that it's probably less load overall, but it'll be going down codepaths that aren't normally exercised by >95% of requests and weren't built on the assumption that virtually all content is hidden.

There are probably some microservice instances currently melting themselves into a puddle until they can deploy additional instances, add DB shards, or roll out patches to fix dumb shit that wasn't dumb until the usage assumptions were suddenly inverted. Meanwhile, tons of other instances are probably sitting idle.
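
The "fetch then filter" shape would be something like this (hypothetical helpers): fine when almost nothing gets filtered out, miserable when ~95% of it does, because you keep re-fetching and re-checking just to fill one page.

    # Sketch of the "just filter it in the backend" pattern (made-up helper names).
    def build_feed(user, fetch_candidates, can_view, page_size=25):
        feed, cursor = [], None
        while len(feed) < page_size:
            posts, cursor = fetch_candidates(cursor, limit=page_size)
            if not posts:
                break
            # Every post from a now-private sub costs an access check and gets dropped.
            feed.extend(p for p in posts if can_view(user, p.subreddit))
        return feed[:page_size]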


Having worked at this scale, this is a fine guess! This scenario would have been a distant edge case for them. They likely didn't optimize for it. BOOM.


Personally I'd like to believe that the servers themselves are standing in solidarity with the blackout over the API changes. I, for one, welcome our new robot overlords.


Came here just to say this.

Speculation, but having major subs go private changes the load profile, which may have resulted in the outage. Reddit certainly wasn't optimized for this.


I would imagine that a normal visit generates more backend traffic, given that it needs to fetch posts, thumbnails, etc., whereas a visit to a private sub wouldn't need to do much more than check authorization.

I could easily be wrong though, I haven't done web development for years.


They use a microservice architecture. Some services could scale well in servicing all those assets. What handles checking access to private subs may not.

You can't treat scaling as a binary feature, as if the whole system either does or doesn't scale.


Sure, but that type of traffic is expected and they can handle it with things like caching and autoscaling. I'm suggesting that a part of the system that usually doesn't get a lot of requests wasn't designed to handle a huge influx of requests.


All that other stuff's easy to cache. Authorization's cacheable too, kinda, with some trade-offs, but they may not have bothered if it'd never been a problem before. Or this particular check may have been bypassing that caching, and it'd never caused a problem before because there weren't that many of those checks happening.

You start getting a lot more DB queries than usual, bypassing cache, the DB's remote, it's clustered or whatever, now you've got a ton more internal network traffic and open sockets, latency goes up, resources start to get exhausted...
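
The trade-off version would be something like a short-TTL cache on the privacy flag: accept a few seconds of staleness in exchange for not hitting the DB on every request (illustrative only, made-up names).

    # Illustrative short-TTL cache for the "is this sub private?" flag.
    # Trade-off: a sub flipped private may still render as public for up to TTL seconds.
    import time

    _cache = {}   # subreddit_id -> (is_private, expires_at)
    TTL = 30.0

    def is_private(subreddit_id, db_lookup):
        entry = _cache.get(subreddit_id)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]
        value = db_lookup(subreddit_id)   # the DB only gets hit on a miss or expiry
        _cache[subreddit_id] = (value, now + TTL)
        return value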


Not so wild IMO. I frequently have trouble loading pages on Reddit so I suspect any additional pressure could push it over the edge. It might be as simple as more users checking in to see if their favorite Reddit has gone private or shut down.


If a DB check is needed to see if a sub is private or not, it has to happen for every request. You can't limit the check to private subs, because it's not known whether they are private at the time.

Reddit goes wrong often so I expect this outage could have any number of causes.


> You can’t just limit the check to private subs because it’s not known if they are private or not at the time.

That's not necessarily true. Perhaps the status of subreddits is cached (because there's no reason to hit the DB 100 times/second to check if r/funny is private or public). But for a given request to a private sub, it would need to check each user.


You don’t reckon it’s just disgruntled users running “set all my past comments to Boo Reddit Boo” scripts? I don’t imagine it’d take a huge proportion of users doing that simultaneously to slag the servers.



