
Wild guess: visiting a private sub requires an extra call to a service/db to check if the user can view it. Normally there are only a small number of these checks because private subs were usually small communities. Now, many large subs having switched to private is causing some poor microservice somewhere to get hammered.
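
Roughly the kind of per-request check I'm picturing (made-up names, obviously not Reddit's actual code):

    # Hypothetical sketch of the speculated check; none of these names are real.
    def can_view(user_id, subreddit_id, db):
        sub = db.get_subreddit(subreddit_id)
        if not sub.is_private:
            return True          # public subs take the cheap path
        # Private subs need an extra lookup: is this user an approved member?
        return db.is_approved_member(subreddit_id, user_id)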


I never worked at this scale, but could it also be that different subs are horizontally scaled, and with so many people falling back to the subs that are still open, the load is unevenly balanced?


Good question! And few people get to work at this scale, so it's not an unreasonable guess. I'll join you in speculating wildly about this, since, hey, it's kind of fun.

IMHO sharding traffic by subreddit doesn't pass the smell-test, though. Different subreddits have very different traffic patterns, so the system would likely end up with hotspots continuously popping up, and it'd probably be toilsome to constantly babysit hot shards and rebalance. (Consider some of the more academic subreddits vs. some of the more meme-driven subreddits — and then consider what happens when e.g. a particular subreddit takes off, or goes cold.)

Sharding on a dimension that has a more random/uniform distribution is usually the way to go. Off the top of my head (still speculating wildly and basically doing a 5-minute version of the system-design question just for fun), I'd be curious to shard by a hash of the post ID, or something like that. The trick is always to have a hashing algorithm that's stable when it's time to grow the number of shards (otherwise you're kicking off a whole re-balancing every time), and of course I'm too lazy to sort that out in this comment. I vaguely remember the Instagram team had a really cool sharding approach that they blogged about in this vein. (This would've been pre-acquisition, so ancient history by Silicon Valley standards.)
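
For illustration only (and definitely not Reddit's actual scheme), the usual trick is something like a consistent-hash ring keyed on post ID, where adding a shard only moves a small fraction of keys:

    # Toy consistent-hash ring. Adding a shard only remaps roughly 1/N of the
    # keys instead of reshuffling everything.
    import bisect, hashlib

    class Ring:
        def __init__(self, shards, vnodes=64):
            self.points = []
            for shard in shards:
                for i in range(vnodes):
                    h = int(hashlib.md5(f"{shard}:{i}".encode()).hexdigest(), 16)
                    self.points.append((h, shard))
            self.points.sort()

        def shard_for(self, post_id):
            h = int(hashlib.md5(str(post_id).encode()).hexdigest(), 16)
            idx = bisect.bisect(self.points, (h,)) % len(self.points)
            return self.points[idx][1]

    ring = Ring(["shard-a", "shard-b", "shard-c"])
    print(ring.shard_for("t3_abc123"))  # the same post always lands on the same shard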

As for subreddit metadata (public, private, whatever), I'd really expect all of that to be in a global cache at this point. It's read-often/write-rarely data, and close-to-realtime cache-invalidation when it does change is a straightforward and solved problem.
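
Something along these lines, say (hypothetical cache/DB interfaces, just to show the read-often/write-rarely shape):

    # Read-through cache with delete-on-write invalidation for subreddit metadata.
    # Writes (a mod flipping a sub private) are rare, so invalidation stays cheap.
    def get_subreddit_meta(sub_id, cache, db):
        meta = cache.get(f"sub_meta:{sub_id}")
        if meta is None:
            meta = db.load_subreddit(sub_id)       # only on a cache miss
            cache.set(f"sub_meta:{sub_id}", meta)
        return meta

    def set_subreddit_private(sub_id, is_private, cache, db):
        db.update_subreddit(sub_id, is_private=is_private)
        cache.delete(f"sub_meta:{sub_id}")         # next read repopulates with fresh data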


For really, really large sets, you'll still eventually want to reduce read compute costs by limiting specific tenants to specific shards in order to reduce fan-out for every single read request. If, say, I have a super quiet forum, does it make sense to query 2 shards or 6000? Clearly there's a loss of performance when every read request has unbounded fan-out.
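
Something like pinning each tenant to a small subset of shards (toy illustration, invented numbers, not Reddit's design):

    # Toy tenant placement: pin each subreddit to k shards so a read touches
    # k shards instead of fanning out to all of them.
    import hashlib

    NUM_SHARDS = 6000
    SHARDS_PER_TENANT = 3

    def shards_for_tenant(subreddit, n=NUM_SHARDS, k=SHARDS_PER_TENANT):
        base = int(hashlib.sha1(subreddit.encode()).hexdigest(), 16)
        return [(base + i) % n for i in range(k)]

    # A quiet forum's reads only ever hit these few shards:
    print(shards_for_tenant("r/somequietforum"))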


A tempting, but wrong, assumption is that Reddit's engineers know what they're doing.

The founding Reddit team was non-technical (even spez: I've been to UVA, it's not an engineering school, and spez had never done any real engineering before joining Reddit). They ran a skeleton crew of some 10ish people for a long time (none from any exceptional backgrounds; one of them was from Sun & Oracle, after their prime).

Same group that started with a Node front-end and Python back-end monolith, with a document-oriented database structure (i.e., they had two un-normalized tables in Postgres to hold everything). Later they switched to Cassandra and kept that monstrosity -- partly because back then no one knew anything about databases except sysadmins.

Back then they were running a cache for listing retrieval. Every "reddit" (subreddit and the various pages, like front-page, hot, top, controversial, etc.) listing is kept in memcache. Inside, you have your "top-level" listing information (title, link, comments, id, etc.). The asinine thing is that cache invalidation has always been a problem. They originally handled it with RabbitMQ queues: votes come in, they're processed, and then the cache is updated. Those queues always got backed up, because no one thought to batch updates on a timer or to go lock-free, and no one knew how to do vertical scaling (when they tried, it made things even harder to reason about). You know what genius plan they had next to solve this? Make more queues. Fix throughput issues by making more queues, instead of fixing the root cause of the back-pressure. Later they did "shard"/partition things more cleanly (and tried lock-free) -- but they never did any real research into fixing the underlying problem: how to handle millions of simple "events" a day... which is laughable thinking back on it now.
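
The timer-batching fix I mean would look roughly like this (invented names; the point is just coalescing thousands of vote events into one cache write per post per interval):

    # Sketch of coalescing vote events instead of touching the listing cache per vote.
    import threading, time
    from collections import defaultdict

    pending = defaultdict(int)   # post_id -> net vote delta since last flush
    lock = threading.Lock()

    def on_vote(post_id, delta):
        # Called once per vote message: just accumulate, don't touch the cache.
        with lock:
            pending[post_id] += delta

    def flush_loop(update_listing_cache, interval=2.0):
        # One cache write per post per interval, no matter how many votes arrived.
        while True:
            time.sleep(interval)
            with lock:
                batch = dict(pending)
                pending.clear()
            for post_id, delta in batch.items():
                update_listing_cache(post_id, delta)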

That's just for listings. The comment trees are another big bad piece. Again, stored un-normalized -- but this time actually batched (not properly, but it's a step up). One "great" thing about un-normalized databases and trees is that there are no constraints on vertices. So a common issue was that the queue for computing the comment trees would back up (again), because messages never got processed properly, and you could slow the entire site to a crawl because your message broker was wasting its time on erroneous messages.

Later, they had the bright idea to move from a data center to AWS -- break everything up into microservices. Autoscaling there has always been bungled.

There was no strong tech talent, and no strong engineering culture -- ever.

-------

My 2 cents: it's the listing caches. The architecture around them was never designed to handle checking so many "isHidden" subreddits (even though those subs are still getting updates to their internal listings) -- and it's coming undone.


I read this as a pretty scathing dressing down of incompetent engineering at reddit. But after having breakfast, what I'm realizing again is that perfect code and engineering are not required to make something hugely successful.

I've been part of 2 different engineering teams that wrote crap I would cuss out but were super successful, and most recently I joined a team that was super anal about every little detail. I think business success only gets hindered at the extremes. If you're somewhere in the middle, it's fine. I'd rather have a buggy product that people use than a perfect codebase that only exists on GitHub.


Agreed. In fact, I believe success stories actually skew the other way. Those that actually build something that gets off the ground and is successful will in many cases not have the time to write perfect code.



Yep, this is definitely just speculation, but I think this is it. Code/queries that worked fine at small load for private subs just don't work at scale when tons of subs are private.


Isn't the source code of Reddit on GitHub? At least it used to be...


They stopped updating/supporting the code about 6 months before the big redesign (last push Oct '17, redesign launched Apr '18). Call me a bit of a conspiracy-theorist, but they just happened to raise $200 million on their Series C less than 3 months prior to abandoning their commitment to open-source.


They have two older versions up but nothing recent.


Wouldn't want the recent stuff anyways (except to pontificate)


It could also be that the cache-hit rate is like 1/10th normal with the would-be front page full of smaller subreddits today.


My personal guess is it's down on purpose so they can say only 5% of the subs that said they would go dark actually did ("we just happened to have a service outage that day") and push their own narrative to investors. Spez doesn't care anymore; he's focused on that IPO payday, and to hell with everyone else. He's a liability to the company now, but the board isn't acting.


Here's a site which tracks how many subs have been made private. It's actually crazy that it's up to 93% of them now: https://reddark.untone.uk/


The site is deceiving: it's 93% of the ones that committed to going dark, not 93% of all subreddits.

If you search for r/news for example, it's nowhere on that page.


And most of the ones that "haven't" have restricted submissions... e.g. no new posts.


I've tried a few listed as public that are actually private. Reddit is darker than the report shows (as they mention).


It's funny that /r/therewasanattempt and /r/whatcouldgowrong are on that list but still public (or public again?)


/r/amitheasshole staying public is funny too


It only lists subs that said they would go dark; it doesn't list all subs. I own a small sub that isn't on there, for example.


Yup, this seems super plausible. Even things like the frontpage feed and user comment history probably work on the assumption that most of the data they're pulling is visible (which leads to the "just filter it in the backend" approach), but also every external link or bookmark into a now-private sub will trigger the same kind of check.

This is likely shifting load in very unpredictable ways... I'm sure a sibling comment is right that it's probably less load overall, but it'll be going down codepaths that aren't normally exercised by >95% of requests and weren't built on the assumption that virtually all content is hidden.

There are probably some microservice instances currently melting themselves into a puddle until they can deploy additional instances, add DB shards, or roll out patches to fix dumb shit that wasn't dumb until the usage assumptions were suddenly inverted. Meanwhile, tons of other instances are probably sitting idle.
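
The "fetch then filter" shape would be something like this (hypothetical helpers): fine when almost nothing gets filtered out, miserable when ~95% of it does, because you keep re-fetching and re-checking just to fill one page.

    # Sketch of the "just filter it in the backend" pattern (made-up helper names).
    def build_feed(user, fetch_candidates, can_view, page_size=25):
        feed, cursor = [], None
        while len(feed) < page_size:
            posts, cursor = fetch_candidates(cursor, limit=page_size)
            if not posts:
                break
            # Every post from a now-private sub costs an access check and gets dropped.
            feed.extend(p for p in posts if can_view(user, p.subreddit))
        return feed[:page_size]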


Having worked at this scale, this is a fine guess! This scenario would have been a distant edge case for them. They likely didn't optimize for it. BOOM.


Personally I'd like to believe that the servers themselves are standing in solidarity with the blackout over the API changes. I, for one, welcome our new robot overlords.


Came here just to say this.

Speculation, but having major subs go private changes the load profile, which may have resulted in the outage. Reddit certainly wasn't optimized for this.


I would imagine that a normal visit generates more backend traffic, given that it needs to fetch posts, thumbnails, etc., whereas a visit to a private sub wouldn't need to do much more than check authorization.

I could easily be wrong though, I haven't done web development for years.


They use a microservice architecture. Some services could scale well in servicing all those assets. What handles checking access to private subs may not.

You can't treat scaling as a binary feature, as if the whole system either does or doesn't scale.


Sure, but that type of traffic is expected and they can handle it with things like caching and autoscaling. I'm suggesting that a part of the system that usually doesn't get a lot of requests wasn't designed to handle a huge influx of requests.


All that other stuff's easy to cache. Authorization's cacheable too, kinda, with some trade-offs, but they may not have bothered if it'd never been a problem before. Or this particular check may have been bypassing that caching, and it'd never caused a problem before because there weren't that many of those checks happening.

You start getting a lot more DB queries than usual, bypassing cache, the DB's remote, it's clustered or whatever, now you've got a ton more internal network traffic and open sockets, latency goes up, resources start to get exhausted...
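
The trade-off version would be something like a short-TTL cache on the privacy flag: accept a few seconds of staleness in exchange for not hitting the DB on every request (illustrative only, made-up names).

    # Illustrative short-TTL cache for the "is this sub private?" flag.
    # Trade-off: a sub flipped private may still render as public for up to TTL seconds.
    import time

    _cache = {}   # subreddit_id -> (is_private, expires_at)
    TTL = 30.0

    def is_private(subreddit_id, db_lookup):
        entry = _cache.get(subreddit_id)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]
        value = db_lookup(subreddit_id)   # the DB only gets hit on a miss or expiry
        _cache[subreddit_id] = (value, now + TTL)
        return value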


Not so wild IMO. I frequently have trouble loading pages on Reddit so I suspect any additional pressure could push it over the edge. It might be as simple as more users checking in to see if their favorite Reddit has gone private or shut down.


If a DB check is needed to see if a sub is private or not, it has to happen for every request. You can't limit the check to private subs, because it's not known whether they are private at the time.

Reddit goes wrong often so I expect this outage could have any number of causes.


> You can’t just limit the check to private subs because it’s not known if they are private or not at the time.

That's not necessarily true. Perhaps the status of subreddits is cached (because there's no reason to hit the DB 100 times/second to check if r/funny is private or public). But for a given request to a private sub, it would need to check each user.


You don’t reckon it’s just disgruntled users running “set all my past comments to Boo Reddit Boo” scripts? I don’t imagine it’d take a huge proportion of users doing that simultaneously to slag the servers.



