Reddit's photo albums broke due to Integer overflow of Signed Int32

madrox · on Dec 13, 2022

A long time ago we discovered Twitter used a round robin of three servers for assigning IDs to tweets. We inferred the round robin was done by doing mod 3 of a signed int32, and because that space doesn't divide neatly by two it meant one of the three servers saw less load than the others and we could map ID assignment volume according to how often it overflowed and hence estimate total tweet volume for a given period.

Some of the details escape me (this was a decade ago) but it was a fun combination of statistical inference and CS knowledge that I don't get to use often. Whenever integer overflow comes up in a systems engineering context I get a little tickled.

liquidgecka · on Dec 13, 2022

I am pretty sure that snowflake didn't use a mod of a signed int32. It used a service discovery pool as part of finagle (and prior to that dns iirc). The server used a very simple method internally to convert time into a integer (that was 52 bits because of javascript). In fact it was completely open source: https://blog.twitter.com/engineering/en_us/a/2010/announcing...

The integer generation was pretty simple, there was a fixed id of each server, and unless I a mistaken we have 5 servers per datacenter. Each id was basically <time><offset><id> where time was a millisecond timer, offset was the number of ids generated in that same millisecond by the same server, and id was the machines unique identifier. When we first talked about this process I thought that offset was going to roll, every id would increment it by one. This was changed to resetting it every millisecond specifically so that it would obscure tweet volumes.

At the time I remember reading a LOT of articles estimating tweet volume and most of them were way, way off. I don't know that we ever really put effort into correcting them though. =)

* - Does not account for changes in the system post 2012.

madrox · on Dec 13, 2022

Awesome to hear from someone on the other side of the API with knowledge of this. This is ringing bells for me. Yeah, the id parameter was how we knew how many servers there were, and we saw more assignment in some servers than others that neatly mapped to a int32 max failing to divide by the number of IDs we saw. I thought I recall Twitter confirming that was how round robin happened but I could be totally misremembering. We never got a contact to talk to Twitter about it. FWIW, we did eventually see this fixed. I imagine it was pretty easy to spot one server seeing less load than others.

The offset was actually how we calculated volume, because millisecond collisions become a variant of the german tank problem[1]. A few times when y'all made tweet volumes public it mapped pretty closely with our estimates.

This was around 2011, so your knowledge should be relevant.

1: https://en.wikipedia.org/wiki/German_tank_problem

liquidgecka · on Dec 14, 2022

I think what your describing was actually a problem that I introduced in our bind configuration back when we used DNS for service discovery. Its not exactly what your describing but I can totally see how it would appear that way externally.

So initially we used 2 bind servers with authoritative zones serving using round robin. This worked fairly well as the load on each server was high enough to keep the round robin flowing and the connections from ruby would "stick" for 30 seconds anyway. We had a very large pool of front ends so its not like it mattered. Eventually we had to put a local bind cache on each server, but this introduced another super annoying bug. For some reason, the interaction between the authoritative server, and the caching server would end up causing the served record set to be ordered the same. Normally the authoritative server would serve A, B, C for the first query, then B, C, A for the second, etc. When the cache in the middle for some reason it would reset al queries to A, B, C after the refresh happened. So effectively every 30 seconds it would reset to A, B, C and start round robin again. Since the front ends would connect and stay sticky via connection reuse for a while this meant that A got like 50% of the load, B got 33%, and C got 17%. I am guessing you latched on to this by seeing that one of the servers was horribly underutilized in comparison and reproduced the math accidentally. =)

The fix for this was the massively disgusting, but super effective "make every single machine a DNS zone secondary". Rather than having a simple record cache we just fetched the whole zone if it changed. This actually made DNS quite a bit faster as every machine had a local copy of the zone.. Once that happened distribution returned to normal (20/20/20/20/20) fix for all of our internal services which used DNS for discovery, including snowflake.

madrox · on Dec 14, 2022

This is awesome and completely retconning a 10 year old project for me. I was working on social media analytics at Disney and we were exploring ways to measure twitter conversation of our brands, which led to us attempting to estimate total conversation volume, which is why this technical nuance was relevant to us. It was a wildly experimental time. Thanks for the story!

donut · on Dec 14, 2022

This thread is amazing.. thank you both for sharing!

Throw10987 · on Dec 14, 2022

Agree, was super excited when I saw reference to the German tank problem as that is immediately what I thought of in the first post.

dotancohen · on Dec 14, 2022

I dream of the day in the future, possibly in Eden, when I'll have opportunity to discuss the things I've reverse engineered with the guys and gals who actually engineered them. This conversation in enlightening to witness.

I will have an especially hard time phrasing questions in a respectful manner for the Adobe devs, though ))

liquidgecka · on Dec 14, 2022

Also, totally unrelated Twitter story (because I am nostalgic lately and you reminded me of it)

Early in my time there we had a major, unexpected load spike in the middle of the day. People where tweeting like crazy, breaking records for volume, etc. The system was groaning under strain but it mostly held. Turns out Michael Jackson had died. We sustained 452 (or maybe 457, it was roughly 450-something) tweets a second. this quickly became 1 "MJ" worth of tweets. We informally used this to define load the entire time I was there. Within a few months we got to a point where we were never BELOW 1 MJ, within a year I think we had peaked at double digit MJs and sustained several even in the lowest periods. Before I left we had hit 1 MJ in photo tweets, etc.

Around the time I left we did something like 300 MJ's per second, and I was only there 3 years.

madrox · on Dec 14, 2022

I remember those days before snowflake and Blaine was desperately trying to keep the lights on. It's why no one had time to talk to us (marketers and related disciplines) back then. Even Facebook said "all our focus is on acquiring users. You have no choice but to meet us on our terms because we own the eyeballs." Everything was growing so fast. Hard to believe that was only a decade ago.

pg314 · on Dec 14, 2022

Musk claimed a record of 20,000 of tweets per second recently [1]. How does that square with what you’re saying? 300 times 450 is closer to 150,000. Am I missing something?

[1]https://twitter.com/elonmusk/status/1595505413113323520?s=20...

Clent · on Dec 14, 2022

That tweet does not claim 20,000 is a record.

The event also did not reach 20,000, it "almost" hit it.

pg314 · on Dec 14, 2022

You’re right. I misinterpreted ‘record usage’. I found a 2013 blog post [1] claiming a 143 thousand tweets per second peak.

[1] https://blog.twitter.com/engineering/en_us/a/2013/new-tweets...

throwawaaarrgh · on Dec 14, 2022

That's interesting; I worked on a different site that also got record traffic when MJ died. I wonder if that happened to every site that had a chat or news feature.

Kind of obvious now, but I bet we could detect major world news just by sampling traffic size of chat sites.

lathiat · on Dec 14, 2022

Someone previously created a tweet linking to itself by predicting the likely ID range: https://oisinmoran.com/quinetweet

manigandham · on Dec 14, 2022

A simpler scheme I've used for adtech (billions of requests per day) is to simply reserve a chunk of numbers for each server from a central source. Easy to implement, very fast since each node can just increment in process, and using a 64-bit integer is effectively infinite.

hyperdimension · on Dec 14, 2022

That's how Active Directory Domain Controllers handle the distributed creation of SIDs too.

duskwuff · on Dec 13, 2022

Twitter didn't always use Snowflake -- that was introduced in November 2010. There was another, much simpler algorithm used before that which generated much smaller IDs (e.g. under 3e10).

liquidgecka · on Dec 13, 2022

Yea.. "auto increment" in mysql. it SUCKED in so many ways, primary of which was that it required a single mysql database to be the primary for all tweet writes. Every tweet just incremented a counter by 1. As we scaled up that server got overloaded rapidly and became a huge single point of failure. Snowflake was introduced to solve several problems at once. =)

pwdisswordfish0 · on Dec 14, 2022

> 52 bits because of javascript

But IEEE 754 doubles have a significand that supports a 53-bit range. What am I missing?

Dylan16807 · on Dec 15, 2022

You're not missing anything, people just forget the implied bit sometimes.

spullara · on Dec 14, 2022

negative numbers not used?

pwdisswordfish0 · on Dec 14, 2022

That doesn't make any sense.

vecter · on Dec 13, 2022

Wouldn't that unevenness only affect 2^31 - 2 and 2^31 - 1, so a negligible fraction of the integers? Was that tiny discrepancy enough to make your calculations?

In other words, what do you mean that it was done by doing mod 3 of a signed int32? If it was a monotonically increasing or random int32, I don't see how that unevenness would manifest in a meaningful way.

madrox · on Dec 14, 2022

In another subthread, we realized my memory was wrong and we were measuring millisecond collisions. The serving ID imbalance was a side-effect. Also, it might've been an int16 I was thinking of but turns out the whole thing was shadows on cave walls.

mashygpig · on Dec 13, 2022

Maybe a dumb question, but I don’t follow what you mean by “the space doesn’t divide neatly by two” and also how that connects with overflowing ints. Asking because I’m genuinely curious and would like to know more about this. Sounds really neat!

_3u10 · on Dec 13, 2022

If they were incrementing and modding wouldn’t that server see an extra 1/2 billionth more traffic?

I don’t get how mod 3 affects anything if you’re just incrementing…

manigandham · on Dec 14, 2022

If it’s round robin then it should be an even load, how does the modulo change that exactly?

Also what number are they using to modulo and where is that happening? Because at that point don’t they already have an incrementing ID before generating another one?

andreareina · on Dec 14, 2022

Take a 3-bit counter:

    0->A
    1->B
    2->C
    3->A
    4->B
    5->C
    6->A
    7->B

A and B get hit three times while C only twice, so it will see 66% utilization compared to A and B

EDITED s/once/twice/ thanks CyberDildonics

CyberDildonics · on Dec 14, 2022

while C only once

You listed C twice

woodruffw · on Dec 14, 2022

That's a typo. You can still see they're correct about the ratio (two "C"s for every three "A"s and "B"s).

manigandham · on Dec 14, 2022

There's no ratio. It's even across all of them, as long as the integer keeps incrementing. One more number (9) instead of stopping at 8 and there would be an even spread.

ahmedtd · on Dec 14, 2022

Not when the counter overflows back to 0. If it's a 3 bit counter, 0 is A again, not C.

manigandham · on Dec 14, 2022

The comment says a 32-bit signed int. Where is the 3-bit assumption coming from?

iforgotpassword · on Dec 14, 2022

The comment starts with that assumption for the sake of a concise example. do you expect them to write the whole sequence out using a 32bit counter? :-D

Thorrez · on Dec 14, 2022

When you divide 2^32 across 3 machines, you get 1431655766, 1431655765, 1431655765. They're nearly identical, just an itsy bitsy bit different. This isn't going to cause one machine to have noticeably less load than the others.

iforgotpassword · on Dec 15, 2022

That's not what's being discussed here though.

Thorrez · on Dec 15, 2022

Then what is being discussed? I thought this thread was about 3 servers, of which one has lower load due to an int32 overflowing.

manigandham · on Dec 14, 2022

Then it's irrelevant. The point is that the remainders are evenly distributed as the integer is incremented, with at most a single round being short. That's just how the math works.

iainmerrick · on Dec 14, 2022

"A single round being short" is exactly what they're talking about.

Edit to add: whether that's a big enough effect for the use case they're talking about, I don't know. This sort of thing is definitely significant in cryptographic code, though.

andreareina · on Dec 14, 2022

Edited, thanks

manigandham · on Dec 14, 2022

That doesn't change anything... it's still round robin. You just stopped at an arbitrary number of 8 integers instead of 9.

lofatdairy · on Dec 14, 2022

It's not arbitrary, GP stated it was a 3 bit counter. In GGP (or something, not sure how far down in the thread we are), they were referring to a 32 bit int counter until overflow. If you bin each number from 0 to 2^32 - 1 by mod 3, you don't get 3 bins of equal sizes, 1 bin always comes out smaller.

manigandham · on Dec 14, 2022

Smaller by a single number at most, it's effectively insignificant. Not sure where the 3-bit counter assumption came from as the original post said 32bit signed int.

pedrovhb · on Dec 13, 2022

That's really neat, I'd love to hear more about it. Was this something you were actively trying to find out, or was it poking around until something caught your eye?

madrox · on Dec 14, 2022

A little of both. I was working in social media analytics, and we were collecting everything we could to understand how to communicate the value of this new medium to businesses who could use twitter for marketing. This was still in an era where privacy wasn't at the front of anyone's minds, so there were zero retention policies. Hard to believe that was only 10 years ago.

Eventually, we learned to treat Twitter as a lead generation tool for off-platform activity and apply old school funnel mechanics to it. The next problem became how to build a follower count. Sadly, that problem is what I think led to extremism on the platform. Hence: https://madrox.substack.com/p/yet-another-quitting-twitter

kypro · on Dec 13, 2022

Not op, but at an ecommerce company I worked for we did similar things to track how well our competitors were doing relative to us so could be something like that.

Also collecting data like this can be useful if you want to beat markets.

ipqk · on Dec 14, 2022

Reminds me of the self-quoting tweet: https://news.ycombinator.com/item?id=25244872

boosteri · on Dec 13, 2022

Nostalgic flashback to Premier Manager games, where players stats decreased as they aged. When they went /below 0/ they flipped around to 127. So a good strategy was to scout out really bad players from lower leagues about to hit age 30+. And offer them very long contracts to prevent them retiring .. and give time for most of their stats to flip around, turning them into superstars.

johnfarrelldev · on Dec 14, 2022

I always found the funniest occurrence of this was in the Civ game though it seems it originally being a bug is disputed.

https://en.wikipedia.org/wiki/Nuclear_Gandhi

vikingerik · on Dec 14, 2022

The bug never existed at all in Civ 1. It was an urban legend all along. Similar behavior was intentional in Civ 5 as a joke, which convinced everyone that it really did happen in Civ 1 when it never did.

AceJohnny2 · on Dec 15, 2022

Huh.

https://en.wikipedia.org/wiki/Nuclear_Gandhi

RedShift1 · on Dec 14, 2022

Mandela effect

yen223 · on Dec 14, 2022

It's the result of people repeating what they read on the Internet, not what they personally experienced.

This happens way too often

fknorangesite · on Dec 14, 2022

> This happens way too often

You're not wrong; but fyi it only happens in your originating universe. Over here we all experienced it.

Dylan16807 · on Dec 15, 2022

People being lied to is not the Mandela effect. The Mandela effect is about things you personally remember.

A Mandela effect here would be if the explanations themselves that talked about overflow never existed.

Rebles · on Dec 13, 2022

Two days ago, Reddit ids have finally incremented passed the 2,147,483,647, or the maximum range of a signed int32. It seems one of Reddit's subsystems, the one that serves its photo albums broke due to the integer overflow.

cowsup · on Dec 13, 2022

Strange thing is that photo albums re relatively new. Imgur was the go-to host for Reddit, and then they made their own uploader a looong time later. The "albums" functionality only came out in July of 2020, according to a Google search.

Seems this was less likely a "someone else will deal with it" problem, and more of a development / QA testing problem.

Gigachad · on Dec 13, 2022

For some reason most stuff still defaults to i32 and a lot of people use them for new code. At this point I'd not be against linters warning against using 32 bit ints unless you have a good reason.

bushbaba · on Dec 13, 2022

My alma maters vending machines used an unsigned integer for understanding school debt card balance. However the school debt card used a signed integer to allow for small negative balances.

Well students uncovered the flaw and had their cards be at negative 1 dollar, which the vending machine read as a very large balance.

thegeomaster · on Dec 14, 2022

They're still a good choice for most perf critical code due to the reduced memory use.

kuroguro · on Dec 14, 2022

What I don't get is why you'd want a signed int for an id.

curioussavage · on Dec 14, 2022

I’m pretty sure the table in question stores image metadata for all user uploaded images. As well as images scraped from posted links which goes back way before images in posts

jimmytucson · on Dec 14, 2022

A new feature but it wasn’t built on a new codebase. Reddit is a monolith and a lot of things users think of as different “entities” live in the same set of tables.

fdgsdfogijq · on Dec 13, 2022

Someone probably joked they would never reach that scale when they wrote that code

kristopolous · on Dec 13, 2022

Thinking "if that ever gets anywhere close to a problem we'll have vast resources and plenty of time to fix it" and then, I'm guessing, that person left a few months later and nobody owned that part of the code because it worked.

Then 10 or so years went by...

Whenever I write code like that which may break in say, 5 years, I'll sign it in the comments and put my personal email and phone number inviting future people to call me and I'll fix it for free (cause I take responsibilities for my code pretty seriously). Nobody has ever taken me up on it though...

monsieurbanana · on Dec 13, 2022

To people reading this: please never ever do that, unless it was agreed upon with your client/employer beforehand (and compensated accordingly).

If you know the code may break in 5 years, why don't you fix it now? Because your employer doesn't want to pay you to do that (probably for good reason, surely there's higher priorities).

Then why would you do it for free years later?

sosodev · on Dec 13, 2022

I'm fairly certain most managers would argue that I'm wasting my time if I spend it chasing bugs that won't happen for several years.

not2b · on Dec 14, 2022

Sometimes it's because they don't know it will break in 5 years. They weren't anticipating exponential growth and when they started, the i32 range was a thousand times more than they needed. Ten more doublings later ...

kristopolous · on Dec 13, 2022

No.

You can say, anticipate breaking changes that is calendared to happen as an eventually rollover or deprecation

Essentially things have to work one way now and there's solid reason to think it will no longer work that way later and there's no valid path to fix because exogenous conditions change.

This alternatively, can be done in bad faith to guarantee a future paycheck by intentionally placing timebombs in and then charging high rates to come back. There's incentives for abuse.

I'm making it clear and acting in good faith. I can give them 2 hours for free in 5 years, whatever. I'm a competent and responsible person and I get paid well by not being so cheap.

monsieurbanana · on Dec 14, 2022

We're not talking about working in bad faith, that's bad obviously. But there's no reason for most people to give away even 2 hours of their time for free. If you're working as a freelancer and it's your way of building a portfolio of clients, that's a good reason to give away those two hours. But I'd rather not have someone with little work experience read that, think it's reasonable, and apply it themselves.

In my first year of university I was working for a very small web company. I got paid per website an amount that was enough for my student needs, nothing much. I wasn't a very good developer in my 1st year, so I often received emails about bugs in past projects. The bugs were my fault, so naturally I fixed them in my own free time.

It took me 2 years to see how badly I was getting owned by the company. If they wanted less bugs, they should have asked someone with experience and a much higher salary.

josho · on Dec 14, 2022

The employer also shouldn't take you up on the offer because now you are engaging in work without a contract, which opens up contract issues, probably breaks things like SOC compliance, etc.

I like your accountability, but it's sufficient to document the issue in code, open a jira (or equivalent) and move on.

jacobr1 · on Dec 13, 2022

And they weren't even wrong! The challenge is addressing the tech-debt, not the choice to incur it.

silisili · on Dec 14, 2022

Oh god, bad memories. You reminded me of a guy I worked with who wrote absolutely terrible, unsafe code and when called out said they are 'caviar problems.'

When I asked what in the hell a caviar problem was, he said it meant by the time we had to worry about those things, we'd all be so rich we'd be eating caviar.

bagels · on Dec 14, 2022

The reality is that some other person brought in later that is not rich will have to fix it. What a jerk.

shock-value · on Dec 14, 2022

Good practice would be leaving a comment with a TODO, an explanation of the problem, and some basic instructions as to a possible fix.

Leaving your email and phone number is absurd and should not be promoted as good practice. No sane company would take you up on the offer anyway.

lmm · on Dec 14, 2022

That's maybe a reasonable thing to do if you're an independent business selling the code, but it's a scab move if you're doing that as an employee. Never work for free.

CyberDildonics · on Dec 14, 2022

Why would you write something that might break in 5 years? Why not just use a 64 bit integer in any situation like this an be done with it?

geobmx540 · on Dec 14, 2022

Probably sufficient to write a test that will break in 5 years - X [days|months] with notes in the test about the problem and how to fix

nonethewiser · on Dec 14, 2022

Hm. I wonder why no one ever called you.

kristopolous · on Dec 14, 2022

Because there's a dumb culture of treating people who left like they're from beyond the grave where you have to divine their intentions like some programming ouija board.

I broke that unspoken rule and got a really intractable problem that held us up for days finally fixed by making a short phone call early in my career and told myself if I was ever causing what we were going through I'd want to make it clear that I'm open to that rule being broken as well.

Apparently even suggesting such a rule should be questioned and to try human solutions to human problems is met with mockery and disdain. Alright, cool.

bagels · on Dec 14, 2022

It's bold. I'm sure if one of those time bombs went off, the call might not be to ask for help, but to impolitely complain.

sph · on Dec 13, 2022

So you migrate to int64 and one day someone will wonder why the hell did we ever think no one would reach 2^64 rows in a database table. Or that 2^128 IP addresses would be enough for everyone.

WhitneyLand · on Dec 13, 2022

“32-bit IPv4 addresses to 128-bit IPv6 addresses means going from 4.3 billion to 340,282,366,920,938,463,463,374,607,431,768,211,456”

Yes 128 bits is only 4x times the bit length, but the address space is exponentially bigger.

Some predict if we do run out, it might take ~100 years.

josephg · on Dec 13, 2022

> Some predict if we do run out, it might take ~100 years.

You think we can use 340,282,366,920,938,463,463,374,607,431,768,211,456 addresses in 100 years? Thats ... ahem, ambitious!

Dylan16807 · on Dec 14, 2022

In practical terms, IPv6 is arranged as a 64 bit subnet ID and 64 bit device ID.

And the subnets are in a hierarchy that helps with routing but limits the packing efficiency.

We could feasibly "run out" to the point that we can't keep doing things this way.

ipython · on Dec 14, 2022

The truth is that the address space is extremely sparse. Every ipv6 subnet is a /64 by default, yet I’m sure most of us have vastly fewer than 2^64 machines in our networks … at least for now

sokoloff · on Dec 13, 2022

In my experience, you first do an emergency re-seed to use the negative half of the int32 space.

Been through it twice in my career…

marcosdumay · on Dec 13, 2022

Those numbers are much, much larger than 2^32.

No process that touches the real world gets into 2^64. No process at all gets into 2^128, at lwast not while we live around a single star.

_gabe_ · on Dec 14, 2022

I'm just curious, I know it's a long running joke about how we're so stupid to think that we would never run out of unique digits with 2^32 possible values, but is this also the case with 64 bit values? Every new bit doubles the amount of information, so if 32 bits lasted reddit 10 years, presumably 33 bits would last them 20 years, 34 would last 40 and so on. Eventually, 64 bits would last them 10×2^32 years, which seems like a safe bet.

So am I being naive when I use 64 bit values for unique IDs? Or is it actually plausible that 64 bits is plenty of information for centuries to come?

Edit: Also, technically reddit was using signed int32s. So they really only had 2^16 unique digits. If they used unsigned int32s, then that would have bought them a lot of time.

jxf · on Dec 14, 2022

> Edit: Also, technically reddit was using signed int32s. So they really only had 2^16 unique digits. If they used unsigned int32s, then that would have bought them a lot of time.

Small correction: Signed int32s means you have 31 bits for the integer value (2^31 values), not 16 bits (2^16 values). There are 32 bits; 1 bit is used up for the sign, and 31 remain for the number.

_gabe_ · on Dec 14, 2022

Yep, I made the same mistake I was trying to call out here haha. Thanks for the correction!

KMag · on Dec 14, 2022

As others have pointed out, there are 2^31 non-negative values an Int32 can take on, and linear growth is probably a bad assumption.

I should also probably point out some ambiguity in your analysis. To be clear, under the faulty assumption of linear consumption, if Int32s get exhausted in 10 years, switching to UInt32s will last 20 years of which 10 are already spent. So, under the faulty assumption of linearity, switching to UInt32s buys you an extra 10 years.

64-bit identifiers will last quite a while. Another defensible choice would be to switch to 128-bit unguessable identifiers (such as assigning them via AES-256 in counter mode) in order to provide defence in depth. If something goes wrong with your authentication, but the attacker is likely to spend millennia hammering your REST API with invalid IDs before exploiting an actual asset, then the consequences aren't so bad.

rezonant · on Dec 14, 2022

> Edit: Also, technically reddit was using signed int32s. So they really only had 2^16 unique digits. If they used unsigned int32s, then that would have bought them a lot of time.

Signed integers have 2^31 positive values (well, just about). It doesn't take 16 bits to store the negative sign :-)

HideousKojima · on Dec 14, 2022

Not for most implementations, a signed 32 bit int goes from ~-2 billion to ~+2 billion, an unsigned from 0 to ~4 billion

taeric · on Dec 14, 2022

That isn't disagreeing? It is half the numbers. But that is just minus one exponent.

Right?

HideousKojima · on Dec 14, 2022

The GP said it would have "bought Reddit a lot of time," the post I responded to said it was merely 2^31 implying that it wouldn't have bought them a lot of time, that's the part I was replying to (though I didn't word it clearly). 2 billion more values would have bought Reddit a lot of time.

I should have taken more time to read over both posts and respond more clearly though

taeric · on Dec 14, 2022

Ah, I read the GP as a correction that it wasn't 2^16, but 2^31 values. The "doubling" was correctly stated, but the edit said they only had 2^16 values to work with. Almost certainly just a mistake in the post.

rezonant · on Dec 14, 2022

Yeah I think we're all in agreement here.

bagels · on Dec 14, 2022

For most implementations 2^31 is ~2 billion.

lilyball · on Dec 14, 2022

If 32 bits lasted 10 years, 33 bits would not last 20, as engagement is not a constant. If we look at the last 10 years, presumably there were a lot more posts in the second 5 years than in the first 5 years. I don't know what Reddit's actual user/engagement metrics show though. Certainly if they got to a stable point where they had approximately the same activity going forward as they have going back then it would, but that's not usually how social sites work.

That said, it would still take quite a while to get to 64 bits.

bagels · on Dec 14, 2022

This is just basic engineering: What rate do you anticipate creating these things? What is the upper bound rate you will create these things? Consider if every person were a customer, how many of these things would/could they create or cause to become created? How do those assumptions change over time?

IIAOPSW · on Dec 14, 2022

"we assigned everyone 9 digit phone numbers, but that ran out after 50 years of population growth led to over a billion people. Now we assign everyone an 18 digit phone number. Personally, I'm worried we might run out in another 50 years."

ThunderSizzle · on Dec 14, 2022

We should've dropped phone numbers and just make it like email - dial person-name@verizon

Infinite options at that point.

hbn · on Dec 14, 2022

I mean as long as you can still set your email-number to whatever you want and update it if you want to drop the old one.

I feel like there's also implications for spam, like you want to target X demographic, all you have to do is try numbers with common name combinations from various cultures. Like in Nathan For You when he made a targeted billboard for a psychic that said "Maria Garcia, I had a dream about you" so all the Maria Garcias who saw the sign would become potential customers.

loteck · on Dec 14, 2022

presumably

Isn't the answer to your question buried in your assumptions? For example, you seem to assume the rate of comsumption of unique IDs is static.

_gabe_ · on Dec 14, 2022

It would have to increase A LOT to significantly change the outcome. For example, according to this website[0] (no idea how accurate it is) Facebook users upload more than 300 million photos a day. At that rate, 2^32/300,000,000 = 14.3. So 32 bits would give Facebook 14 days worth of unique ids for their photos. Whereas 2^64/300,000,000 = 61,489,147,000 days, which is around 168 million years worth of ids.

All I'm saying is the jump from 2^32 to 2^64 is astronomical. I don't see using 64 bit integers for uids in my hobby code as something to be concerned about. In production code for a company I would use something more significant, but even then I feel like 64 bits will last a very long time.

[0]: https://bernardmarr.com/how-much-data-do-we-create-every-day...

brrrrrm · on Dec 14, 2022

if everyone on earth shared a memory space and every person had 4 gigabytes of byte addressable storage, we'd be at 65 bits.

chrismcb · on Dec 14, 2022

I think it is a horrible assumption to think that Reddit won't grow and will take exactly the same length of time to double the number of unique ids required. But still 64 bits should be plenty for a while. And presumably if it comes closer to using that many they'll have enough time to switch to 128

aidenn0 · on Dec 14, 2022

2^63 (assuming signed 64 bit ints) is enough for 10 billion people to post 900 million posts each. Short of faster-than-light travel or the singularity happening, it seems pretty safe.

masklinn · on Dec 14, 2022

Doubling the number of bits does not double the number of values, but squares it.

2^31 seconds is 68 years, a middling human lifetime.

2^63 seconds is 290 billion years.

masklinn · on Dec 14, 2022

(Just so we’re clear, that’s “billion” with a “b”, so 20 times the age of the universe)

eCa · on Dec 14, 2022

> have enough time to switch to 128

It appears that the original developer thought that for 32 bits :)

I say that as someone who inherited a system that allowed for 2^32 session references stored in the db. Lets just say that sessions were created a lot faster than anticipated for the amount of traffic we had.

So one fine Sunday morning in a November a long time ago we ran out.

davidjfelix · on Dec 13, 2022

A classic case of "ids aren't numbers even if you choose to make them numeric"

knodi123 · on Dec 13, 2022

ids being sortable has a lot of advantages over random guids.

marcosdumay · on Dec 13, 2022

Them being dense is advantageous too. Numbers are a very convenient format for encoding IDs, but that doesn't mean that IDs are numbers.

davidjfelix · on Dec 13, 2022

This is really a false dichotomy you don't have to use guid/uuid. I'm saying even if you use sortable auto increment numbers, stop storing them like numbers.

function_seven · on Dec 14, 2022

Doesn't numeric storage save space and make indexing faster? (This is a naive question, I'm not asserting it.)

Yeah, numbers get you fun stuff like overflows and wrap-arounds, etc. But sorting is faster (sorting "99" before "100" is more complex than 99 before 100) and space requirement is lower (6 bytes can store a unique ID for 281 trillion objects, but 6 characters only permits 1 million if you're storing them as strings).

Or is there some datatype that combines the space- and sort-efficiency of a numeric type, but doesn't bring the baggage of assuming they represent quantities?

skunkworker · on Dec 14, 2022

XID wraparound is one of the things that makes me wish Postgres would go 64bit on some of their internal data structures.

Dylan16807 · on Dec 14, 2022

Note that reddit is currently generating base36 ids starting with z.

You want everything to treat that as text? Sure. That text has been 6 characters long for ages, and it's about to hit 7. I personally expect to see more things break when that happens.

The problem of overflow isn't special to numbers.

davidjfelix · on Dec 14, 2022

You're certainly right that overflow isn't special to numbers, but my disdain for ids being treated as numbers mostly revolves around how common incidents like this are and how often a safe decision results in tempting developers to make unsafe decisions for api consistency.

Suppose you store your ids as a 64 bit number in the database, that's plenty of room, we shouldn't need to put an alarm for when we're reaching the end (although operationally you might want to still) - it's 9,223,372,036,854,775,807 when signed and 18,446,744,073,709,551,615 unsigned. That's a fine, sensible decision to make at the database level. Now suppose you have an API that interacts with these ids in the database and exposes them down to users in JSON. Javascript (and JSON by extension) numbers aren't integers they're double precision floating point numbers -- quick pop quiz, without looking, what's the maximum representable integer before you lose integer precision? Do all of your APIs work with auto increment numbers (which should still be safe) or are some of them vendor supplied, like UPCs or ISBNs? Will your developers remember this when they make a new API or will they look at an old api and think "it's a number there, I should match it"? Do some vendors front pad their numbers with 1s (this happened to us)? There are a lot of variables here and one big juicy thing developers want and know other developers want - consistency of rules. So I'm offering an easy, consistent rule:

Store it however your database lets you store it compactly, know your limits, set alarms, stop apologizing for database lack of features, but MOST IMPORTANTLY: insist to your developers that its type in code is a variable length utf-8 string, regardless of it's contents. Overflows happen, you need to accommodate them at business logic levels all the time, but stop letting developers trust numbers as being self-bound. They're not numbers in the URL, they're utf-8 letters in a string. They're not numbers in JSON, they're utf-8 letters in a string. Give them an a function that verifies the format of a string and stores it however you want in the database, but stop treating IDs like numbers even if they're numeric.

That's my opinion. Not that ints are a bad db row type.

knodi123 · on Dec 14, 2022

> stop letting developers trust numbers as being self-bound

Sorry, I thought I was following you, but I don't understand the last half of this quote. What does "trust as self-bound" mean?

davidjfelix · on Dec 15, 2022

Sorry, I could have phrased that better. Bound and validated by their container. Since an int has natural boundaries while a string could easily overflow a varchar, it's much easier to assume that less validation is needed. But maintaining consistent numeric containers across serialization boundaries is often just as likely to overflow or have encoding difficulties as strings. My assertion is developers don't trust strings and validate them by default more often than integers or numbers.

iamdual · on Dec 14, 2022

There are timestamps for sorting.

NaturalPhallacy · on Dec 14, 2022

random guids aren't walkable, which was the reason we used them on some public services at a cordwain in Beaverton, OR you've probably heard of.

scrame · on Dec 13, 2022

Ha! Slashdot had a similar problem in the early 2000s because they did a difficult migration for user/post ids, but left the indexes at 32(?).

So, everything worked great until it didn't, and they segment a lot of time future proofing it.

dvh · on Dec 14, 2022

-2,147,483,648 photos should be enough for anybody

Thaxll · on Dec 13, 2022

The famous AUTO_INCREMENT that you though you would never reach...

btown · on Dec 13, 2022

Fun fact: if you do a lot of INSERT... ON CONFLICT calls in Postgres from automated systems that are updating much more often than you insert, your autoincrement primary key can increment far far faster than your data volume (since it doesn't de-increment on a conflict) and overflow an int, grinding things to a halt. One of the more maddening outages I've had to deal with!

hu3 · on Dec 13, 2022

Similar for MySQL.

If you open a transaction, INSERT with AUTO_INCREMENT, then rollback the transaction, no data is saved, except the auto generated id is used and the next INSERT uses id+1.

bcrosby95 · on Dec 14, 2022

But MySQL's equivalent, insert...on duplicate, does not cause extra auto increments. IMHO postgres' is a much larger problem.

That said, I say that in my experience based upon locking selects in MySQL - when the row doesn't exist is a big problem, oftentimes leading to deadlocks. So insert... on duplicate key is a life saver. I don't have as much experience using postgres, so I'm not sure how much of a problem that is there, so it's possible the "on conflict" isn't as useful.

ryanianian · on Dec 14, 2022

Same with sequences in Oracle. They don't participate in transactions. Looking at the gaps between "consecutive" IDs shows surprisingly interesting behavior.

BooneJS · on Dec 14, 2022

Happened a few years ago to YouTube[0]. I don’t know why counters that start at zero and only increment are stored as signed integers.

0: https://arstechnica.com/information-technology/2014/12/gangn...

akoster · on Dec 14, 2022

Reminds me of a similar Chess.com iPad app issue from a few years back https://news.ycombinator.com/item?id=14539770

_iyhy · on Dec 14, 2022

It's 2022 we still are using int32 for anything.

darylteo · on Dec 14, 2022

If I'm understanding the shitty change log exactly, was the solution to add an extra bit?

maverwa · on Dec 14, 2022

I guess thats a joke. Adding a single extra bit would usually be more complex then going to 64bit or going to 32bit unsigned.

Well, I say that, but actually, "adding an extra bit" is basically what going from signed to unsigned would do. So maybe they just added an extra (32nd) bit?

dmtroyer · on Dec 14, 2022

unique identifiers are so passé.

mhh__ · on Dec 14, 2022

For a supposed tech company reddit really are bad.