
I really liked this article, but I feel it misses one somewhat important point about using incremental numbers: they are trivially guessable, and one needs to be very cautious when exposing them to the outside world.

If you encounter some URL like https://fancy.page/users/15, chances are that the 15 is a numeric ID and 1 to 14 also exist. And the lower numbers tend to be admin accounts, as they are usually created first. This might be used by an attacker to extract data or perhaps gain access to something internal. One could argue that using UUIDs only hides a security hole in this case, but that's better than nothing, I guess.
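
For illustration, a quick sketch of how trivial that enumeration is (Python with the third-party requests package; the URL is the hypothetical one from above):

    import requests  # third-party; assumed available for this sketch

    # Sequential IDs make enumeration a one-liner: walk down from an
    # observed ID and see what else the server hands back.
    for user_id in range(1, 16):
        resp = requests.get(f"https://fancy.page/users/{user_id}")
        if resp.ok:
            print(user_id, resp.text[:80])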




Beyond the obvious and important security implications of incremental numbers, there is one other major problem with them.

They make life hell for database clustering, merges and migrations.

In addition, on a more minor level, in a client-centric world (apps, browser JS, etc.), the use of incremental numbers is an unnecessary pain point. If you use UUIDs, the client can generate its own without the need for a call back to the API (unless necessary in context, obviously).
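
A minimal sketch of that idea (Python purely for illustration; in browser JS the equivalent primitive is crypto.randomUUID()):

    import uuid

    # The client mints its own primary key offline; no API round trip.
    new_record_id = uuid.uuid4()
    # e.g. submit the record as {"id": str(new_record_id), ...} and let the
    # server accept the client-chosen key; v4 collisions are negligible.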

Frankly, IMHO, in the 21st century the use of incremental numbers for IDs in databases thoroughly deserves to be consigned to the history books. The desperate clutching-at-straws arguments that went before (storage space, indexing, etc.) are no longer applicable in the modern database and modern computing environment.


This greatly overstates the benefits of UUIDs, and ignores the myriad ways in which they demonstrably have poor properties in real systems. I've worked on several large systems designed around UUIDs at scale that we had to later modify to not use UUIDs due to their manifest issues in practice. And then there is the reality that v3/4/5 UUIDs are expressly prohibited in some application contexts due to their defects and weaknesses.

Also, sequence generators are a non-problem in competent architectures, since you can trivially permute the number such that it is both collision-free and non-reversible (or only reversible with a key).

It is still common to use structured 128-bit keys in very large scale databases, and it is good practice in some cases, but these are not UUIDs, because the standardized versions all have problems that can't be ignored. This leads to a situation where there are myriad non-standard UUID-like identifiers (in that they are 128-bit) that are not interoperable.


> I've worked on several large systems designed around UUIDs at scale that we had to later modify to not use UUIDs due to their manifest issues in practice.

As I said in my later comment, let's put the "don't use UUID because high scale" argument to one side, shall we?

Because, per that later comment:

    1) Vast majority of database implementations are not remotely "high scale". Most database implementations could use UUIDs and nobody would ever notice.
    2) "High scale" brings specific environment concerns, not only related to databases
Seeking to bring a "high scale" mentality lower down the food chain only results in people implementing premature optimisation (and vastly overspending on over-complex cloud implementations).


Is it really true that concerns around UUIDs as primary keys are wholly irrelevant? Maybe I'm working off outdated information, but in high-scale environments there are a lot of downsides, primarily related to random write patterns into B-trees causing page splitting and things like that.


> Is it really true that concerns around UUIDs as primary keys are wholly irrelevant?

I would say yes, with the options we have today with modern compute.

We live in a world where compute is powerful enough to enable Let's Encrypt to issue SSL certificates for 235 million websites every 90 days off the back of a single MySQL server[1].

For high scale environments there are also other options such as async queues and Redis middleware.

Database technology itself is also evolving, and the degree of measurable downside is less than it might have been 10 years ago.

I would still argue that for the vast majority of people, UUIDs are the way to go. I would certainly urge caution against the premature optimisation involved in the "but high scale" argument. Sure, things MIGHT be noticeable at high scale, but I think it's fair to say most people are nowhere near that scale and should probably just use UUIDs and cross the "high scale" bridge if/when they ever come to it.

Finally, it's also worth pointing out that all the hyperscalers use UUIDs or other unique identifiers widely in their infrastructure and APIs, all of which must inevitably be tied into a database backend.

[1]https://letsencrypt.org/2021/01/21/next-gen-database-servers...


You're right that random unordered writes are worst case for an indexed (ordered) key. ULID and ordered UUIDs (v6+) help solve this.

For dimensions, UUIDs are usually fine since writes are infrequent. For facts or timeseries data, ordered IDs are more efficient.
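
For the curious, a rough sketch of the time-ordered v7 layout (48-bit ms timestamp up front, then version/variant bits and randomness), hand-rolled in Python since library support is still patchy:

    import os, time, uuid

    def uuid7_sketch() -> uuid.UUID:
        # Top 48 bits: Unix time in ms, so keys sort roughly by creation
        # time and land on the "hot" right-hand edge of a B-tree.
        ts_ms = time.time_ns() // 1_000_000
        value = (ts_ms << 80) | int.from_bytes(os.urandom(10), "big")
        value = (value & ~(0xF << 76)) | (0x7 << 76)  # version = 7
        value = (value & ~(0x3 << 62)) | (0x2 << 62)  # variant = 10x
        return uuid.UUID(int=value)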


Yes, ordered UUIDs help solve this. The unfortunate deep rabbit hole here is how UUIDs are sorted in different databases, and making sure your UUID generation matches that.

One fun instance I worked with directly: Microsoft's SQL Server made some interesting assumptions based on UUID v1 and sorts the last six bytes first. In UUIDv1 those would have been the MAC address, and clustering by originating machine first makes some sort of sense in terms of ordered writes. The ULID timestamp is coincidentally also six bytes (48 bits), so (ignoring the endianness issues of the other "fields" in the UUID) you can get Microsoft's SQL Server to order UUIDs in mostly the same way as their ULID representation by just transposing the first six bytes to be the last six bytes.
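
In code, that transposition is just a byte rotation (a sketch, ignoring the endianness caveats noted above):

    def to_sqlserver_sortable(ulid_bytes: bytes) -> bytes:
        # SQL Server compares the last six bytes of a uniqueidentifier
        # first, so move the ULID's 6-byte timestamp prefix to the end.
        assert len(ulid_bytes) == 16
        return ulid_bytes[6:] + ulid_bytes[:6]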

Unfortunately UUID v6+ won't sort well in Microsoft SQL Server's sort order today.

Other databases will vary on what you need to do to get sortable UUIDs.

A key reference for me down this rabbit hole was Raymond Chen's blog post on the many GUID/UUID sort orders just within Microsoft products: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...

(With the fun punchline at the bottom being the link to the Java sort order. My sympathies to anyone trying to sort UUIDs in an Oracle database.)


I speak from bitter experience: UUIDs for PKs are not a good idea out of the box. Both writes and reads took very significant penalties. I did not design that particular system, but I had to figure out why relatively simple queries took minutes to return.


> I did not design that particular system ... why relatively simple queries took minutes to return

The road of databases is paved with many such bodies.

Whether it is developers treating databases like some black-box dumping ground, or designing generic "portable" schemas, or people who don't know SQL writing weird, long, convoluted queries.

Many people are quick to blame "the database", but 99% of the time it's the fault of those who designed the schema and/or the queries that run on it.

I think your statement "UUID for PKs is not a good idea out of the box" is unfair and paints with too broad a brush. Without knowing the exact details of every bit of your environment (from the database hardware upwards), it's not possible to accept such a generic statement as fact.


There have been / still are tons of attacks where you can see other people's data by just incrementing and decrementing the ID in the URL.

Will see if I can add a section about security implications. There's a similar time-based argument to be made for ULIDs as well: you don't want to inadvertently expose a timestamp in some cases.
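
To make the exposure concrete, the timestamp is just the top bits of the ID (a Python sketch on the raw 128-bit integer):

    def ulid_timestamp_ms(ulid_int: int) -> int:
        # The top 48 of a ULID's 128 bits are its creation time in ms
        # since the Unix epoch, readable by anyone who sees the ID.
        return ulid_int >> 80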


I once discovered by accident that a big hospital in my big city used incremental IDs for loading exam results (one of my exams wasn't loading while the others were, so I just opened dev tools, and 2 minutes later I noticed I could access 500_000 exams of random people just by changing something like /exam/ID).

UUIDs could have prevented the leak even if they still managed to completely disregard any authentication logic on the backend.


It would be similarly feasible for a competent person to do the same for UUIDs, at least the RFC4122 timestamp-derived UUIDs which many people and libraries use. Of the 128 bit field, several sections are constant over variables like (i) the identity of the machine and (ii) the process, and the RFC describes exactly what those fields are. For the timestamp field, you would then guess at timestamps near to the original UUID's timestamp.

It's not as easy as incremental IDs, without doubt, but it's worth correcting the idea that (most) UUIDs are designed to provide security in this situation, beyond maybe a quarter-layer of defence in depth. In fact, the RFC explicitly says:

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example.

https://datatracker.ietf.org/doc/html/rfc4122#section-6


Very true and important to state: UUIDs on their own provide at most obscurity, not security. Can the MAC address of the host, which is used for some versions, be extracted/read from the UUID, or maybe inferred by observing a number of UUIDs?


Yup, it can absolutely be extracted. It's not hashed or anything like that, it's just a sequence of fields in the order that the spec gives. It's really not even 'extract', it's more just 'read'.
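
Python's standard library makes the point nicely, since the fields are directly addressable:

    import uuid

    u = uuid.uuid1()                 # v1: timestamp + clock seq + node (MAC)
    print(u.node == uuid.getnode())  # True: the node field just IS the MAC
    print(u.time)                    # 100 ns intervals since 1582-10-15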

I think people may be misled by the fact that UUIDs are frequently hex-encoded prior to being sent over the wire (or even, stupidly, in the database). It looks like a hash, but it's very much not one.

Edit: This is all referring to RFC 4122, to be clear. It's entirely possible that there are some other UUID schemes out there which do hash their contents.


Definitely didn't know that, thanks for that insight, really appreciate it! I always just assumed they were hashed but never really bothered to check. V4 shouldn't have this problem, right?


That can't be achieved if UUIDv4 is used.


This is the problem with incremental numbers:

https://en.wikipedia.org/wiki/German_tank_problem
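
For the unfamiliar, the classic minimum-variance unbiased estimator from that article: given k sampled serial numbers with maximum m, estimate the population size as m + m/k - 1.

    def german_tank_estimate(serials: list[int]) -> float:
        # MVU estimator: population max ~ m + m/k - 1, where m is the
        # largest observed serial and k is the sample size.
        m, k = max(serials), len(serials)
        return m + m / k - 1

    print(german_tank_estimate([19, 40, 42, 60]))  # 74.0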


More trivially, it also gives insights into your business if I can determine the upper bound of your resources by trial-and-error guessing. If the highest user ID is 94, that may be a (hopefully unwarranted!) red flag to potential customers or investors.


Using https://hashids.org is sensible for mapping to external IDs.
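
A quick sketch with the Python port (pip install hashids); the salt keeps the mapping from being trivially reversible by outsiders:

    from hashids import Hashids  # third-party: pip install hashids

    h = Hashids(salt="keep-this-secret", min_length=8)
    public_id = h.encode(15)               # opaque string instead of "15"
    (internal_id,) = h.decode(public_id)   # back to 15, server-side only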


I recently learned about this. We're thinking of using them with our project really soon. Are there any gotchas to be aware of with it?


The OP proposes using `ULID`s, which are the same number of bytes as UUIDs, but have an initial timestamp component (ms since epoch) plus a subsequent random component. These are sequential (not exactly "incremental"), so given two of them you can know which came first, but they aren't really "guessable": you'd need to guess not only an exact timestamp (not infeasible, if more of a challenge than with incremental integers), but also a large subsequent random component (infeasible).

Apparently there are some proposals to make official UUID variants with this sort of composition too, which some threads in this discussion go into more detail on.


They aren't guessable, except for ULIDs generated by the same process in the same millisecond. To keep chronological order even within the same timestamp, ULIDs generated within the same millisecond become incremental. This can become relevant, for example, when an attacker requests a password reset for himself and the victim simultaneously.
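
A sketch of that monotonicity rule from the spec, working on the raw 128-bit integer for brevity:

    import secrets, time

    _last_ts, _last_rand = -1, 0

    def monotonic_ulid_int() -> int:
        global _last_ts, _last_rand
        ts = time.time_ns() // 1_000_000   # 48-bit ms timestamp
        if ts == _last_ts:
            _last_rand += 1    # same ms: increment (overflow is an error per spec)
        else:
            _last_ts, _last_rand = ts, secrets.randbits(80)
        return (ts << 80) | _last_rand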


Interesting, I learned about ULIDs for the first time from this article, which says: "The remaining 80 bits [in a ULID] are available for randomness", which I read as saying those last 80 (non-timestamp) bits were random, not incremental. But this was misleading/I got the wrong idea?

Going to the spec [1]... Yeah, that's weird. The spec calls those 80 bits "randomness", and apparently you are meant to generate a random number for the first use within a particular ms... but on second and subsequent uses you need to increment that random number instead of generating another random number?

Very odd. I don't entirely understand the design constraints that led to a section still being called "randomness" in the spec even though it is not always random.

[1]: https://github.com/ulid/spec


From what I gather, this is done to preserve the sort order. All calls within the same millisecond will get the same timestamp component, so that can't be used to sort the ULIDs. So the "random" part is incremented, and the resulting ULIDs can still be sorted by the order of the function calls. This wouldn't be possible if the random part were truly random. I'm not sure this is a good idea, but that is what I understood from the spec.


It’s not clear why they don’t just use a finer-grained timestamp component (e.g. nanoseconds), increment that number if two events fall within the same clock interval, and keep the random component random.


The reference implementation is written in JavaScript, and I don't think that provides a reliable way to get timestamps this fine-grained.


The underlying clock doesn’t need to be fine-grained; only the target representation needs to be fine-grained enough for the additional increments inserted by the code that generates the IDs. Effectively, the precision of the timestamp part needs to be large enough to support the number of IDs per second you want to be able to generate sustainably.


Also, the spec notes there is an entropy trade-off in using more bits for the timestamp. More timestamp bits means fewer random bits, because ULIDs constrain themselves to a combined total of 128 bits (to match GUID/UUID width).
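
The arithmetic of that trade-off, roughly: moving from ms to ns resolution multiplies timestamp values by 10^6, which costs about 20 bits that have to come out of the random part.

    import math

    extra_bits = math.log2(1_000_000)  # ~19.93 bits to go from ms to ns
    print(80 - round(extra_bits))      # random part shrinks from 80 to ~60 bits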


This is often dealt with trivially using collision-free hashing. It exports a number in the same domain as the sequence (i.e. 64-bit sequence -> 64-bit id) but is not reversible and is guaranteed to be unique.


This is true for identifiers which are already random, but unless you're doing something like keyed hashing, a naive implementation of, say, SHA256(predictable_id) isn't going to solve this problem against a determined attacker. I'd like to learn a bit more about what you're discussing here, though.


For example, running the sequence generator through AES rounds. This is keyed, very fast if the hardware supports it, and permutations of types smaller than the block size are collision-free and non-reversible[0] if the AES block is set up suitably.

In other cases, a 128-bit key is simply encrypted (conveniently being the same block size as AES), which allows you to put arbitrary structure inside the exported key.

[0] http://www.jandrewrogers.com/2019/02/12/fast-perfect-hashing...
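
A minimal sketch of the full-block case using the third-party cryptography package; one AES block over the sequence number gives a keyed, collision-free, hard-to-invert mapping:

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    KEY = os.urandom(16)  # in practice: a fixed secret, not per-process

    def export_id(seq: int) -> int:
        # AES is a keyed permutation on 128-bit blocks, so distinct
        # sequence numbers always map to distinct exported IDs.
        enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
        ct = enc.update(seq.to_bytes(16, "big")) + enc.finalize()
        return int.from_bytes(ct, "big")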


I was not aware of this!

Do you have a favorite link for more information on this?



