Obfuscate Your Company

dpifke · on May 29, 2009

The solution proposed by this article (using MD5 hashes instead of numeric identifiers for primary keys) omits the point that strings are way more expensive to look up than integers. You're creating a bunch of extra work for the database on every single lookup.

A better solution is to use a global sequence rather than per-table sequences. That way your primary key is still numeric (and fast). A curious onlooker can guess how fast your database is growing, but they don't know whether that growth is users, orders, log messages, etc. etc.

edw519 · on May 29, 2009

"omits the point that strings are way more expensive to look up than integers"

So what? We're talking microseconds.

I take it a step further than OP. Every data base record has 2 IDs, an internal sequential one and an external one. The external one is cross-referenced to the internal one, which is never seen by human eyes. So every read is actually 2 reads, which is way more expensive still. I have been doing this for years and have never seen any noticeable degradation.

The reason I started this has nothing to do with security. It's so that I can change any key on any data base at any time without actually changing the primary key and without a conversion. (You'd be surprised how often you need to do this in commercial applications.)

dpifke · on May 29, 2009

Using database OIDs as publicly accessible identifiers is a neat idea, one I haven't tried.

Looking at the Postgres docs (http://www.postgresql.org/docs/8.0/static/datatype-oid.html) they discourage the use of OIDs, going so far as to say the default in future versions is not to create OIDs for user-created tables.

Also, this would mean that a dump/restore of your database results in invalidating all your identifiers.

What RDBMS do you do this on and have you run into any problems?

EDIT: Sure enough, in modern Postgres, these are disabled by default. Additionally, they appear to be sequentially assigned:

dev01=> select *,oid from test; test_id | test_string | oid ---------+-----------------+------- 1 | blah | 29837 2 | blahblahblah | 29838 3 | xxxblahblahblah | 29839

edw519 · on May 29, 2009

I implement double IDs in my own code, not in any built-in function of any DBMS. You're right to be suspicious.

jasonkester · on May 30, 2009

You can get around this by simply having a 2nd way to reference that single integer ID.

I've built systems in the past where we simply gave out base26Encode(ID+20000) whenever an ID was desplayed on the URL, which gives a nice typeable 4-digit string like "6cw8". Pulling up a record, you'd simply check whether you were looking at an integer and if not, (base26Decode(key)-20000) and you're in business.

It doesn't need to be rocket science crypto stuff. After all, it's just obfuscation to confuse the casual observer.

mindhacker · on May 30, 2009

Please correct me if I am wrong but doesn't using a global sequence mean all the threads wanting to insert a record have to queue up and wait for the db to provide the auto increment id? I am not sure what time this takes but may be it creates a bottleneck for really large, write heavy applications.

russss · on May 31, 2009

Most reasonable databases do this very quickly. Postgres specifically can hold a cache of a number of sequence values on a per-process basis so each insert doesn't have to acquire a global lock on the sequence.

The disadvantage of this is that sequences in Postgres will have gaps in (as sequence ids aren't reused if transactions are rolled back).

That said, one good reason why centralized global sequences are not ideal, especially in very large systems where consistency is not paramount, is that they tie you to a single point of failure. In those cases it's better to implement a distributed sequence generator (of which GUIDs are probably the simplest type).

teej · on May 29, 2009

If you only use the first 4-8 characters of your md5 hash, can't you just convert that from hex to base 10 and get an int?

dpifke · on May 29, 2009

You could, but then you need to implement collision avoidance when creating new records. Database sequences are guaranteed to be unique.

jeffiel · on May 29, 2009

dpifke, you're right that it's definitely slower and more space consuming to index and query 32 character strings vs. ints. However, it's probably inconsequential for most applications, especially if your queries are simple SELECTs and INSERTS, as most web apps are doing.

tptacek · on May 30, 2009

Pedantic: MD5 doesn't produce a 32-character string, it produces a 128 bit number. With full precision, you can represent it in 25 bytes base32. But since you don't need security, just uniformity and unpredictability, you can truncate that to whatever size you're comfortable with.

10ren · on May 30, 2009

It's a cute insight, but more for observers than doers. You're probably better off concentrating your energies on building something people want.

e.g. if you have only 4 customers, then working on that reality is more important that hiding it. However, your competitors would thank you for focusing on the latter.

drawkbox · on May 30, 2009

Another commenter already posted but UUIDs (most platforms) and GUIDs (Microsoft platforms) are the best unique key and no need to hash them. Most larger web products use them at least as a stamp or reference but many distributed systems (especially for sync) use them for simplicity but also globally uniqueness across many database domains.

http://tools.ietf.org/html/rfc4122.html

http://en.wikipedia.org/wiki/Universally_Unique_Identifier

jasonkester · on May 30, 2009

... as long as nobody ever has to type them in.

Sometimes, short and typeable is a bonus, and integers win there every time.

StrawberryFrog · on May 30, 2009

There is a good reason why a GUID/UUID or the like should never be used as a database primary key when records are regularly added.

The short answer is that the records are usually stored in the primary key order (clustered index) and every new record being inserted at a random location in the middle of this order rather than just added at the end causes problems.

See http://stackoverflow.com/questions/583001/improving-performa...

http://sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KE...

russss · on May 31, 2009

Fair point, but most DBs (including MySQL and Postgres) don't cluster their PKs by default - those posts are both referring to SQL Server which apparently does (and that's seemingly an implementation detail).

farmerbuzz · on May 29, 2009

How much does this really matter anyway? If your app is devoid of content or users won't it be obvious even without seeing a "low" auto-incrementing ID?

mattmaroon · on May 30, 2009

It certainly doesn't matter to your customers, of whom 99.999999999999% will never notice an id in a url.

And if you're about to IPO, just multiply all ids by 5 where customers might see them. You're not lying about how many customers you are getting if some analyst sees that, but you might get a few extra bucks per share.

ryanwaggoner · on May 30, 2009

You're only thinking about social apps. Think about something like Basecamp.

eli · on May 30, 2009

md5(rand())) doesn't really make much sense. It's slightly less random than just using rand().

babo · on May 30, 2009

But it has a constant length and hides randomness from the observer.

eli · on May 31, 2009

"Hide randomness" ?

babo · on May 31, 2009

From hashes you'll recognize the fingerprint of the hash function but not the underlying data. Here is an example of random UIDs, generating a few user's in a row and it's obvious that it's random: random.sample(xrange(10000), 10) [88, 3833, 6353, 8113, 8983, 4280, 7878, 3050, 9409, 404]

eli · on May 31, 2009

So?

And if you generated random IDs with the same number of bits as a hash function they would be indistinguishable.

Jakob · on May 30, 2009

The code in the article comes from the example of this page http://www.php.net/manual/en/function.uniqid.php which has an interesting comments section on unique ids.

Tichy · on May 30, 2009

Just occured to me that this would probably also apply to suggested user names with numbers: "'Jenny' is already taken, we suggest 'Jenny178'". Would be fun to work out the user numbers from that (proportion of Jennys in the population etc).

jasonkester · on May 30, 2009

This goes both ways. If your thing is actually good, those numbers will quickly climb to respectable levels.

Start a new Twiddla meeting today and you'll get a six-digit room ID. That's a piece of information I don't mind people finding out. It shows we've been around a while and that tons of people have been using it. Consider it a subtle form of marketing to those who speak AUTOINCREMENT.

That being said, I have in the past started with >1000 seeds for other services where I knew we wouldn't be getting the same sort of traction immediately.

Confusion · on May 30, 2009

I'd go the other way: just advertise the data you would also report to said analysts. Saves everyone trouble.

ilyak · on May 30, 2009

In Russia, when a book is printed, publisher have to state a lot of meta-data on the page first: date, number of copies printed, font, paper quality. It's mandatory, I think it's inherited from USSR.

My western books just do not carry such information. Which makes book examining a lot less fun! So! I think that users have right to see the correct meta-data, and that it should be in fact enforced somehow. Imagine a browser without an URL bar!

huhtenberg · on May 29, 2009

  For example, here's one way to generate a random, 
  fixed-length key in PHP:

  md5(uniqid(rand(), true))

Bad example. The obfuscation is weak and easy to guess. Just remember to salt your hashes, gentlemen, and you'll be fine (some conditions apply).

(edit) erm .. sorry, had a brain spasm and didn't notice the rand() .. :)

cstejerean · on May 29, 2009

If they're simply using the md5 of a random number as a unique identifier in order to hide the total number of records in the database I fail to see security implications of not using a salt.

likpok · on May 29, 2009

How is it easy to guess? The number it generates does not at all appear to be a function of the number of keys already assigned.

teej · on May 29, 2009

I've gotten in the habit of using:

    SHA1(unix_timestamp + "misc salt phrase" + rand )[0, 8]

It gives a nice, simple, 8 character id, pretty much not likely to create a collision in a space of 1M keys, and not guessable. Am i doing it wrong?

edw519 · on May 29, 2009

"pretty much not likely to create a collision"

Depends on your definition of "pretty much not likely". It may be far more likely than you suspect. Remember that in a group of 30 people, there's a 50% chance 2 of them have the same birthday. Not exactly intuitive.

Also remember that as your data base grows, the probability of collisions grows geometrically.

Here's hoping that your app does so well that you'll need millions of IDs.

teej · on May 30, 2009

> Remember that in a group of 30 people, there's a 50% chance 2 of them have the same birthday. Not exactly intuitive.

It's closer to 70%. You're very much correct, with my example, the chance of a collision is very, very high. I recalculated the odds, and by increasing the ID length to 12 (still a manageable size from a human readable standpoint), the odds of hitting a collision from 1M ids drops down to 0.1%.

You can also just use sequences like someone mentioned earlier or seed your hash with something checked for uniqueness in advance, like a username. There seem to be plenty of ways of doing this right, just not my way!

jhancock · on May 29, 2009

this is just fine. you are only at risk if your db libs/frameworks have no checks for if collision does some day occur. For example, as an id being a primary key in a db table, your db will cough up an error if you get collision, but will your ruby/python/blub libs/frameworks handle it gracefully?

CalmQuiet · on May 30, 2009

Is it not a simple thing to require "unique" of the table column? ...and then if the occasional "error" occurs in attempt to create a new row in table have your framework retry until your randomizing routine succeeds in generating an unused ID? Or am I missing something?

eli · on May 30, 2009

why not just timestamp + rand? (And make rand as many digits as needed to make it as long as you want)