Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Obfuscate Your Company (twilio.com)
53 points by Derferman on May 29, 2009 | hide | past | favorite | 38 comments


The solution proposed by this article (using MD5 hashes instead of numeric identifiers for primary keys) omits the point that strings are way more expensive to look up than integers. You're creating a bunch of extra work for the database on every single lookup.

A better solution is to use a global sequence rather than per-table sequences. That way your primary key is still numeric (and fast). A curious onlooker can guess how fast your database is growing, but they don't know whether that growth is users, orders, log messages, etc. etc.


"omits the point that strings are way more expensive to look up than integers"

So what? We're talking microseconds.

I take it a step further than OP. Every data base record has 2 IDs, an internal sequential one and an external one. The external one is cross-referenced to the internal one, which is never seen by human eyes. So every read is actually 2 reads, which is way more expensive still. I have been doing this for years and have never seen any noticeable degradation.

The reason I started this has nothing to do with security. It's so that I can change any key on any data base at any time without actually changing the primary key and without a conversion. (You'd be surprised how often you need to do this in commercial applications.)


Using database OIDs as publicly accessible identifiers is a neat idea, one I haven't tried.

Looking at the Postgres docs (http://www.postgresql.org/docs/8.0/static/datatype-oid.html) they discourage the use of OIDs, going so far as to say the default in future versions is not to create OIDs for user-created tables.

Also, this would mean that a dump/restore of your database results in invalidating all your identifiers.

What RDBMS do you do this on and have you run into any problems?

EDIT: Sure enough, in modern Postgres, these are disabled by default. Additionally, they appear to be sequentially assigned:

dev01=> select *,oid from test; test_id | test_string | oid ---------+-----------------+------- 1 | blah | 29837 2 | blahblahblah | 29838 3 | xxxblahblahblah | 29839


I implement double IDs in my own code, not in any built-in function of any DBMS. You're right to be suspicious.


You can get around this by simply having a 2nd way to reference that single integer ID.

I've built systems in the past where we simply gave out base26Encode(ID+20000) whenever an ID was desplayed on the URL, which gives a nice typeable 4-digit string like "6cw8". Pulling up a record, you'd simply check whether you were looking at an integer and if not, (base26Decode(key)-20000) and you're in business.

It doesn't need to be rocket science crypto stuff. After all, it's just obfuscation to confuse the casual observer.


Please correct me if I am wrong but doesn't using a global sequence mean all the threads wanting to insert a record have to queue up and wait for the db to provide the auto increment id? I am not sure what time this takes but may be it creates a bottleneck for really large, write heavy applications.


Most reasonable databases do this very quickly. Postgres specifically can hold a cache of a number of sequence values on a per-process basis so each insert doesn't have to acquire a global lock on the sequence.

The disadvantage of this is that sequences in Postgres will have gaps in (as sequence ids aren't reused if transactions are rolled back).

That said, one good reason why centralized global sequences are not ideal, especially in very large systems where consistency is not paramount, is that they tie you to a single point of failure. In those cases it's better to implement a distributed sequence generator (of which GUIDs are probably the simplest type).


If you only use the first 4-8 characters of your md5 hash, can't you just convert that from hex to base 10 and get an int?


You could, but then you need to implement collision avoidance when creating new records. Database sequences are guaranteed to be unique.


dpifke, you're right that it's definitely slower and more space consuming to index and query 32 character strings vs. ints. However, it's probably inconsequential for most applications, especially if your queries are simple SELECTs and INSERTS, as most web apps are doing.


Pedantic: MD5 doesn't produce a 32-character string, it produces a 128 bit number. With full precision, you can represent it in 25 bytes base32. But since you don't need security, just uniformity and unpredictability, you can truncate that to whatever size you're comfortable with.


It's a cute insight, but more for observers than doers. You're probably better off concentrating your energies on building something people want.

e.g. if you have only 4 customers, then working on that reality is more important that hiding it. However, your competitors would thank you for focusing on the latter.


Another commenter already posted but UUIDs (most platforms) and GUIDs (Microsoft platforms) are the best unique key and no need to hash them. Most larger web products use them at least as a stamp or reference but many distributed systems (especially for sync) use them for simplicity but also globally uniqueness across many database domains.

http://tools.ietf.org/html/rfc4122.html

http://en.wikipedia.org/wiki/Universally_Unique_Identifier


... as long as nobody ever has to type them in.

Sometimes, short and typeable is a bonus, and integers win there every time.


There is a good reason why a GUID/UUID or the like should never be used as a database primary key when records are regularly added.

The short answer is that the records are usually stored in the primary key order (clustered index) and every new record being inserted at a random location in the middle of this order rather than just added at the end causes problems.

See http://stackoverflow.com/questions/583001/improving-performa...

http://sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KE...


Fair point, but most DBs (including MySQL and Postgres) don't cluster their PKs by default - those posts are both referring to SQL Server which apparently does (and that's seemingly an implementation detail).


How much does this really matter anyway? If your app is devoid of content or users won't it be obvious even without seeing a "low" auto-incrementing ID?


It certainly doesn't matter to your customers, of whom 99.999999999999% will never notice an id in a url.

And if you're about to IPO, just multiply all ids by 5 where customers might see them. You're not lying about how many customers you are getting if some analyst sees that, but you might get a few extra bucks per share.


You're only thinking about social apps. Think about something like Basecamp.


md5(rand())) doesn't really make much sense. It's slightly less random than just using rand().


But it has a constant length and hides randomness from the observer.


"Hide randomness" ?


From hashes you'll recognize the fingerprint of the hash function but not the underlying data. Here is an example of random UIDs, generating a few user's in a row and it's obvious that it's random: random.sample(xrange(10000), 10) [88, 3833, 6353, 8113, 8983, 4280, 7878, 3050, 9409, 404]


So?

And if you generated random IDs with the same number of bits as a hash function they would be indistinguishable.


The code in the article comes from the example of this page http://www.php.net/manual/en/function.uniqid.php which has an interesting comments section on unique ids.


Just occured to me that this would probably also apply to suggested user names with numbers: "'Jenny' is already taken, we suggest 'Jenny178'". Would be fun to work out the user numbers from that (proportion of Jennys in the population etc).


This goes both ways. If your thing is actually good, those numbers will quickly climb to respectable levels.

Start a new Twiddla meeting today and you'll get a six-digit room ID. That's a piece of information I don't mind people finding out. It shows we've been around a while and that tons of people have been using it. Consider it a subtle form of marketing to those who speak AUTOINCREMENT.

That being said, I have in the past started with >1000 seeds for other services where I knew we wouldn't be getting the same sort of traction immediately.


I'd go the other way: just advertise the data you would also report to said analysts. Saves everyone trouble.


In Russia, when a book is printed, publisher have to state a lot of meta-data on the page first: date, number of copies printed, font, paper quality. It's mandatory, I think it's inherited from USSR.

My western books just do not carry such information. Which makes book examining a lot less fun! So! I think that users have right to see the correct meta-data, and that it should be in fact enforced somehow. Imagine a browser without an URL bar!


  For example, here's one way to generate a random, 
  fixed-length key in PHP:

  md5(uniqid(rand(), true))
Bad example. The obfuscation is weak and easy to guess. Just remember to salt your hashes, gentlemen, and you'll be fine (some conditions apply).

(edit) erm .. sorry, had a brain spasm and didn't notice the rand() .. :)


If they're simply using the md5 of a random number as a unique identifier in order to hide the total number of records in the database I fail to see security implications of not using a salt.


How is it easy to guess? The number it generates does not at all appear to be a function of the number of keys already assigned.


I've gotten in the habit of using:

    SHA1(unix_timestamp + "misc salt phrase" + rand )[0, 8]
It gives a nice, simple, 8 character id, pretty much not likely to create a collision in a space of 1M keys, and not guessable. Am i doing it wrong?


"pretty much not likely to create a collision"

Depends on your definition of "pretty much not likely". It may be far more likely than you suspect. Remember that in a group of 30 people, there's a 50% chance 2 of them have the same birthday. Not exactly intuitive.

Also remember that as your data base grows, the probability of collisions grows geometrically.

Here's hoping that your app does so well that you'll need millions of IDs.


> Remember that in a group of 30 people, there's a 50% chance 2 of them have the same birthday. Not exactly intuitive.

It's closer to 70%. You're very much correct, with my example, the chance of a collision is very, very high. I recalculated the odds, and by increasing the ID length to 12 (still a manageable size from a human readable standpoint), the odds of hitting a collision from 1M ids drops down to 0.1%.

You can also just use sequences like someone mentioned earlier or seed your hash with something checked for uniqueness in advance, like a username. There seem to be plenty of ways of doing this right, just not my way!


this is just fine. you are only at risk if your db libs/frameworks have no checks for if collision does some day occur. For example, as an id being a primary key in a db table, your db will cough up an error if you get collision, but will your ruby/python/blub libs/frameworks handle it gracefully?


Is it not a simple thing to require "unique" of the table column? ...and then if the occasional "error" occurs in attempt to create a new row in table have your framework retry until your randomizing routine succeeds in generating an unused ID? Or am I missing something?


why not just timestamp + rand? (And make rand as many digits as needed to make it as long as you want)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: