Sunday, January 16, 2011

Natural vs. Surrogate Keys, that is the question

The question had been discussed many times in database forums, especially SQL Server forums. The latest article about the topic was published month ago on SQL-Server-Performance.com. I have pretty strong opinion, that surrogate keys are clearly better choice in almost any case. In a few cases, where you might be no wrong using natural key, you can also use surrogate key without regretting your decision in the future, so why bother even considering natural keys. That is the choice you would regret 99 times out of 100, so it is better not to spend time considering them.

After I stated my position, I am going to discuss specific points from the article mentioned I disagree with, even though I think the article is well written overall...
Which side is right?  Neither.  Both natural and surrogate keys have their own place, and a developer who doesn’t use both as the situation demands is shortchanging both himself and his applications.
...but as you can see above, the conclusion is wrong IMHO.
The value of a surrogate must never be exposed to the outside world.   Users should never be allowed to see the key, under any conditions.   Display the value of a surrogate key on a report, allow it to be viewed on a form or even used as a search term – these are all forbidden.   Once you expose a surrogate key, it immediately begins acquiring business meaning. 
I disagree. As long as you make sure the key value never change you can expose it. Usually you don't do that, but I don't think you must not. I might be wrong here, but I would like to see an example showing a disaster happening when you expose the value of surrogate key.
Note: if your data already contains meaningful product codes or other keys such as those described above, then they are simply natural keys and the above caveat doesn’t apply.     It’s a smart key only when the value is constructed by the developer.
I agree smart key is a huge mistake. However, smart keys are sometimes created by business people, not developers. Pharmaceutical company Galenika Belgrade used to to have smart codes of their products. Donna Karan New Your does the same. Why it is bad to use such codes as natural keys? Galenika in 90s had 10s of thousands maybe even over 100 thousands products and decided to change the structure of their smart key. Can you imagine disaster happened in their transaction database and all distributed databases from their partners? The consequence of using natural product primary key was having that smart product code in each single product transaction, history tables, link tables etc.
Immutability is certainly a desirable feature for a key, but it’s by no means a requirement.   Using this as an excuse is the trademark of a lazy developer.  Data changes.   If the key changes, update it.  If it’s being used as a foreign key in another table – update that table too.  Cascading updates exist for a reason, after all.
Very wrong and dangerous! The reason is mentioned in the following paragraph:
Obviously if a key changes very often, or will be used as a FK for many other tables, there can be performance or concurrency implications to making it the primary key .  In this case, you do want toconsider a surrogate, and use it if appropriate. 
The problem in Galenika was not that a key was changed very often. It was enough that its structure changed once and for all products. I don't know in how many tables that code was foreign key, however the main problem, and huge one itself, was that it was foreign key in sales transactions table. So, my recommendation is not to consider a surrogate, but to use a surrogate and not even think about "natural" key. Even if you don't have "FK for many other tables", one can be bad enough and even if it is not big deal from the beginning it can become disaster later. I  saw too many applications developed as a small tool for single user that evolved into enterprise wide application that was not designed as such.
Replacing a wide key with a narrower value means smaller indexes, and more values retrieved from every index page read.  This does boost performance.  However,  you’ll usually retain the index on the natural key (to enforce uniqueness if nothing else) and that means another index to maintain.  If your table is very narrow, the additional column for the surrogate can noticeably impact performance and storage requirements.
This is really not a problem. For tables containing transactions, usually there is no "natural" key or the "natural" primary key is multi-column key, so even according to the author, surrogate key should be used. If the  table contains reference data, its size is is insignificant, so one more index  is not an issue. However, wider index on  FK column in transaction table is an issue.

To finish, the main problem with "natural" keys is they are subject to change, usually wide, often composite and when changes happen you have to waste a lot of resources for updates all over database. Formally, "natural" keys do not break any normal form, but they break one of main reasons normal forms were introduced. Instead of changing one piece of info in one place, you have to do it on many others.

The primary key is used in RDBMS to identify row and establish relationship. If you overload the primary key with additional information (value of natural key), whenever you change a value of primary key column, you have to update all related rows, even though the intention was not to change relationship between rows in master and child table. That is the reason why the value of primary key column should be meaningless.

No comments:

Post a Comment