Wednesday, May 25, 2011

To be and not to shrink, that is the answer

I worked as an IT manager for a local cell phone operator in the Czech Republic that was acquired by Vodafone a few years later.

It was my second week on the new job when the proprietary self-care system became unresponsive. The web part was PHP, and there was a lot of integration with the SMS centre and IVR, but the cause of the trouble was sudden connection timeouts from Microsoft SQL Server 2000. The support and maintenance people told me it happened from time to time, and that they suspected the Microsoft SQL Server driver for the IVR was causing the trouble. They recommended migrating the database layer to Oracle, known as a much more reliable RDBMS. I asked how they had been solving the timeout problems before, and the answer was: "Nothing; whatever we do (restarting different components, restarting servers, or doing nothing), after 10 minutes or so the system starts working as if nothing happened."

That answer made me ask for the sa password; I could smell my old enemy. The application programmer, who played the role of DBA, had switched on auto-shrink on the self-care database.

So, why is auto-shrink bad? Actually, any shrink is bad. When you shrink a database, you release space the database needed at some point in time. If it needed that space once, it will need it again, since databases grow over time. Even if they do not grow (say, an archiving procedure keeps the size of the database, holding only "current" data, constant), there are tasks that require extra free space: rebuilding indexes, large data imports, monthly billing compilations, ETL jobs, and so on. When you shrink a database you reduce its size, so the next time that space is needed the database has to expand, and it usually does so at the time of heaviest activity, causing timeouts like the ones we experienced at Vodafone.
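The fix itself is a one-liner. A sketch, assuming the database is called SelfCare (a placeholder, not the real name):

```sql
-- Check whether auto-shrink is switched on: returns 1 if it is
SELECT DATABASEPROPERTYEX('SelfCare', 'IsAutoShrink');

-- Switch auto-shrink off
ALTER DATABASE SelfCare SET AUTO_SHRINK OFF;
```

Turning the option off stops the shrink/grow cycle; the plan below takes care of sizing the database properly afterwards.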

Together with our "DBA", I prepared the following plan:

1. We expanded the size of the self-care database so it could handle maintenance tasks, re-indexing, and big data imports without expansion.

2. We scheduled frequent enough transaction log backups in addition to the regular database backups. When the recovery model is full, the database server can reuse the transaction log space where committed transactions are stored only after that space has been backed up and marked free.

3. We set up an alert that informs the DBA when the space used is close to the database size.

4. When the free space becomes low, the DBA will schedule a short downtime in off-peak hours during which the database will be enlarged.
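Steps 1 and 2 boil down to a couple of T-SQL statements. A sketch, with placeholder database, file, and path names:

```sql
-- Step 1: grow the data file up front, so the engine never has to
-- expand it under load (logical file name and size are illustrative)
ALTER DATABASE SelfCare
MODIFY FILE (NAME = SelfCare_Data, SIZE = 20480MB);

-- Step 2: back up the transaction log frequently (e.g. scheduled as a
-- job every 15 minutes), so committed log space can be reused
BACKUP LOG SelfCare TO DISK = 'D:\Backup\SelfCare_log.trn';
```

The point of both statements is the same: pay the cost of file growth and log maintenance at a time of your choosing, not at peak load.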

Sunday, April 10, 2011

Rows matching all the values from a list

"Result that matches all the values from a list" is the title of a recent discussion on SQL Team.

Someone asked for a query that returns the Ids of products having both properties with id 20 and 23. A sample of the data is in the following table:


IdProduct    IdProperty
19           19
19           23
20           20
20           23

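For readers who want to try the solutions for themselves, the sample data can be reproduced like this (the table name tbPropertyProduct is taken from the solutions below; the primary key is an assumption):

```sql
CREATE TABLE tbPropertyProduct (
    IdProduct  int NOT NULL,
    IdProperty int NOT NULL,
    PRIMARY KEY (IdProduct, IdProperty)
);

INSERT INTO tbPropertyProduct (IdProduct, IdProperty) VALUES (19, 19);
INSERT INTO tbPropertyProduct (IdProduct, IdProperty) VALUES (19, 23);
INSERT INTO tbPropertyProduct (IdProduct, IdProperty) VALUES (20, 20);
INSERT INTO tbPropertyProduct (IdProduct, IdProperty) VALUES (20, 23);

-- Only product 20 has both properties 20 and 23, so each of the
-- three solutions should return the single row IdProduct = 20.
```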
Three solutions were offered.

The first solution uses only GROUP BY and HAVING:
SELECT IdProduct
FROM tbPropertyProduct
GROUP BY IdProduct
HAVING COUNT(DISTINCT CASE WHEN IdProperty IN (20, 23) THEN IdProperty END) = 2;

The second solution first filters the rows having either property 20 or 23, then uses GROUP BY to find the products having both properties.
SELECT IdProduct
  FROM tbPropertyProduct
 WHERE IdProperty IN (20, 23)
 GROUP BY IdProduct
HAVING COUNT(DISTINCT IdProperty) = 2;
The third solution finds the rows with one property and, among them, selects only the products that have the other property too.
SELECT p1.IdProduct
  FROM tbPropertyProduct p1
  JOIN tbPropertyProduct p2 ON p1.IdProduct = p2.IdProduct
 WHERE p1.IdProperty = 20 AND p2.IdProperty = 23;
A calculation of the number of rows each solution will access follows.

Let N be the number of rows in the table, p20 the number of products with property 20, and p23 the number of products with property 23. Assume that p20 <= p23 <= N. Under normal circumstances, it is reasonable to expect that p20 < p23 << N.

The first index configuration is a composite index on (IdProperty, IdProduct), in that order.

The first solution would access all rows in the index, group them, count distinct properties, and ideally eliminate a product as soon as it finds a third property for the given product id.
That means it would access all N index entries.

The second solution would access only the entries with properties 20 and 23, and count the number of distinct properties for each product. The number of index entries accessed is p20 + p23 in that case.

The third query would ideally access the index entries with IdProperty = 20 and, for each of them, try to find an entry with the same product id and IdProperty = 23. The number of rows accessed would be between p20 and p20 + p23; on average p20 + p20 * (p23/N), which is equal to p20 + p23 * (p20/N).

If there is only a composite index on (IdProduct, IdProperty), in that order, all three solutions would access all index entries; however, the amount of processing after the entries are accessed would differ. It is clear that the first query would need to do more than the second one, because the second eliminates all entries with inappropriate properties without counting. However, it is not clear whether the second or the third query would be more efficient. When deciding which one to use, you need to test them against your data and see which one works better.
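For reference, the two index configurations discussed above would be created like this (index names are illustrative):

```sql
-- Configuration 1: properties first, products second; lets the second
-- and third queries seek directly to IdProperty = 20 and 23
CREATE INDEX ix_PropertyProduct ON tbPropertyProduct (IdProperty, IdProduct);

-- Configuration 2: products first, properties second; every query has
-- to scan all entries, and only the later filtering differs
CREATE INDEX ix_ProductProperty ON tbPropertyProduct (IdProduct, IdProperty);
```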

Thursday, March 3, 2011

My wife and SQL cursor

It is incredible how much luggage my wife forced us to take on our three-week vacation. It was a problem moving it around, and we did not actually use half of it. When we came home at eight in the morning, we were already tired and nervous. It was the time our neighbors go to work, and it happened that a new neighbor had arrived with a truck half an hour earlier and started moving furniture into his flat on the second floor. Our building has two floors and a basement, with three flats on each floor. Our flat is on the first floor. Can you imagine my joy at the prospect of carrying all that luggage from the garage to the first floor, bumping into neighbors either carrying furniture or hurrying to work?

When I took the biggest suitcase, the discussion started and quickly escalated into yelling. Yes, I took the heaviest suitcase, but only one instead of everything at once.

My wife hates taking one piece at a time because:
a) It would take much more time than carrying everything at once.
b) She was irritated by the prospect of going up and down so many times and waiting for the furniture movers in the process.
c) Her legs were swollen, and again she did not want to go up and down so many times.

I did not want to carry everything at once because:
  a) I was simply not strong enough, and my back would have been hurt if I had even tried.
  b) I would have blocked all my neighbors in the process.
  c) The neighbors going to work were already frustrated by the time the new neighbor had chosen to start moving furniture.

Finally, my wife explained that she did not mean I should carry everything by myself. Her idea was that she and our small son would also take as much as they could. However, it was still unrealistic to take everything in one trip. We agreed that everyone would take as much as possible without hurting their back. I had to go up and down three times with the heaviest luggage, but it was the optimal approach, IMO.

When we finally finished, I started talking about row-by-row processing, the data set approach, processing huge volumes of data in batches, and using multiple processes to do the job in parallel. My wife thought I had gone crazy, but I hope you get the idea of what I was talking about. So, in the next post(s) I will discuss these approaches, their advantages and drawbacks, and when it is a good idea to apply them.

Wednesday, February 2, 2011

Running Totals in SQL, Back to the Future

I have just discovered a very good article and a great discussion about implementing running totals in T-SQL. The discussion is over 200 posts long, but it is worth reading from beginning to end, with the exception of a few posts discussing personal issues. Both the discussion and the article are full of impressive investigations and solutions. In 2006 I wrote an article about the same topic. I must admit that my investigation and solution were not quite as good as those I saw in the "future" article. A funny thing about the article is that it is dated in the future (2011/03/04), for a reason I do not quite understand.


Anyway, the purpose of my article was to show that it is better to solve cursor-friendly problems on the client side than in SQL. If you solve them in SQL, you have to choose between a slow, resource-intensive solution and a not quite reliable one. In my opinion, the article by Jeff Moden and the discussion about it confirm my conclusions from five years ago.

Sunday, January 16, 2011

Natural vs. Surrogate Keys, that is the question

The question has been discussed many times in database forums, especially SQL Server forums. The latest article about the topic was published a month ago on SQL-Server-Performance.com. I have a pretty strong opinion that surrogate keys are clearly the better choice in almost every case. In the few cases where you might not be wrong using a natural key, you could also use a surrogate key without regretting your decision in the future, so why bother even considering natural keys? That is a choice you would regret 99 times out of 100, so it is better not to spend time considering it.

Having stated my position, I am going to discuss the specific points from the article that I disagree with, even though I think the article is well written overall...
Which side is right?  Neither.  Both natural and surrogate keys have their own place, and a developer who doesn’t use both as the situation demands is shortchanging both himself and his applications.
...but as you can see above, the conclusion is wrong IMHO.
The value of a surrogate must never be exposed to the outside world.   Users should never be allowed to see the key, under any conditions.   Display the value of a surrogate key on a report, allow it to be viewed on a form or even used as a search term – these are all forbidden.   Once you expose a surrogate key, it immediately begins acquiring business meaning. 
I disagree. As long as you make sure the key value never changes, you can expose it. Usually you don't, but I don't think you must not. I might be wrong here, but I would like to see an example of a disaster happening because the value of a surrogate key was exposed.
Note: if your data already contains meaningful product codes or other keys such as those described above, then they are simply natural keys and the above caveat doesn’t apply.     It’s a smart key only when the value is constructed by the developer.
I agree a smart key is a huge mistake. However, smart keys are sometimes created by business people, not developers. The pharmaceutical company Galenika Belgrade used to have smart codes for their products. Donna Karan New York does the same. Why is it bad to use such codes as natural keys? Galenika in the 90s had tens of thousands, maybe even over 100 thousand products, and decided to change the structure of their smart key. Can you imagine the disaster that happened in their transaction database and in all the distributed databases of their partners? The consequence of using a natural product primary key was having that smart product code in every single product transaction, history tables, link tables, etc.
Immutability is certainly a desirable feature for a key, but it’s by no means a requirement.   Using this as an excuse is the trademark of a lazy developer.  Data changes.   If the key changes, update it.  If it’s being used as a foreign key in another table – update that table too.  Cascading updates exist for a reason, after all.
Very wrong and dangerous! The reason is mentioned in the following paragraph:
Obviously if a key changes very often, or will be used as a FK for many other tables, there can be performance or concurrency implications to making it the primary key.  In this case, you do want to consider a surrogate, and use it if appropriate.
The problem in Galenika was not that a key changed very often. It was enough that its structure changed once, for all products. I don't know in how many tables that code was a foreign key; however, the main problem, and a huge one in itself, was that it was a foreign key in the sales transactions table. So, my recommendation is not to consider a surrogate, but to use a surrogate and not even think about a "natural" key. Even if the key is not a "FK for many other tables", one table can be bad enough, and even if it is not a big deal at the beginning, it can become a disaster later. I have seen too many applications developed as a small tool for a single user that evolved into an enterprise-wide application that was never designed as such.
Replacing a wide key with a narrower value means smaller indexes, and more values retrieved from every index page read.  This does boost performance.  However,  you’ll usually retain the index on the natural key (to enforce uniqueness if nothing else) and that means another index to maintain.  If your table is very narrow, the additional column for the surrogate can noticeably impact performance and storage requirements.
This is really not a problem. For tables containing transactions, there is usually no "natural" key, or the "natural" primary key is a multi-column key, so even according to the author, a surrogate key should be used. If the table contains reference data, its size is insignificant, so one more index is not an issue. However, a wider index on the FK column in a transaction table is an issue.

To finish, the main problem with "natural" keys is that they are subject to change, usually wide, and often composite, and when changes happen you have to waste a lot of resources on updates all over the database. Formally, "natural" keys do not break any normal form, but they break one of the main reasons normal forms were introduced: instead of changing one piece of information in one place, you have to change it in many.

The primary key is used in an RDBMS to identify a row and establish relationships. If you overload the primary key with additional information (the value of a natural key), then whenever you change the value of the primary key column you have to update all related rows, even though the intention was not to change the relationship between rows in the master and child tables. That is the reason why the value of a primary key column should be meaningless.
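To illustrate the recommendation: a surrogate primary key plus a unique constraint on the business code keeps the wide, changeable value in exactly one place. A sketch, with illustrative names and types:

```sql
CREATE TABLE Product (
    ProductId   int IDENTITY(1, 1) PRIMARY KEY,  -- meaningless surrogate key
    ProductCode varchar(30) NOT NULL UNIQUE,     -- the "smart" business code
    ProductName varchar(100) NOT NULL
);

CREATE TABLE SalesTransaction (
    TransactionId int IDENTITY(1, 1) PRIMARY KEY,
    ProductId     int NOT NULL REFERENCES Product (ProductId),
    Quantity      int NOT NULL
);

-- If the structure of ProductCode changes, only the Product table is
-- updated; the SalesTransaction rows are untouched.
```

Had Galenika's product code been stored this way, restructuring the smart code would have been a single-table update instead of a cascade through every transaction, history, and link table.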

Sunday, January 9, 2011

Selling is a lot like fishing

You cannot expect that just putting a hook in nearby water is enough to be successful at fishing. You have to learn about the habits, preferences, and behavior of the specific fish you want to catch. Your fish may never come to the place where your hook is, even if you cast your lure frequently. You first need to see where your fish swim while looking for food. Of course, an un-baited hook will not do the trick. You have to have the right bait the specific fish is looking for at that time, and that depends on the season, weather conditions, and other environmental factors. The influence of environmental factors on fish behavior can change over time.

When you are in the right place at the right time, you can expect moderate success. Your cat will be happy, and maybe even your wife, who is waiting for the fresh fish you promised for dinner. However, it might not be enough to brag about your success and offer dinner to your neighbours.

The fish are not checking only the place where your hook is, so you need to attract them to investigate your location often, and even to gather at the time you are fishing. You need to use a proper lure frequently enough while you are fishing. In some cases it is even possible to place your lure at the spot days in advance, which will attract fish to gather regularly at "your" place.

The thickness of the line, the size and shape of the hook, the fishing rod, and the rest of the fishing equipment will greatly influence your success. Without the proper equipment, most of the fish might escape or might not even bite at all.

*

This is the complete text that appears in the flash animation on the Argo Team web site. It is the story I told the owner of Blue Rabbit s.r.o. to persuade him to try data-driven marketing campaigns in order to increase sales from their customers with discount cards.

*

Blog post from Mirko Marović in English.