Sunday, April 10, 2011

Rows matching all the values from a list

Result that matches all the values from a list is title of recent discussion on SQL Team.

Someone asked for query that returns Ids of products having both properties with id 20 and 23. The sample of data is in the following table:


IdProduct
IdProperty
19
19
19
23
20
20
20
23

Three solutions were offered.

The first solution uses only group by and having
SELECT IdProduct
FROM tbPropertyProduct
GROUP BY IdProduct
HAVING COUNT(DISTINCT CASE WHEN IdProperty IN (20, 23) THEN IdProperty END) = 2;

The second solution, filters selects rows having either property 20 or 23, then using group by finds products having both properties.
SELECT IdProduct
  FROM Table
 WHERE IdProperty IN (20, 23)
 GROUP BY IdProduct
HAVING COUNT(DISTINCT IdProperty) = 2;
 The third solution finds rows with one property and among them selects only products having another property too.
select p1.IdProduct
  from tbPropertyProduct p1
  join tbPropertyProduct p2 on p1.idProduct = p2.idProduct 
  where p1.idProperty = 20 and p2.idProperty = 23
Calculation of number of rows each solution will access follows.

N is number of rows in the table. p20 is number of products with property 20, p23 is number of products with property 23. Let me assume that p20 <= p23 <= N. Under normal circumstances, it is reasonable to expect that p20 < p23 << N.

The first index configuration is a composite index on IdProperty and IdProduct in that order.

The first solution would access all rows in the index, group them, count distinct properties and ideally eliminate products as soon as if finds the third property for given product id.
It means it would access all N index entries.

The second solution would access only entries with properties 20 and 23, and count number of distinct properties for each product. The number of index items accessed is p20 + p23 in that case.

The third query would ideally access index items with IdProperty = 20 and for each of them try to find the entry with the same product id and IdProperty = 23. The number of rows would be between p20 and p20 + p23, on average p20 + p20 * (p23/N) which is equal to p20 + p23*(p20/N).

If there is only composite index on IdProduct and IdProperty in that order, all 3 solutions would access  all index entries, however amount of processing after entries are accessed would be different. It is clear that the first query would need to do more then second one, because the second one would eliminate all entries with inappropriate property without counting. However, it is not clear which query, 2nd or 3rd would be more efficient. When deciding which one to use, you need to test it against your data and see which one works better.