How to select TOP N rows from a table for each group?

by S.Moiseenko (2009-05-23)

Such a question arises, for example, when it is needed to present on site 3 most recent reviews within each news section, or advertisement of 5 most popular goods within each category.

In order to solve this problem, one should distribute the whole set of strings into groups, make the sorting in accordance with the required criteria (date and number of sales) within each group and consecutively choose the number of strings beginning with the first row in each group.

Such problem might be solved in the procedural way using temporary tables and/or cursors. In this connection I would like to suggest two solutions of the problems of the type which are solved on the site, i.e. containing one SELECT query. The first solution is of the classical type, which can be efficiently applied to the majority of DBMS, the second solution requires the use of new constructions, which have been worked out recently in SQL:1999 standard and are not so widely maintained yet.

Let's consider the following problem:

From Product table, retrieve for each group characterized by the type of the product three models with the lowest numbers.

I.e. one should retrieve 3 computers, 3 laptops and 3 printers, the numbers of which are lower than of the rest models within each group. As the model number is unique in Product table, there arises no problem with duplicate models. Note that the problem of duplicates is not of principal value; however it requires certain specification of the wording.

"Classical" solution

This solution is based on the algorithm of numeration of rows that are returned by a query. I.e. we numerate the rows, and then choose those of them, which have the numbers less than the given value. In accordance with the aforementioned algorithm, the query, which launching numeration of the whole set of rows in the table, which are listed in accordance with ascending model numbers can be recorded in the following way:


SELECT Pr1.model, COUNT(*) num
FROM Product Pr1 JOIN Product Pr2
	ON Pr1.model >= Pr2.model
GROUP BY Pr1.model

But for solution of our problem one should mark not the whole set but each group separately. It can be easily achieved, if the condition of combining of two tables also implies the condition of coincidence of product types, as well as grouping by the type:


SELECT Pr1.model, Pr1.type, COUNT(*) num
FROM Product Pr1 JOIN Product Pr2
	ON Pr1.type = Pr2.type AND Pr1.model >= Pr2.model
GROUP BY Pr1.type, Pr1.model
HAVING COUNT (*) <= 3
ORDER BY type, model

The clause


HAVING COUNT (*) < = 3

in accordance with the wording of the problem, says that the number of strings within each group should not exceed 3. Actually we have already solved this problem. It is enough to add data on producer (maker), and it can be done in various ways. I. e., it is enough to join the aforementioned query and Product table in accordance with model number, or to use the correlate subquery in SELECT clause. For academic purposes I will quote both approaches.

1. Joining


SELECT maker, X.model, X.type
FROM product JOIN (
		SELECT Pr1.model, Pr1.type
		FROM Product Pr1 JOIN Product Pr2
			ON Pr1.type = Pr2.type AND Pr1.model >= Pr2.model
		GROUP BY Pr1.type, Pr1.model
		HAVING COUNT (*) <= 3
				) X on X.model = product.model
ORDER BY type,model

Above we have excluded extra num column, which is used for demonstrational purposes, as we do not need to retrieve row number.

2. Subquery in SELECT clause


SELECT (SELECT maker
		FROM Product
		WHERE Product.model = Pr1.model) maker,
	Pr1.model, Pr1.type
FROM product Pr1 JOIN product Pr2
	ON Pr1.type = Pr2.type AND Pr1.model >= Pr2.model
GROUP BY Pr1.type, Pr1.model
HAVING COUNT (*) <= 3
ORDER BY type,model

Use of subquery in SELECT clause is possible, if it gives only one value for each string of the external query. This term is observed, as we choose the producer of the model, which is transferred from the external query and is unique (the primary key to Product table).

Solution based on ranking functions

Ranking functions - ROW_NUMBER, RANK, DENSE_RANK, and NTILE - have been present in SQL Server, beginning with 2005 version. Their appearance in SQL language was caused by the necessity to make simply ordered calculations. Namely, our exercise belongs to this class of problems. Now we have the chance to appreciate this kind of innovation. :-)

For solution of our problem RANK function is used. This function enables us to subdivide all the rows returned by the query into groups and to calculate the rank position in accordance with succession of each row in the group by way of the sorting given. As we are going to sort in accordance with unique model number, the position will actually coincide with row number in the group. Thus, this is the solution:


SELECT maker, model, type FROM
(
SELECT maker, model, type, RANK() OVER(PARTITION BY type ORDER BY model) num
FROM Product
) X
WHERE num <= 3

Namely, all the operations are executed within subquery. Exterior query is used only in order to limit the selection to three models within each group. In other words, we leave only those rows, the position number of which is not more than 3.

It is quite an economical kind of solution, isn't it? However let's consider the construction more closely


RANK() OVER(PARTITION BY type ORDER BY model)

The PARTITION BY type clause determines the groups, i.e. the rows of one and the same type will be found in the same group (the same value in type column).

The ORDER BY model clause sets the sorting of the rows within the group (model number being increasing).

Finally, RANK() determines the rank for each row in the group on the basis of the determined sorting, i.e. the first row in the group is marked with rank 1, the subsequent one, if it has a different model number, is marked with rank 2 and so on. As I have already said, as the number of the model is unique, each row in the group will have a different rank. Otherwise, the rows with identical number of the model would have the same rank.

Detailed description of ranking functions exceeds the scope of analysis of this article, but it is quite possible that I will write something on this matter for SQL Textbook.

Dzone.com

Previous | Index | Next

Home

SELECT exercises (rating stages)

DML exercises

Developers