How to delete duplicates in the presence of a primary key?

by S.Moiseenko (2009-07-25)

In previous paper we were considering the duplicate rows problem resolving, caused by lack of a primary key. Now let consider more difficult case, when key is present but it is synthetic. When design is improper it can lead to duplicate rows appearing, from a subject area view point.

It is strange, but even if I am tell to my students about synthetic primary key disadvantages, they are still use ones in their first data base projects. Probably all people need to enumerate everything :)

I don't want to discuss here trite synthetic key problem. Just would like to tell if you decided to use synthetic key as a primary, you should create natural unique key to avoid situation described below.

So, let we have a table with primary key �id� and column �name�. In accordance with subject area restrictions the column �name� must contain unique value. However, if the table structure is determined as


CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY,
			  name VARCHAR(50));

a duplicate rows can appear. It will be better to use table design as


CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY,
			  name VARCHAR(50) UNIQUE);

Everybody know which way is better but sometimes we need to deal with inherited structure and data, which is violate subject area restrictions. For example:

id   	name  
1	John
2	Smith
3	John
4	Smith
5	Smith
6	Tom

You can ask: �What is the difference between this problem and previous one? Probably here we have more easy solution. We just need to delete all rows from each groups with same value of a name except rows with minimum/maximum value of an id. For example so:�


DELETE
FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);

It is right, but I still have not told you everything. Let we have the derived table T_details associated with the table �T_pk� on the foreign key:


CREATE TABLE T_details (id_pk INT FOREIGN KEY REFERENCES
	T_pk ON DELETE CASCADE,
	color VARCHAR(10),
	PRIMARY KEY (id_pk, color);

This table can contain data like this:

id_pk    color  
1	blue
1	red
2	green
2	red
3	red
4	blue
6	red

For better visibility let use query

SELECT id, name, color FROM T_pk JOIN T_details ON id= id_pk;

to see the names

id   	name    color   
1	John	blue
1	John	red
2	Smith	green
2	Smith	red
3	John	red
4	Smith	blue
6	Tom	red

It is shown that one person data is erroneously belong to different parent entries. Furthermore, duplicate rows are in this table as well:

1	John	red
3	John	red

Similar data will lead to erroneously data analysis. Furthermore, a cascade deleting will lead to data loosing. For example, if we will left only minimum identifier rows in each groups of the T_pk table, we will lost the row

4	Smith	blue

in the table T_details. Consequently, during duplicate entries deleting we should take into account both tables - T_pk and T_details.

The data cleaning can be done in two stages:

Update table T_details to join a data, which is related to one name, to a row with minimum id.
Delete duplicate entries from table T_pk, except rows with minimal id, in each group characterized by the same value in the name column.

Updating T_details table

Query


SELECT id_pk, name, color
	   , RANK() OVER(PARTITION BY name, color ORDER BY name, color, id_pk) dup
	   ,(SELECT MIN(id)  FROM T_pk WHERE T_pk.name = X.name) min_id
FROM T_pk X JOIN T_details ON id=id_pk;

determines the number of a duplicate rows (value of dup value is greater than 1) and the minimum value of an id in a equal-name groups (min_id). Here is the result of that query:

id_pk   name    color   dup  min_id   
1	John	blue	1	1
1	John	red	1	1
3	John	red	2	1
4	Smith	blue	1	2
2	Smith	green	1	2
2	Smith	red	1	2
6	Tom	red	1	6

Now we need to replace the value of a id_pk to min_pk for each row, except the third one, because it is a duplicate copy of the second row. The value of a dup=2 is indicate on that. The query for updating can be as:


UPDATE T_details
SET id_pk=min_id
FROM T_details T_d JOIN (
		SELECT id_pk, name, color
			   , RANK() OVER(PARTITION BY name, color ORDER BY name, color, id_pk) dup
			   ,(SELECT MIN(id)  FROM T_pk WHERE T_pk.name = X.name) min_id
		FROM T_pk X JOIN T_details ON id=id_pk
						) Y ON Y.id_pk=T_d.id_pk
WHERE dup =1;

Updated table T_details will be like this:

id_pk   color
1	blue
1	red
2	blue
2	green
2	red
3	red
6	red

It is shown that only one duplicate row is left:

3	red

But it is no need to worry about that row because it will be deleted after duplicate rows cascade deleting from table T_pk:


DELETE
FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);

The last query is a second stage of a deleting procedure. The result of that query looks like this:

Table T_pk
id   	name  
1	John
2	Smith
6	Tom

Table T_details
id_pk   color 
1	blue
1	red
2	blue
2	green
2	red
6	red

Only need to apply restriction to avoid duplicates in future:


ALTER TABLE T_pk
ADD CONSTRAINT unique_name UNIQUE(name);

Dzone.com

Previous | Index | Next

Home

SELECT exercises (rating stages)

DML exercises

Developers