Sunday 6 February 2011

Deleting Duplicate Rows

Myself and two of my colleges were lucky enough to be invited to meet Denise Drapper (Head of Data Integration at Microsoft) last week, and amongst other things, she gave us a tour of the upcoming DQS (Data Quality Services) that is set to ship with Denali. Basically, it supports the kind of functionality that previously we would have to do manually, by using Fuzzy Lookup component, etc in SSIS, but this is using different algorithms and is all wrapped up into a nice easy to use GUI, that makes suggestions about possible data quality issues, with a similarity score, and then asks you to confirm or deny the duplicate. It also learns from the work you have done previously, which is pretty cool.

Unfortunately, it doesn't look like I'm going to get my hands on this tech, to play around with it until CTP3 in the Autumn, however, so for now, we are stuck with the tools we have. And that brings me nicely to the point of this post. Basically, if you scan the Net for ways of deleting rows which are duplicated, but have different keys, there are several solutions, most of them derived from the Microsoft solution, which is not pretty, and involves moving data into # tables, etc.

In SQL 05 and 08, however, providing that there is a way of uniquly identifying each row on a series of columns, other than the unique key (even if this is consists of every column in the table) there is a method that can delete duplicates in one statement, without any # tables. (Although it will still create structures behind the scenes in TempDB).

Lets assume that we have a table created with the following structure...

CREATE TABLE DeleteDuplicates
(
    PKCol INT IDENTITY PRIMARY KEY CLUSTERED,
    Col1 INT,
    Col2 VARCHAR(20),
    Col3 VARCHAR(20)
)


...Now lets populate it with a few rows...

INSERT INTO DeleteDuplicates(Col1, Col2, Col3)
VALUES (1, 'Peter', 'Carter-Greenan'),
       (2, 'Peter', 'Carter-Greenan'),
       (3, 'Charles', 'Feddersen')


...Ok, so if we assume that in this table, business rules allow us to identify duplicates using Col2 and Col3, we can see that we have one duplicate. So now, we can run the following query to remove the duplicate row...

DELETE
FROM DeleteDuplicates
WHERE PKCol IN
(

    SELECT PKCol FROM
    (
    SELECT

        PKCol,
        ROW_NUMBER() OVER(PARTITION BY Col2, Col3 ORDER BY Col2, Col3) as RowNumber
        FROM DeleteDuplicates
    ) Dupes
    WHERE Dupes.RowNumber > 1
)


...So there you have it. Dupes removed, no temp tables, one statement. What more is there to say?


Find my book, Pro SQL Server Administration on Amazon -

America

United Kingdom

No comments:

Post a Comment