We always in inevitability end up with duplicates data where we don’t need it. This is a simple way to find and remove those duplicates. I have documented this procedure for MSSQL 2008 up. Email me if you want the same type of procedure for Databases where you cannot run a CTE (common Table expression).
Finding the duplicates
The first hurdle to success is to find the columns that we can use to determine whether the duplicate exists, this is don’t by writing a query to verify the duplicate Data exists, without that then this is almost a pointless task. In my case It was with a customer Table (Table and Column obfuscated names obviously). Numerous customers had value had duplicate ProfileKey’s, whilst this is only an idea it was critical to avoid duplicates in many systems such as billing and customer management systems.
WITH CTE_DUPLICATE_Profile AS ( SELECT Id, ROW_NUMBER() OVER (PARTITION BY ProfileKey, CustomerName, ORDER BY CustomerId) AS row_num, CustomerId, CustomerName, BusinessSystem, UpdateDate FROM CustomerProfile ) SELECT * FROM CTE_DUPLICATE_Profile WHERE row_num > 1;
Looking at the code above, we essentially are numbering the rows per occurrence. Obviously, any row number greater then two means that duplicates exist, this is performed by the filter “WHERE row_num > 1;”.
Removing the Duplicates
Now to remove the duplicate we need to alter our script slightly, so that we can delete these duplicate rows. Depending on your requirements you will need to filter you records to determine which is the most current or Valid row you want to keep.
WITH CTE_DUPLICATE_Profile AS ( SELECT Id, ROW_NUMBER() OVER (PARTITION BY ProfileKey, CustomerName, ORDER BY CustomerId) AS row_num, CustomerId, CustomerName, BusinessSystem, UpdateDate FROM CustomerProfile ) DELETE FROM CTE_DUPLICATE_Profile WHERE row_num > 1;
In this simple real world Scenario we have deleted the duplicates, we can employ the same method to find, Merge and Delete the duplicates.