Search Wiki:

Find and/or Delete Duplicate Rows


Let us see how to find and delete duplicate rows from a table which does not have a Primary Key.

Sample Table:

ID CustName Pincode
1 Jack 45454
2 Jill 43453
3 Tom 43453
4 Kathy 22343
5 David 65443
6 Kathy 22343
7 Kim 65443
8 Hoggart 33443
9 Kate 61143
10 Kim 65443


Desired Output: after removing duplicates

ID CustName Pincode
1 Jack 45454
2 Jill 43453
3 Tom 43453
4 Kathy 22343
5 David 65443
7 Kim 65443
8 Hoggart 33443
9 Kate 61143

Create Sample Data

-- Suppress data loading messages
SET NOCOUNT ON
 
-- Create Table
CREATE TABLE #Customers (ID integer, CustName varchar(20), Pincode int)
 
-- Load Sample Data in Table
INSERT INTO  #Customers VALUES (1, 'Jack',45454 )
INSERT INTO  #Customers VALUES (2, 'Jill', 43453)
INSERT INTO  #Customers VALUES (3, 'Tom', 43453)
INSERT INTO  #Customers VALUES (4, 'Kathy', 22343)
INSERT INTO  #Customers VALUES (5, 'David', 65443)
INSERT INTO  #Customers VALUES (6, 'Kathy', 22343)
INSERT INTO  #Customers VALUES (7, 'Kim', 65443)
INSERT INTO  #Customers VALUES (8, 'Hoggart', 33443)
INSERT INTO  #Customers VALUES (9, 'Kate', 61143)
INSERT INTO  #Customers VALUES (10, 'Kim', 65443)

Approach: Some questions to ask yourself before going ahead

What are the rules which qualify a row as a duplicate row in the Customers table?
Rows are qualified as duplicate rows in the Customers table if they have similar CustName and Pincode.

How do I identify duplicate rows?
Group the rows by CustName, Pincode. Rows having similar CustName and Pincode will have more than one rows in the grouping. So locate them using HAVING COUNT(*) > 1. If duplicate values are encountered, return the maximum ID for each duplicate row. Using the outer query, delete any ID returned by subquery.

SQL Server 2005 Solution
-- Find Duplicate Rows
SELECT	MAX(ID) as ID, CustName, Pincode FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1
 
-- Delete Duplicate Rows
DELETE FROM #Customers
WHERE ID IN 
( SELECT	MAX(ID) FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1)

SQL Server 2000 Solution
-- Find Duplicate Rows
SELECT	MAX(ID) as ID, CustName, Pincode FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1
 
-- Delete Duplicate Rows
DELETE FROM #Customers
WHERE ID IN 
( SELECT	MAX(ID) FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1)
___________________________________________________________________________________________________________________
Page Created By: Suprotim Agarwal, March 20, 2008
Last edited Apr 25 2008 at 10:54 AM  by SuprotimAgarwal, version 7
Comments
ArnieRowland wrote  Apr 12 2008 at 5:42 PM  
(This is a test comment to see if the author will be notified.)
Suprotim, will you be adding a SQL 2000 method to the Duplicate Rows page?

zuomin wrote  Apr 17 2008 at 12:09 PM  
Optimize the find duplicate rows query a little bit.

SELECT MAX(ID), CustName, Pincode FROM #Customers GROUP BY CustName, Pincode
HAVING COUNT(*) > 1;

Vansha wrote  Apr 23 2008 at 9:37 AM  
Another solution in a case when more than two duplicate rows occurs.

-- Append third duplicate (to rows with Id=7 and Id=10) row to test data
INSERT INTO #Customers VALUES (11, 'Kim', 65443)

in this case row (7, 'Kim', 65443) and row (10, 'Kim', 65443) still present in the table after script runs.

-- Delete All Duplicate Rows
DELETE FROM #Customers
WHERE EXISTS (
SELECT NULL
FROM #Customers c2
WHERE #Customers.CustName = c2.CustName
AND #Customers.Pincode = c2.Pincode
AND #Customers.ID > c2.Id
)

SuprotimAgarwal wrote  Apr 25 2008 at 10:52 AM  
Thanks Vansha. If there are more than one duplicate rows, your suggested query could also be written like this:

SELECT * FROM #Customers cust WHERE EXISTS
(SELECT * FROM #Customers where CustName = cust.CustName and Pincode = cust.Pincode and ID < cust.ID)

DELETE cust FROM #Customers cust WHERE EXISTS
(SELECT * FROM #Customers where CustName = cust.CustName and Pincode = cust.Pincode and ID < cust.ID)

RyanRandall wrote  May 19 2008 at 2:17 AM  
Another option for SQL 2005. This technique works for multiple duplicates AND when there is no way of uniquely identifying the row.

-- Delete duplicate rows
; WITH a AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY CustName, Pincode
ORDER BY ID) AS RowId FROM #Customers)
DELETE FROM a WHERE RowId > 1

epikarma wrote  May 24 2008 at 2:16 AM  
Maybe should be better using self join to identify and delete duplicates.
See: http://www.ugmfree.it/TipsTsql.aspx?tip=TipTsqlDeleteDuplicates

gr8tushar wrote  Nov 12 2008 at 1:48 PM  
Ryan solutions looks good. It would work even if we have multiple duplicate contacts.

andrej351 wrote  May 6 2009 at 12:34 AM  
good solution, i used this method to identify duplicate rows:

SELECT Alias1.*
FROM TableWithDuplicates AS Alias1 INNER JOIN
TableWithDuplicates AS Alias2 ON Alias1.PrimaryKeyColumn <> Alias2.PrimaryKeyColumn AND Alias1.DuplicateColumn = Alias2.DuplicateColumn
ORDER BY Alias1.DuplicateColumn

Updating...
Page view tracker