October 7, 2013 at 9:18 pm
Comments posted to this topic are about the item Removing Duplicates
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
October 8, 2013 at 1:38 am
Since I have worked with Oracle, I have removed duplicates using the rowid pseudocolumn. SQL Server has something similar, but it's an undocumented feature (it might change without notice).
delete
from MyTable t1
where exists
(select *
from MyTable t2
where t1.column1 = t2.column1
and t1.column2 = t2.column2
...
and t1.rowid > t2.rowid) -- in Oracle
and t1.%%physloc%% > t2.%%physloc%%) -- in SQL Server
This self join can be useful; you don't need to compare all columns, for example, and it removes more than one row in the case of triple duplicates.
October 8, 2013 at 1:56 am
peterswe (10/8/2013)
Since I have worked with Oracle, I have removed duplicates using the rowid pseudocolumn. SQL Server has something similar, but it's an undocumented feature (it might change without notice).
delete
from MyTable t1
where exists
(select *
from MyTable t2
where t1.column1 = t2.column1
and t1.column2 = t2.column2
...
and t1.rowid > t2.rowid) -- in Oracle
and t1.%%physloc%% > t2.%%physloc%%) -- in SQL Server
This self join can be useful; you don't need to compare all columns, for example, and it removes more than one row in the case of triple duplicates.
http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/
Not so sure how happy I would be using this - but then as Paul says all the cool stuff is undocumented.
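For anyone curious what %%physloc%% actually contains, a quick sketch (using the article's DuplicateRow table) is to pair it with the companion formatter function from Paul's post, which renders it as file:page:slot. Both are undocumented and may change without notice, so treat this as exploration only:
-- Sketch only: %%physloc%% and sys.fn_PhysLocFormatter are undocumented
SELECT sys.fn_PhysLocFormatter(%%physloc%%) AS FilePageSlot, *
FROM DuplicateRow;
Two physically identical rows will still have different locations, which is why the t1 > t2 comparison above can tell duplicates apart.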
Great question. I really like how you have included the ability to retain the latest version. I usually use a simpler way to remove duplicate rows.
Using the same table...
Declare @FName varchar(30), -- value to look for
@LName varchar(30), -- value to look for
@cnt int -- count
Declare getallrecords cursor local static For
Select count(1), FName, LName
from DuplicateRow (nolock)
group by FName, LName having count(1) > 1
Open getallrecords
Fetch next from getallrecords into @cnt,@FName,@LName
--Cursor to check with all other records
While @@fetch_status=0
Begin
Set @cnt= @cnt-1
Set rowcount @cnt
-- Deleting the duplicate records. Note that all the columns that define a duplicate appear in the where condition
Delete from DuplicateRow where Fname=@FName and LName=@LName
Set rowcount 0
Fetch next from getallrecords into @cnt,@FName,@LName
End
Close getallrecords
Deallocate getallrecords
Obviously, how many columns you need to compare to ensure you're only removing duplicates will determine what you declare and select.
Hope this helps...
Ford Fairlane
Rock and Roll Detective
October 8, 2013 at 3:59 am
Thanks for a useful article, clearly explained. I shall have occasion to use your method when I import data from Excel, which despite precautions has duplicates.
October 8, 2013 at 4:55 am
I believe the following statement in the article is misleading:
You don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.
It is required that the select section be in the delete statement because the windowing functions cannot be part of the where clause. The following DOES NOT work:
SELECT
ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
, FName
, LName
, JobTitle
, Age
FROM DuplicateRow
WHERE ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) > 1
Gives the following error:
Windowed functions can only appear in the SELECT or ORDER BY clauses.
October 8, 2013 at 4:58 am
Thanks Stefan. Great article.
October 8, 2013 at 7:19 am
tom.w.brannon (10/8/2013)
I believe the following statement in the article is misleading:
You don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.
It is required that the select section be in the delete statement because the windowing functions cannot be part of the where clause. The following DOES NOT work:
SELECT
ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
, FName
, LName
, JobTitle
, Age
FROM DuplicateRow
WHERE ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) > 1
Gives the following error:
Windowed functions can only appear in the SELECT or ORDER BY clauses.
What I meant by "You don't need to do a SELECT first" is that you can just run the DELETE query; the SELECT subquery within the DELETE is still needed. By "first" I meant running the SELECT to see the results before running the DELETE. Both versions include that SELECT subquery as part of the operation.
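To make that concrete, here is a sketch of the form that does work: the ROW_NUMBER() lives in a CTE (or derived table, as in the article), and the outer DELETE filters on it. Table and column names follow the article's example.
-- Deleting through a CTE removes the matching rows from the
-- underlying DuplicateRow table
;WITH Dupes AS
(
SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) AS R
FROM DuplicateRow
)
DELETE FROM Dupes
WHERE R > 1;
Running the standalone SELECT first is optional; the subquery inside the DELETE is not.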
October 8, 2013 at 7:56 am
-- I have added DateInserted and Address columns to show how to remove all old records for the same person
--drop table #DuplicateRow
CREATE TABLE #DuplicateRow(
ID INT identity(1,1),
DateInserted Datetime,
FName varchar(30),
LName varchar(30),
JobTitle varchar(30),
Age tinyint,
[Address] Varchar(255)
)
GO
INSERT INTO #DuplicateRow (DateInserted,FName,LName,JobTitle,Age,[Address])
SELECT DateInserted,FName,LName,JobTitle,Age,[Address] FROM (
SELECT '1/1/12' [DateInserted], 'Mangu' [FName], 'Changu' [LName], 'Bekar' [JobTitle], 21 [Age], '1 broad street,ny' [Address] UNION ALL
SELECT '2/1/12' [DateInserted], 'Mangu' [FName], 'Changu' [LName], 'Bekar' [JobTitle], 21 [Age], '11 broad street,ny' [Address] UNION ALL
SELECT '6/1/12' [DateInserted], 'Mehul' [FName], 'Shah' [LName], 'Manager' [JobTitle], 55 [Age], '1 main street,ny' [Address] UNION ALL
SELECT '3/1/12' [DateInserted], 'Mehul' [FName], 'Shah' [LName], 'Manager' [JobTitle], 55 [Age], '11 main street,ny' [Address] UNION ALL
SELECT '7/1/12' [DateInserted], 'Mangu' [FName], 'Changu' [LName], 'Bekar' [JobTitle], 21 [Age], '11/a broad street,ny' [Address] UNION ALL
SELECT '8/1/12' [DateInserted], 'Mangu' [FName], 'Changu' [LName], 'Bekar' [JobTitle], 21 [Age], '11/b broad street,ny' [Address] UNION ALL
SELECT '9/1/12' [DateInserted], 'Mehul' [FName], 'Shah' [LName], 'Manager' [JobTitle], 55 [Age], '115 main street,ny' [Address] UNION ALL
SELECT '10/1/12' [DateInserted], 'Mehul' [FName], 'Shah' [LName], 'Manager' [JobTitle], 55 [Age], '118 main street,ny' [Address]
) A
SELECT * FROM #DuplicateRow order by FName,LName,JobTitle,Age,DateInserted desc
-- we need to remove all rows except id= 6 and 8
SELECT
ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc ) R
, FName
, LName
, JobTitle
, Age
FROM #DuplicateRow
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc) R
, FName
, LName
, JobTitle
, Age
FROM #DuplicateRow
) B
WHERE R > 1
DELETE B
FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc ) R
FROM #DuplicateRow
) B
WHERE R > 1
SELECT * FROM #DuplicateRow
October 8, 2013 at 7:57 am
Hello There,
You may check this article, which provides detailed examples of how to remove duplicate records.
http://www.dfarber.com/computer-consulting-blog/2011/12/26/remove-duplicate-records-in-sql.aspx
It combines several values from several records into one good record.
Regards,
Doron
The Farber Consulting Group, Inc.
October 8, 2013 at 8:23 am
Short, sweet, to the point with great examples and good simple explanations. Very nicely done, Mr. Krzywicki.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 8, 2013 at 8:41 am
I can see the need to understand how to do this, although I tend to go back to the basics.
If you find yourself needing to do this, ask whether it could have been designed into the table in the first place to prevent it.
After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.
I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.
If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.
And the changing job title also drives me towards separating out to different tables and having effectivity dates.
But that is way beyond your intended scope.
Just trying to spark a thought or two, not to make a big deal about any of this.
October 8, 2013 at 8:48 am
Jeff Moden (10/8/2013)
Short, sweet, to the point with great examples and good simple explanations. Very nicely done, Mr. Krzywicki.
Thank you!
October 8, 2013 at 8:53 am
Hi Guys,
I see there is a problem with the DB structure (table structure). Usually each master table should have at least a primary key. Using a proper primary key, you avoid the problem...
October 8, 2013 at 8:53 am
Greg Edwards-268690 (10/8/2013)
I can see the need to understand how to do this, although I tend to go back to the basics.
If you find yourself needing to do this, ask whether it could have been designed into the table in the first place to prevent it.
After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.
I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.
If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.
And the changing job title also drives me towards separating out to different tables and having effectivity dates.
But that is way beyond your intended scope.
Just trying to spark a thought or two, not to make a big deal about any of this.
It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data, or maybe this should be calculated," then I reminded myself that was outside the scope of this article and I just needed something to use as an example. :-)
Same thing with table design, there are situations where you have no way to prevent this ahead of time, whether it is because the duplicates are in the data coming in or because the table is already in production and the powers that be won't approve a structural change. I wanted to keep the focus on this one task as dealing with every possibility would make the article far longer.
October 8, 2013 at 8:56 am
Jozef Moravcik (10/8/2013)
Hi Guys, I see there is a problem with the DB structure (table structure). Usually each master table should have at least a primary key. Using a proper primary key, you avoid the problem...
Sure, and if duplicate data gets into your table, you can use the primary key to remove duplicates, but this gives you a way to find what those duplicates are.
But what do you do if you don't have the ability to add the primary key? Or if you need to remove duplicates on large batches of data that are coming into your system from outside sources? This should help with any of those situations.
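For that incoming-batch case, one sketch is to filter duplicates out before they ever reach the target table. The staging and target table names below are hypothetical; the column list follows the article's example:
-- Keep only one row per duplicate group from the incoming batch
INSERT INTO TargetTable ( FName, LName, JobTitle, Age )
SELECT FName, LName, JobTitle, Age
FROM ( SELECT FName, LName, JobTitle, Age,
ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY (SELECT NULL) ) AS R
FROM StagingTable ) s
WHERE R = 1;
If a "keep the latest" rule applies, replace (SELECT NULL) with an ORDER BY on the relevant date column, descending.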
Viewing 15 posts - 1 through 15 (of 52 total)