how to find duplicates in a table as per particular column

Question

Post reply

how to find duplicates in a table as per particular column

Jackson89

Ten Centuries

Points: 1141
More actions
November 2, 2015 at 4:49 am

#312736

Hi All,
I am having a main table and temptable.
Every sunday, new data will be loaded from temptable to main table. I have to make sure that, duplicates does not get loaded from temptable to maintable.
For example, if last sunday a record gets loaded from temp to main. If this sunday also the same record is present then it means that is a duplicate.
The duplicate is decided on below scenario
select 'CodeChanges: ', count(*) from CodeChanges a, CodeChanges_Temp b
where a.AccountNumber = b.AccountNumber and
a.HexaNumber = b.HexaNumber and
a.HexaEffDate = b.HexaEffDate and
a.HexaId = b.HexaId and
a.TransactionEffDate = b.TransactionEffDate and
isnull(a.TransactionExpDate,getdate()) = isnull(b.TransactionExpDate,getdate()) and
a.HexaCode = b.HexaCode and
and a.CompanyCode = b.CompanyCode
a.CategoryCode = b.CategoryCode and
a.AdjustmentUserId = b.AdjustmentUserId
Yesterday (Sunday) , data from temp got loaded onto maintable but with duplicates.
There is a log which just displays number of duplicates.
Yesterday the log displayed 8 duplicates found. I need to find out the 8 duplicates which got loaded yesterday and delete it off from main table.
There is a column in both tables which is 'creation date and time'. Every Sunday when the load happens this column will have that day's date .
Now i need to find out what are all the duplicates which got loaded on this sunday.
The total rows in temp table is : 363
No of duplicates present is : 8
I used below query to find out the duplicates but it is returning all the 363 rows from the maintable instead of the 8 duplicates.
Select 'CodeChanges: ', * from CodeChanges a where
exists
( Select 1 from CodeChanges_Temp b where
a.HexaNumber = b.HexaNumber and
a.HexaEffDate = b.HexaEffDate and
a.HexaId = b.HexaId and
a.TransactionEffDate = b.TransactionEffDate and
isnull(a.TransactionExpDate,getdate()) = isnull(b.TransactionExpDate,getdate()) and
a.HexaCode = b.HexaCode and
and a.CompanyCode = b.CompanyCode
a.CategoryCode = b.CategoryCode and
a.AdjustmentUserId = b.AdjustmentUserId
b.CreationDateTime = '2015-11-01 00:00:00.000'
)
and a.CreationDateTime = '2015-11-01 00:00:00.000'
Please help me in finding the duplicate records which has creation date time as '2015-11-01 00:00:00.000'
and all the above columns mentioned in the query matches.
Example
Few colums only metioned below
creationdateandtime HexaNumber HexaCode
1. 1987-10-01 00:00:00.000 5 3
2. 2015-11-01 00:00:00.000 5 3
So here the second record is duplicate. This is what I am trying to find.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

A query like this

SELECT

rn = ROW_NUMBER() OVER(PARTITION BY AccountNumber, HexaNumber, HexaEffDate, HexaId, TransactionEffDate, TransactionExpDate, HexaCode, CompanyCode, CategoryCode, AdjustmentUserId

ORDER BY (SELECT NULL)),

*

FROM CodeChanges

should identify your dupes. You might have to adjust the column list in the PARTITION BY clause. When you are sure that the column list is correct, "primary" values have rn = 1 and "duplicate" values have rn > 1.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Jackson89 Ten Centuries Points: 1141 More actions · Answer 2

Hi Chris,

Thanks very much for the advise !

I have one doubt, please clarify on the below quotes,

You might have to adjust the column list in the PARTITION BY clause

You mean if I have anymore columns to add in the condition, then I need to add, right ?

When you are sure that the column list is correct, "primary" values have rn = 1 and "duplicate" values have rn > 1.

you mean the output will show all the records which is unique as well as duplicate ?

I just need the duplicate records.

The count of CodeChanges table is 588067. I think its very high.

So just wanted to confirm before running the query to avoid any database issue.

Please advise .

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

muralikrishna2489 (11/2/2015)
Hi Chris,
Thanks very much for the advise !
I have one doubt, please clarify on the below quotes,
You might have to adjust the column list in the PARTITION BY clause
You mean if I have anymore columns to add in the condition, then I need to add, right ?
When you are sure that the column list is correct, "primary" values have rn = 1 and "duplicate" values have rn > 1.
you mean the output will show all the records which is unique as well as duplicate ?
I just need the duplicate records.
The count of CodeChanges table is 588067. I think its very high.
So just wanted to confirm before running the query to avoid any database issue.
Please advise .

The column list in the PARTITION BY clause should exactly match the column list which defines a duplicate in your table.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Ed Wagner SSC Guru Points: 287026 More actions · Answer 4

muralikrishna2489 (11/2/2015)
you mean the output will show all the records which is unique as well as duplicate ?
I just need the duplicate records.
The count of CodeChanges table is 588067. I think its very high.
So just wanted to confirm before running the query to avoid any database issue.
Please advise .

Take a look at Chris's note below his query. The primary rows will have rn = 1. The duplicates (as defined by the column list in the PARTITION BY) will have rn > 1. To select only your duplicates, simply add the WHERE clause to the outer query:

WHERE rn > 1

FYI: If you're going to delete the duplicate rows, you can delete from a CTE. Instead of SELECTing from the CTE, change it to a DELETE from the CTE. This gives you the opportunity to see what you're going to delete before you delete it. Example:

WITH cte AS (

SELECT rn = ROW_NUMBER() OVER(PARTITION BY AccountNumber, HexaNumber, HexaEffDate, HexaId, TransactionEffDate, TransactionExpDate, HexaCode, CompanyCode, CategoryCode, AdjustmentUserId ORDER BY (SELECT NULL)), *

FROM CodeChanges

)

SELECT *

FROM cte

WHERE rn > 1;

In this code, if you change the "SELECT * FROM" to a "DELETE FROM" you will delete the duplicates.

Please make sure you understand this before running it in production.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions