January 23, 2015 at 10:49 am
Sioban Krzywicki (1/23/2015)
Sure! That's one way you can do it. Want to post some code to share with people? I find that multiple ways of doing things are both nice to know and helpful if one way runs into trouble.
For example, sometimes CTEs can run long on large datasets.
There is an example on this thread!
http://www.sqlservercentral.com/Forums/FindPost1502706.aspx
__________________________________________________________________________________________________________
How to Post to get the most: http://www.sqlservercentral.com/articles/Best+Practices/61537/
January 23, 2015 at 11:01 am
How do you delete a duplicate from a Heap?
If the table is a true heap with no PK and the records in every column match, you would have to delete all the rows and then insert the values back in only once.
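Something along these lines is what I mean; just a rough sketch, with a made-up table name and a made-up duplicated value:
[font="Courier New"]
-- Keep one copy of the duplicated row(s) off to one side
SELECT DISTINCT * INTO #keepOne FROM dbo.MyHeap WHERE Col1 = 'dup value';
-- Remove every copy from the heap, then put the single copy back
DELETE FROM dbo.MyHeap WHERE Col1 = 'dup value';
INSERT INTO dbo.MyHeap SELECT * FROM #keepOne;
DROP TABLE #keepOne;
[/font]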
January 23, 2015 at 11:40 am
PHYData DBA (1/23/2015)
How do you delete a duplicate from a Heap? If the table is a true heap with no PK and the records in every column match, you would have to delete all the rows and then insert the values back in only once.
This method works just fine on a heap. You may notice I don't bother declaring any index on the table in the example because this method doesn't rely on a key. You Partition By all the columns that might have duplicates. The example actually shows this very thing.
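If it helps to see it spelled out, here's a bare-bones sketch of the idea against a heap; the table and column names are invented for illustration:
[font="Courier New"]
-- Number the rows within each group of identical values;
-- anything with rn > 1 is a duplicate and gets deleted.
WITH dupes AS
(
    SELECT ROW_NUMBER() OVER
           (PARTITION BY Col1, Col2, Col3 ORDER BY Col1) AS rn
    FROM dbo.MyHeap
)
DELETE FROM dupes
WHERE rn > 1;
[/font]
No key or index is needed for this to work, which is why it's fine on a heap.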
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
January 27, 2015 at 8:27 am
Nice clear example, Sioban. Thanks. I frequently come across clients who have unintended duplicates in their tables. Data cleansing is a major part of the ETL process in some organisations.
Rgds, Dave.
--------------------------------------
You must come with me, or you'll be late.
Late for what?
Late as in the late Arthur Dent. -- Slartibartfast
--------------------------------------
January 28, 2015 at 3:44 am
My way of working:
[font="Courier New"]
BEGIN TRAN
-- Copy one of each distinct row into a temp table
SELECT DISTINCT *
INTO #tempTable
FROM myTable
-- Empty the original table, then put the de-duplicated rows back
TRUNCATE TABLE myTable
INSERT INTO myTable
SELECT *
FROM #tempTable
COMMIT TRAN
DROP TABLE #tempTable
[/font]
January 28, 2015 at 9:04 am
hans.reidt (1/28/2015)
My way of working:
[font="Courier New"]
BEGIN TRAN
SELECT DISTINCT *
INTO #tempTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable
SELECT *
FROM #tempTable
COMMIT TRAN
DROP TABLE #tempTable
[/font]
That certainly is a way you can do it, though on a large table it might take longer or cause memory problems, depending on how big your server is. And it gets a little trickier if, say, you only want to check 4 of 20 columns for duplicates.
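For that partial-column case, something like this is the shape of it; a sketch only, with made-up names, where just four columns define a duplicate and the ORDER BY decides which copy survives:
[font="Courier New"]
WITH dupes AS
(
    SELECT ROW_NUMBER() OVER
           (PARTITION BY ColA, ColB, ColC, ColD
            ORDER BY SomeDateColumn DESC) AS rn   -- keep the newest copy
    FROM dbo.MyBigTable
)
DELETE FROM dupes
WHERE rn > 1;
[/font]
The SELECT DISTINCT / TRUNCATE approach can't make that choice for you, because rows that differ in any of the other 16 columns aren't duplicates as far as DISTINCT is concerned.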
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
January 28, 2015 at 9:12 am
I do this on tables with > 1,000,000 records.
Something to keep in mind, of course, is the number and complexity of the indexes.
A possibility is to disable all indexes, remove the duplicates and rebuild the indexes.
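Very roughly, that looks like this; the index and table names are made up, and note that only nonclustered indexes can be disabled this way, since disabling the clustered index makes the table unreadable:
[font="Courier New"]
-- Disable a nonclustered index before the cleanup
ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable DISABLE;
-- ... remove the duplicates here ...
-- Rebuild everything afterwards (this also re-enables the disabled index)
ALTER INDEX ALL ON dbo.MyTable REBUILD;
[/font]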
Of course, when you have duplicates in your data, the problem is usually the import.
I try to solve that with a LEFT OUTER JOIN on the unique columns.
This way I get idempotence, which is nice.
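The import part, as a sketch (all names invented; the join columns are whatever should be unique):
[font="Courier New"]
-- Insert only the staging rows that aren't already in the target,
-- so running the import twice doesn't create duplicates.
INSERT INTO dbo.Target (KeyCol1, KeyCol2, SomeValue)
SELECT s.KeyCol1, s.KeyCol2, s.SomeValue
FROM dbo.Staging s
LEFT OUTER JOIN dbo.Target t
       ON t.KeyCol1 = s.KeyCol1
      AND t.KeyCol2 = s.KeyCol2
WHERE t.KeyCol1 IS NULL;
[/font]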
January 28, 2015 at 9:27 am
hans.reidt (1/28/2015)
I do this on tables with > 1,000,000 records. Something to keep in mind, of course, is the number and complexity of the indexes.
A possibility is to disable all indexes, remove the duplicates and rebuild the indexes.
Of course, when you have duplicates in your data, the problem is usually the import.
I try to solve that with a LEFT OUTER JOIN on the unique columns.
This way I get idempotence, which is nice.
Sure, I'm not saying it can't work. I'm saying that depending on your server's power & resources, it might not be a great option, especially when you get to really big tables: 10 million, 100 million rows or more. Or if you're doing this on a data import. And I don't think I'd want that code in any automated deduping procedures.
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams