January 23, 2015 at 10:49 am
Sioban Krzywicki (1/23/2015)
Sure! That's one way you can do it. Want to post some code to share with people? I find that multiple ways of doing things are both nice to know and helpful if one way runs into trouble.
For example, sometimes CTEs can run long on large datasets.
There is an example on this thread!
http://www.sqlservercentral.com/Forums/FindPost1502706.aspx
__________________________________________________________________________________________________________
How to Post to get the most: http://www.sqlservercentral.com/articles/Best+Practices/61537/
January 23, 2015 at 11:01 am
How do you delete a duplicate from a Heap?
If the table is a true heap with no PK and the records in every column match, you would have to delete all the rows and then insert the values back in only once.
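Something along these lines is what I mean; just a rough sketch, with a made-up table name and a made-up duplicated value:
[font="Courier New"]
-- Keep one copy of the duplicated row(s) off to one side
SELECT DISTINCT * INTO #keepOne FROM dbo.MyHeap WHERE Col1 = 'dup value';
-- Remove every copy from the heap, then put the single copy back
DELETE FROM dbo.MyHeap WHERE Col1 = 'dup value';
INSERT INTO dbo.MyHeap SELECT * FROM #keepOne;
DROP TABLE #keepOne;
[/font]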
January 23, 2015 at 11:40 am
PHYData DBA (1/23/2015)
How do you delete a duplicate from a Heap? If the table is a true heap with no PK and the records in every column match, you would have to delete all the rows and then insert the values back in only once.
This method works just fine on a heap. You may notice I don't bother declaring any index on the table in the example because this method doesn't rely on a key. You Partition By all the columns that might have duplicates. The example actually shows this very thing.
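If it helps to see it spelled out, here's a bare-bones sketch of the idea against a heap; the table and column names are invented for illustration:
[font="Courier New"]
-- Number the rows within each group of identical values;
-- anything with rn > 1 is a duplicate and gets deleted.
WITH dupes AS
(
    SELECT ROW_NUMBER() OVER
           (PARTITION BY Col1, Col2, Col3 ORDER BY Col1) AS rn
    FROM dbo.MyHeap
)
DELETE FROM dupes
WHERE rn > 1;
[/font]
No key or index is needed for this to work, which is why it's fine on a heap.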
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
January 27, 2015 at 8:27 am
Nice clear example, Sioban. Thanks. I frequently come across clients who have unintended duplicates in their tables. Data cleansing is a major part of the ETL process in some organisations.
Rgds, Dave.
--------------------------------------
You must come with me, or you'll be late.
Late for what?
Late as in the late Arthur Dent. -- Slartibartfast
--------------------------------------
January 28, 2015 at 3:44 am
My way of working:
[font="Courier New"]
BEGIN TRAN
-- Copy one of each distinct row into a temp table
SELECT DISTINCT *
INTO #tempTable
FROM myTable
-- Empty the original table, then put the de-duplicated rows back
TRUNCATE TABLE myTable
INSERT INTO myTable
SELECT *
FROM #tempTable
COMMIT TRAN
DROP TABLE #tempTable
[/font]
January 28, 2015 at 9:04 am
hans.reidt (1/28/2015)
My way of working:
[font="Courier New"]
BEGIN TRAN
SELECT DISTINCT *
INTO #tempTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable
SELECT *
FROM #tempTable
COMMIT TRAN
DROP TABLE #tempTable
[/font]
That certainly is a way you can do it, though on a large table it might take longer or cause memory problems, depending on how big your server is. And it gets a little trickier if, say, you only want to check 4 of 20 columns for duplicates.
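For that partial-column case, something like this is the shape of it; a sketch only, with made-up names, where just four columns define a duplicate and the ORDER BY decides which copy survives:
[font="Courier New"]
WITH dupes AS
(
    SELECT ROW_NUMBER() OVER
           (PARTITION BY ColA, ColB, ColC, ColD
            ORDER BY SomeDateColumn DESC) AS rn   -- keep the newest copy
    FROM dbo.MyBigTable
)
DELETE FROM dupes
WHERE rn > 1;
[/font]
The SELECT DISTINCT / TRUNCATE approach can't make that choice for you, because rows that differ in any of the other 16 columns aren't duplicates as far as DISTINCT is concerned.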
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
January 28, 2015 at 9:12 am
I do this on tables with > 1,000,000 records.
Something to keep in mind, of course, is the number and complexity of the indexes.
A possibility is to disable all indexes, remove the duplicates and rebuild the indexes.
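Very roughly, that looks like this; the index and table names are made up, and note that only nonclustered indexes can be disabled this way, since disabling the clustered index makes the table unreadable:
[font="Courier New"]
-- Disable a nonclustered index before the cleanup
ALTER INDEX IX_MyTable_SomeColumn ON dbo.MyTable DISABLE;
-- ... remove the duplicates here ...
-- Rebuild everything afterwards (this also re-enables the disabled index)
ALTER INDEX ALL ON dbo.MyTable REBUILD;
[/font]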
Of course, when you have duplicates in your data, the problem is usually the import.
I try to solve that with a LEFT OUTER JOIN on the unique columns.
This way I get idempotence, which is nice.
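The import part, as a sketch (all names invented; the join columns are whatever should be unique):
[font="Courier New"]
-- Insert only the staging rows that aren't already in the target,
-- so running the import twice doesn't create duplicates.
INSERT INTO dbo.Target (KeyCol1, KeyCol2, SomeValue)
SELECT s.KeyCol1, s.KeyCol2, s.SomeValue
FROM dbo.Staging s
LEFT OUTER JOIN dbo.Target t
       ON t.KeyCol1 = s.KeyCol1
      AND t.KeyCol2 = s.KeyCol2
WHERE t.KeyCol1 IS NULL;
[/font]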
January 28, 2015 at 9:27 am
hans.reidt (1/28/2015)
I do this on tables with > 1,000,000 records. Something to keep in mind, of course, is the number and complexity of the indexes.
A possibility is to disable all indexes, remove the duplicates and rebuild the indexes.
Of course, when you have duplicates in your data, the problem is usually the import.
I try to solve that with a LEFT OUTER JOIN on the unique columns.
This way I get idempotence, which is nice.
Sure, I'm not saying it can't work. I'm saying that depending on your server's power & resources, it might not be a great option, especially when you get to really big tables: 10 million, 100 million rows or more. Or if you're doing this on a data import. And I don't think I'd want that code in any automated deduping procedures.
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams