June 28, 2017 at 8:15 am
I have a very wide table, about 46 columns, with approximately 328 million rows. We are trying to dedup the table but keep running into out-of-memory errors.
I have tried a couple of techniques and really want to avoid using a cursor. What would be the best way to dedup these tables? (This one is the largest in size and presents the biggest problem.)
I would like to do this a little dynamically. We want to use some parameters to loop through the table and dedup it that way. There are approximately 928 unique DataSet values (unique identifiers for batching) that sit in the middle of the table as a FK; we can use these to loop through the datasets.
Does anyone know of any examples I could look at, or have any suggestions?
CTE? SSIS? Thoughts?
June 28, 2017 at 8:34 am
SELECT DISTINCT *
INTO #TempTable
FROM OriginalTable;

TRUNCATE TABLE OriginalTable;

INSERT INTO OriginalTable
SELECT * FROM #TempTable;
Not knowing the structure of the table, you may need to break this into pieces or use a third staging table.
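If you go the "pieces" route, a rough sketch of batching by the DataSet identifier you mentioned might look something like this. The object and column names (dbo.OriginalTable, DataSetID) are assumptions, so adjust to your schema. Because DataSetID is itself a column of the row, full-row duplicates always share the same DataSetID, so deduping one batch at a time is safe.

-- Rough sketch only: dbo.OriginalTable and DataSetID are assumed names.
-- (As with the code above, this assumes no IDENTITY column is in the way.)
SELECT DISTINCT DataSetID
INTO #DataSets
FROM dbo.OriginalTable;

DECLARE @DataSetID int;  -- assumed data type

WHILE EXISTS (SELECT 1 FROM #DataSets)
BEGIN
    SELECT TOP (1) @DataSetID = DataSetID
    FROM #DataSets
    ORDER BY DataSetID;

    -- Distinct rows for just this batch...
    SELECT DISTINCT *
    INTO #OneBatch
    FROM dbo.OriginalTable
    WHERE DataSetID = @DataSetID;

    -- ...swapped back in as one small transaction.
    BEGIN TRANSACTION;

    DELETE FROM dbo.OriginalTable
    WHERE DataSetID = @DataSetID;

    INSERT INTO dbo.OriginalTable
    SELECT *
    FROM #OneBatch;

    COMMIT TRANSACTION;

    DROP TABLE #OneBatch;

    DELETE FROM #DataSets
    WHERE DataSetID = @DataSetID;
END;

DROP TABLE #DataSets;

Each batch is only a fraction of the 328 million rows, so the temp table, transaction and log usage stay small.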
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
June 28, 2017 at 8:38 am
Michael L John - Wednesday, June 28, 2017 8:34 AM
SELECT DISTINCT *
INTO #TempTable
FROM OriginalTable;

TRUNCATE TABLE OriginalTable;

INSERT INTO OriginalTable
SELECT * FROM #TempTable;

Not knowing the structure of the table, you may need to break this into pieces or use a third staging table.
Thanks, I'll give that a try.
June 28, 2017 at 12:10 pm
Stephen crocker - Wednesday, June 28, 2017 8:38 AM
Michael L John - Wednesday, June 28, 2017 8:34 AM
SELECT DISTINCT *
INTO #TempTable
FROM OriginalTable;

TRUNCATE TABLE OriginalTable;

INSERT INTO OriginalTable
SELECT * FROM #TempTable;

Not knowing the structure of the table, you may need to break this into pieces or use a third staging table.
Thanks, I'll give that a try.
Just be sure you have enough disk space for the tempdb system database, as this method will use as much disk space in tempdb as the existing table occupies, and at 328 million rows, that could be an awful lot. A row size in the original table of just 1,000 bytes would require 328 GIGABYTES !!!
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
June 28, 2017 at 1:35 pm
wouldn't it be better to calculate a checksum or binary checksum for the list of columns to test instead?
the space used would be substantially smaller, right? Then you could use a temp table with those values, and a CTE with a ROW_NUMBER to delete dupes?
SELECT CHECKSUM(col1,col2,col46)
FROM MyTable
GROUP BY CHECKSUM(col1,col2,col46)
HAVING COUNT(*) > 1
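Fleshing that out, one possible shape is below. It's only a sketch (dbo.MyTable and the column list are placeholders), and the checksum is used just to narrow the candidate rows while the final partition compares the real columns, for the collision reason that comes up later in the thread.

-- Stage only the checksums that occur more than once.
SELECT CHECKSUM(col1, col2, col46) AS chk
INTO #DupeCandidates
FROM dbo.MyTable
GROUP BY CHECKSUM(col1, col2, col46)
HAVING COUNT(*) > 1;

-- Delete the extra copies among just those candidate rows,
-- partitioning by the actual columns rather than the checksum.
WITH Dupes AS
(
    SELECT rn = ROW_NUMBER() OVER (PARTITION BY col1, col2, col46
                                   ORDER BY (SELECT NULL))
    FROM dbo.MyTable
    WHERE CHECKSUM(col1, col2, col46) IN (SELECT chk FROM #DupeCandidates)
)
DELETE FROM Dupes
WHERE rn > 1;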
Lowell
June 28, 2017 at 7:45 pm
Lowell - Wednesday, June 28, 2017 1:35 PM
wouldn't it be better to calculate a checksum or binary checksum for the list of columns to test instead?
the space used would be substantially smaller, right? Then you could use a temp table with those values, and a CTE with a ROW_NUMBER to delete dupes?
SELECT CHECKSUM(col1,col2,col46)
FROM MyTable
GROUP BY CHECKSUM(col1,col2,col46)
HAVING COUNT(*) > 1
Lowell makes a good point. I like the ROW_NUMBER approach myself.
Another thing to keep in mind is the size of the log file. The INSERTs in Mike's approach and the DELETEs in Lowell's approach are all fully logged operations. If it were me, I'd figure out how many rows you're going to keep versus delete and let that influence my decision.
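A quick way to get those numbers up front, just as a sketch with dbo.YourLargeTable and the column list as placeholders, is a single GROUP BY over the full column list:

-- rows_to_keep   = one survivor per distinct combination
-- rows_to_delete = every copy beyond the first in each group
SELECT rows_to_keep   = COUNT(*),
       rows_to_delete = SUM(cnt - 1)
FROM (
    SELECT COUNT(*) AS cnt
    FROM dbo.YourLargeTable
    GROUP BY col1, col2, /* ... all 46 columns ... */ col46
) AS g;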
June 30, 2017 at 6:48 pm
Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate. It should only be used for determining what COULD be duplicates and what are definitely not. For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates.
A short but pointed example follows... are they duplicates or not?
SELECT CS1 = CHECKSUM('A352KD')
      ,CS2 = CHECKSUM('A352NT')
;
Results... CHECKSUM thinks they're duplicates...
CS1         CS2
----------- -----------
141500177   141500177
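Something along these lines would do that secondary "brute force" check. It's only a sketch; dbo.MyTable, the column list, and the ID key column are assumed placeholders:

-- Confirm checksum "duplicates" by comparing the real columns.
SELECT KeptID = a.ID, DupeID = b.ID
FROM dbo.MyTable AS a
JOIN dbo.MyTable AS b
  ON  CHECKSUM(a.col1, a.col2, a.col46) = CHECKSUM(b.col1, b.col2, b.col46) -- cheap narrowing only
  AND a.ID < b.ID
WHERE EXISTS (SELECT a.col1, a.col2, a.col46   -- brute-force confirmation;
              INTERSECT                        -- INTERSECT also treats NULLs as equal,
              SELECT b.col1, b.col2, b.col46); -- unlike a chain of = comparisons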
--Jeff Moden
Change is inevitable... Change for the better is not.
July 1, 2017 at 1:51 am
Jeff Moden - Friday, June 30, 2017 6:48 PM
Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate. It should only be used for determining what COULD be duplicates and what are definitely not. For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates.
A short but pointed example follows... are they duplicates or not?
SELECT CS1 = CHECKSUM('A352KD')
      ,CS2 = CHECKSUM('A352NT')
;
Results... CHECKSUM thinks they're duplicates...
CS1         CS2
----------- -----------
141500177   141500177
hmmm...what version of SQL are you running, Jeff?
here's my results
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 1, 2017 at 3:46 am
WITH Rows AS
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY BINARY_CHECKSUM(*) ORDER BY BINARY_CHECKSUM(*)) AS [Count]
    FROM dbo.YourTable
)
DELETE FROM Rows WHERE [Count] > 1;
As was already mentioned, CHECKSUM and BINARY_CHECKSUM can have collisions and thus don't guarantee 100% uniqueness. You can alternatively use HASHBYTES(), which will be more accurate and has been advertised as collision free, but you still risk the issue of trust and there is an 8,000-byte input limitation. I've used both in the past and they've worked fine on both smaller and bigger tables. However, the trust issue is the big one that turns away most DBAs, and I don't think it's 100% unique either.
Alternatively, you can do something similar but partition by your actual fields, because that essentially groups by every field and counts it. Then you just DELETE in the same fashion with the CTE. This will be more accurate than hashing.
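In sketch form, with placeholder table and column names, and with ORDER BY (SELECT NULL) letting an arbitrary copy survive in each group since no key column was specified:

WITH Rows AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 /* ... list all 46 columns ... */, col46
                              ORDER BY (SELECT NULL)) AS [Count]
    FROM dbo.YourTable
)
DELETE FROM Rows
WHERE [Count] > 1;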
July 1, 2017 at 5:55 am
J Livingston SQL - Saturday, July 1, 2017 1:51 AM
Jeff Moden - Friday, June 30, 2017 6:48 PM
Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate. It should only be used for determining what COULD be duplicates and what are definitely not. For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates.
A short but pointed example follows... are they duplicates or not?
SELECT CS1 = CHECKSUM('A352KD')
      ,CS2 = CHECKSUM('A352NT')
;
Results... CHECKSUM thinks they're duplicates...
CS1         CS2
----------- -----------
141500177   141500177
hmmm...what version of SQL are you running, Jeff?
here's my results
2008 Dev and 2012 Enterprise. Glad to see they finally did something to move away from a simple Byte Level XOR in 2016 but I hope they didn't make it perform worse. Checksum has become a staple for me in improving the performance of constrained random data generation.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 1, 2017 at 6:02 am
Jeff Moden - Saturday, July 1, 2017 5:55 AM
J Livingston SQL - Saturday, July 1, 2017 1:51 AM
Jeff Moden - Friday, June 30, 2017 6:48 PM
Just remember that checksum is NOT reliable when it comes to the determination of what is a duplicate. It should only be used for determining what COULD be duplicates and what are definitely not. For anything that checksum returns as a duplicate, it's only a possibility of being a duplicate and further testing is required to determine if the items actually ARE duplicates.
A short but pointed example follows... are they duplicates or not?
SELECT CS1 = CHECKSUM('A352KD')
      ,CS2 = CHECKSUM('A352NT')
;
Results... CHECKSUM thinks they're duplicates...
CS1         CS2
----------- -----------
141500177   141500177
hmmm...what version of SQL are you running, Jeff?
here's my results
2008 Dev and 2012 Enterprise. Glad to see they finally did something to move away from a simple Byte Level XOR in 2016 but I hope they didn't make it perform worse. Checksum has become a staple for me in improving the performance of constrained random data generation.
I'd be happy to compare performance of random data generation on 2016 if you care to provide a script.
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 1, 2017 at 7:09 am
xsevensinzx - Saturday, July 1, 2017 3:46 AM
However, the trust issue is the big one that turns away most DBAs, and I don't think it's 100% unique either.
Here's the proof that virtually any form of HASHBYTES can have a collision (explore the links in the article at the following link and do other Yabingooglehoo searches on the subject)...
https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
While new algorithms, such as SHA2-512, certainly and seriously reduce the risk of a collision, there are still no guarantees that there won't be a collision and it could happen on your very next check for dupes. It all depends on what level of risk you want to take and what the level of damage will be if you do happen to "hit the lottery". If you cannot afford the risk or cannot afford for any damage to occur, then you must do a secondary "brute force" check between the compared items that are returned as being duplicated even if you're using SHA-512.
Shifting gears a bit, the same holds true for the Type 4 GUIDs that MS has been using for quite some time (since the mid '90s, IIRC). Although the number of possible values is quite large, they are NOT guaranteed to be "Globally Unique" despite their name. They're still "just" random numbers, and collisions between different machines are possible, especially since they came out with that bloody damned NEWSEQUENTIALID. That's one of the reasons why I store GUIDs that are used as a KEY in a column with a UNIQUE constraint (usually a PK, of course, just to add the NOT NULL part).
Further on the subject of GUIDs, I did the calculation a long time ago to give people an idea of the magnitude of how many values are available. If 1 GUID represented 1 mile, there are enough unique values to traverse 14 Quadrillion Milky Way Galaxies. Even with that, there is still no guarantee of uniqueness. Heh... to coin a phrase, "Even galaxies collide".
--Jeff Moden
Change is inevitable... Change for the better is not.
July 1, 2017 at 8:43 am
Getting back to the original problem, there are 46 columns in the table, but how many of them are required to ensure uniqueness? You only need to partition by the columns you want to use for the dedupe. If you have an option of which set of columns to use, I'd try to make the set as narrow as possible. Consider any NCIs you have on the table. If you had a unique NCI already on the table, that would be ideal.
I know you already know the approach, but I'm suggesting that you focus on getting the query that identifies the duplicates as small as possible.
To see what would be deleted:
WITH cte AS (
    SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY ID)
    FROM dbo.YourLargeTable
)
SELECT *
FROM cte
WHERE RN > 1;
To do the actual delete, simply change the SELECT * to a DELETE.
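Spelled out, the delete form would be (same placeholder columns and ID as above):

WITH cte AS (
    SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY col1, col2, col3 ORDER BY ID)
    FROM dbo.YourLargeTable
)
DELETE FROM cte
WHERE RN > 1;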
I hope this helps.
July 1, 2017 at 1:14 pm
Jeff Moden - Saturday, July 1, 2017 7:09 AM
xsevensinzx - Saturday, July 1, 2017 3:46 AM
However, the trust issue is the big one that turns away most DBAs, and I don't think it's 100% unique either.
Here's the proof that virtually any form of HASHBYTES can have a collision (explore the links in the article at the following link and do other Yabingooglehoo searches on the subject)...
https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
While new algorithms, such as SHA2-512, certainly and seriously reduce the risk of a collision, there are still no guarantees that there won't be a collision and it could happen on your very next check for dupes. It all depends on what level of risk you want to take and what the level of damage will be if you do happen to "hit the lottery". If you cannot afford the risk or cannot afford for any damage to occur, then you must do a secondary "brute force" check between the compared items that are returned as being duplicated even if you're using SHA-512.
Shifting gears a bit, the same holds true for the Type 4 GUIDs that MS has been using for quite some time (since the mid '90s, IIRC). Although the number of possible values is quite large, they are NOT guaranteed to be "Globally Unique" despite their name. They're still "just" random numbers, and collisions between different machines are possible, especially since they came out with that bloody damned NEWSEQUENTIALID. That's one of the reasons why I store GUIDs that are used as a KEY in a column with a UNIQUE constraint (usually a PK, of course, just to add the NOT NULL part).
Further on the subject of GUIDs, I did the calculation a long time ago to give people an idea of the magnitude of how many values are available. If 1 GUID represented 1 mile, there are enough unique values to traverse 14 Quadrillion Milky Way Galaxies. Even with that, there is still no guarantee of uniqueness. Heh... to coin a phrase, "Even galaxies collide".
Yep, pretty much. Not saying it was unique, but others have sold it as collision free. This is why it's best to just use ROW_NUMBER() and partition over every column rather than hashing it.
The reason I mentioned the "trust" piece was that a lot of people don't understand how hashing works, and even if new algorithms come out, I'm sure there will still be trust issues in the community.
July 2, 2017 at 9:36 am
xsevensinzx - Saturday, July 1, 2017 1:14 PM
Yep, pretty much. Not saying it was unique, but others have sold it as collision free. This is why it's best to just use ROW_NUMBER() and partition over every column rather than hashing it.
The reason I mentioned the "trust" piece was that a lot of people don't understand how hashing works, and even if new algorithms come out, I'm sure there will still be trust issues in the community.
Ah... just to clarify. I wasn't saying anything on the contrary to what you said and hope you didn't take it that way. I absolutely agree with what you said and just used your good post to springboard into a rant about how many people continue to shoot themselves in the head by making assumptions. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.