February 6, 2014 at 8:43 am
What would be the best way, using T-SQL, to deduplicate this 800 million-row table? It has almost 100 million full-row duplicates (no PK yet), i.e. rows with a count > 1 after grouping by all fields.
Any suggestions?
Thanks!
Likes to play Chess
February 6, 2014 at 8:50 am
Whichever way you do it, it's not going to be super fast.
You may want to try:
SELECT DISTINCT * INTO NewTable FROM ExistingTable
DROP TABLE ExistingTable
EXEC sp_rename 'NewTable' , 'ExistingTable'
February 6, 2014 at 8:52 am
Ouch. An 800 million row heap?
Given that amount of data, and that performance is going to be painful, here is one idea.
Do a select distinct into a new table. Then drop the original table. Finally rename the new table to the name of the original table. Make sure you do this in a test environment first.
Let's assume your base table is named MyTable and has 3 columns (Col1, Col2, Col3).
select distinct Col1, Col2, Col3
into MyNewTable
from MyTable;
drop table MyTable;
EXEC sp_rename 'dbo.MyNewTable', 'MyTable';
--EDIT--
Seems Eugene was faster at the keyboard than I was.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 6, 2014 at 9:30 am
With 800M rows, I would use a bulk export/import for this. I see 2 options:
Use bcp to export the results of a query with SELECT DISTINCT (or GROUP BY all columns), truncate the table, and bulk insert the data back (roughly sketched below).
OR
Use bcp to export the whole table, truncate the table, create a PK, and bulk insert all the data with the maximum error count set high enough that the load continues past the duplicate-key rejections.
It might be slower than the other options (or not, depending on how the commit behaves), but you won't lose your permissions or other metadata associated with the table.
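A rough sketch of the first option, using the three-column example table from the earlier post; the database name, server name, and file path are placeholders, and the bcp call assumes a trusted connection:
-- From a command prompt: export the distinct rows in native format.
bcp "SELECT DISTINCT Col1, Col2, Col3 FROM MyDb.dbo.MyTable" queryout "D:\temp\MyTable_distinct.dat" -n -S MyServer -T
-- Then in T-SQL: empty the table and load the deduplicated file back in.
TRUNCATE TABLE dbo.MyTable;
BULK INSERT dbo.MyTable
FROM 'D:\temp\MyTable_distinct.dat'
WITH (DATAFILETYPE = 'native', TABLOCK, BATCHSIZE = 1000000);
TABLOCK plus a batch size keeps the load from running as one enormous transaction.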
February 6, 2014 at 9:44 am
Luis Cazares (2/6/2014)
With 800M rows, I would use a bulk export/import for this. I see 2 options: Use bcp to export the results of a query with SELECT DISTINCT (or GROUP BY all columns), truncate the table, and bulk insert the data back.
OR
Use bcp to export the whole table, truncate the table, create a PK, and bulk insert all the data with the maximum error count set high enough that the load continues past the duplicate-key rejections.
It might be slower than the other options (or not, depending on how the commit behaves), but you won't lose your permissions or other metadata associated with the table.
I guess if the table is a heap, it is unlikely to have a lot of metadata.
Scripting out and restoring identical permissions (and other metadata if needed) should not be a big problem.
It will take quite some time to get 800 million rows cleaned using any method...
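If you do go the drop-and-rename route, one quick way to see what would need to be re-granted is to list the object-level permissions before dropping the table (the table name below is the example one from the earlier post):
-- List permissions granted directly on the table so they can be re-applied afterwards.
SELECT  pr.name AS principal_name,
        pe.permission_name,
        pe.state_desc
FROM    sys.database_permissions AS pe
JOIN    sys.database_principals  AS pr
        ON pr.principal_id = pe.grantee_principal_id
WHERE   pe.class = 1                           -- object/column level
  AND   pe.major_id = OBJECT_ID('dbo.MyTable');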
February 6, 2014 at 9:58 am
Sean Lange (2/6/2014)
Ouch. An 800 million row heap? Given that amount of data, and that performance is going to be painful, here is one idea.
Do a select distinct into a new table. Then drop the original table. Finally rename the new table to the name of the original table. Make sure you do this in a test environment first.
Let's assume your base table is named MyTable and has 3 columns (Col1, Col2, Col3).
select distinct Col1, Col2, Col3
into MyNewTable
from MyTable;
drop table MyTable;
EXEC sp_rename 'dbo.MyNewTable', 'MyTable';
--EDIT--
Seems Eugene was faster at the keyboard than I was.
I have had to do this on a number of occasions for clients. Copying the table as described here is often the fastest way of doing it.
Another option is to select just the dupes into a new table, dedupe against that new table, and so on; see the sketch below.
But before going too far, I wonder if there is any way whatsoever to distinguish these dupes from each other. Having any single field that is different can be a huge time saver when deduping this data.
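A minimal sketch of that dupes-table approach, again against the three-column example table and assuming none of the columns are nullable (NULLs would break the equality join):
-- Copy one instance of every duplicated row into a working table.
SELECT  Col1, Col2, Col3
INTO    dbo.DupeRows
FROM    dbo.MyTable
GROUP BY Col1, Col2, Col3
HAVING  COUNT(*) > 1;
-- Remove every copy of those rows from the original table.
DELETE  t
FROM    dbo.MyTable AS t
        JOIN dbo.DupeRows AS d
          ON  d.Col1 = t.Col1
          AND d.Col2 = t.Col2
          AND d.Col3 = t.Col3;
-- Put a single copy of each back, then clean up.
INSERT INTO dbo.MyTable (Col1, Col2, Col3)
SELECT  Col1, Col2, Col3
FROM    dbo.DupeRows;
DROP TABLE dbo.DupeRows;
With roughly 100 million duplicates the DELETE is still a very large logged operation, so batching it would be worth considering.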
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
February 6, 2014 at 11:10 am
Problems with your advice so far:
1 - If I use any of these methods, I will have to face recreating the indexes on the new table. There are several non-clustered indexes, and each of them will probably take 4 or 5 hours to build.
2 - I will also have to deal with disk space (copying the table into another table and then dropping the original later requires twice as much space as the original table already occupies).
Are there any other approaches you would suggest that do not involve dropping the original table?
Thanks.
Likes to play Chess
February 6, 2014 at 11:13 am
Well, if you could answer the last part of my previous post, then there may be other possibilities that would be acceptable and not take excessive resources.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
February 6, 2014 at 1:52 pm
<< I wonder if there is any way whatsoever to distinguish these dupes from each other. >> Well, it is NOT KNOWN.
It will take a while until we complete the study of how that can be addressed; we are waiting for a lot of input, meetings with the business users, etc.
It will be a while before that question can be answered -- weeks, if not months. That is why I want to deduplicate the table and create another non-clustered index now; it is the least I can do before any design/redesign is possible.
Likes to play Chess
February 6, 2014 at 1:58 pm
VoldemarG (2/6/2014)
<< I wonder if there is any way whatsoever to distinguish these dupes from each other. >> Well, it is NOT KNOWN. It will take a while until we complete the study of how that can be addressed; we are waiting for a lot of input, meetings with the business users, etc.
It will be a while before that question can be answered -- weeks, if not months. That is why I want to deduplicate the table and create another non-clustered index now; it is the least I can do before any design/redesign is possible.
If the assessment is not complete and it is unknown whether there is a way to distinguish the dupes from one another, I would not even venture anywhere close to deduping the data.
If there were a single attribute in the table that could help distinguish one dupe record from its counterpart, then that attribute could easily be used to help create a faster process.
The faster process could involve indexes.
It could involve creating a new table to dump just the dupes into, then deleting from the master table based on the attribute that has been identified; a hypothetical sketch follows below.
It could also be used to help delete the records in place in the table more quickly.
But for now, complete the data analysis and find a way to help distinguish these dupes.
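Purely as an illustration of that point, and assuming (hypothetically) a distinguishing column that does not exist in the real table yet -- say an identity column named RowId -- an in-place delete could keep the lowest RowId per group:
-- Hypothetical: RowId is a made-up distinguishing column; Col1-Col3 are the example columns, assumed non-nullable.
DELETE  t
FROM    dbo.MyTable AS t
WHERE   EXISTS ( SELECT 1
                 FROM   dbo.MyTable AS k
                 WHERE  k.Col1 = t.Col1
                   AND  k.Col2 = t.Col2
                   AND  k.Col3 = t.Col3
                   AND  k.RowId < t.RowId );   -- keep only the lowest RowId in each group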
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
February 6, 2014 at 2:40 pm
I will go ahead and give you one possible solution that I have been holding back on because I feel it would be a huge resource hog with the number of records we are looking at here.
I would rather get more information and have you use a far more efficient means - once the data is proven to be sound and understood.
That said, you have this option for deduping (which may or may not severely hamper resources due to the number of dupes).
http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/
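The article linked above describes a CTE-based dedupe; one common form of that pattern (not necessarily the exact code in the article) is a ROW_NUMBER() delete, which works in place with no new table and no drop/rename. A minimal sketch against the three-column example table:
-- Number the copies within each group of identical rows and delete all but the first.
WITH dupes AS
(
    SELECT  ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                               ORDER BY (SELECT NULL)) AS rn
    FROM    dbo.MyTable
)
DELETE FROM dupes
WHERE   rn > 1;
Deleting ~100 million rows this way is still one large transaction, so running it in batches (a TOP clause inside a loop) would be kinder to the transaction log.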
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events