August 5, 2011 at 2:54 am
Hi, Everyone.
I need help improving the performance of our clean-up script, which removes duplicate data by extracting the unique records from the source table and then inserting those records into a clean table.
I have attached the CREATE scripts for the two tables and the stored procedure we use. The clean tables are just replicas of these two tables; we simply added the extension "_clean" to the table names.
What we are planning is to clean the data one day at a time, transfer it to a clean table, and then somehow push the clean records back to the production server.
Additional Details
* "SampleTable" contains 41,125,020 Million records.
* "SampleTableDetails" contains 1,321,512,023 Billion records.
We tried to execute the stored procedure spSampleCleanup and here is the details of the timing:
Note: that this was on executed for our sample tables with the following records.
* "SampleTable" that contains 205,821 thousand records.
* "SampleTableDetails" that contains 5,129,312 million records.
+ Processed the 205,821 data in SampleTable and inserted the 180,213 records to the SampleTable_Clean table for 4 seconds.
+ Processed the 5,129,312 records in SampleTableDetails and inserted the 4,123,221 records to the SampleTableDetails_Clean table for 12 hours +.
Now, I don't know if this is how far the T-SQL query execution performance can go or there are some other ways to improve this.
Any help is highly appreciated!
August 5, 2011 at 7:55 am
Dunno, maybe this is more efficient?
INSERT INTO SampleTableDetails_Clean
SELECT *
FROM SampleTableDetails
WHERE SampleTableId IN (
SELECT * FROM @UniqueIds
)
AND DATED BETWEEN @startDate AND @endDate
changes to:
INSERT INTO SampleTableDetails_Clean
SELECT a.*
FROM SampleTableDetails a
JOIN @UniqueIds b
ON a.SampleTableId = b.SampleTableId
WHERE DATED BETWEEN @startDate AND @endDate
Don't know about your experience, but I've always found joining to be faster than WHERE ... IN clauses.
edit -- need SELECT a.* not SELECT * when there's a join.
Ben
^ Thats me!
----------------------------------------
01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
----------------------------------------
August 5, 2011 at 8:13 am
And another thing (apologies if this is overly simplistic), but if at all possible make sure your server has more RAM available to it (preferably over double) than there is data in that table. Otherwise you will find that the server processes 1.1 billion rows in x minutes and then returns the remaining 0.2 billion rows in x days!
And, err, on your new server, add a unique key constraint on the table so you don't have to do this again!
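Something along these lines, for example (just a sketch; which columns actually define a duplicate depends on your schema, so the column list here is an assumption):
ALTER TABLE SampleTableDetails_Clean
ADD CONSTRAINT UQ_SampleTableDetails_Clean_NoDupes
UNIQUE (SampleTableId, DATED); -- replace with whatever column combination should be unique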
Ben
^ Thats me!
----------------------------------------
01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
----------------------------------------
August 5, 2011 at 10:11 am
Thanks for the input, Ben. If I re-index the table, would that also help?
August 5, 2011 at 10:51 am
Another approach is to "clean up" the duplicates by using a CTE. For example:
CREATE TABLE #T(Date_Stamp DATETIME,KW1 DECIMAL(5,1), KW2 DECIMAL(5,1))
INSERT INTO #T
SELECT '12/10/2010', 5.3, 3.1 UNION ALL
SELECT '12/10/2010', 5.3, 3.1 UNION ALL
SELECT '12/9/2010', 4, 2 UNION ALL
SELECT '12/8/2010', 3, 1 UNION ALL
SELECT '12/7/2010', 7.4, 5 UNION ALL
SELECT '12/7/2010', 7.4, 5 UNION ALL
SELECT '12/7/2010', 7.4, 5
;with cte
as (select row_number() over(partition by Date_Stamp,KW1,KW2 order by Date_Stamp) as rn,
KW1,KW2
from #T)
SELECT * FROM cte
Result:
rn  KW1  KW2
1   7.4  5.0
2   7.4  5.0
3   7.4  5.0
1   3.0  1.0
1   4.0  2.0
1   5.3  3.1
2   5.3  3.1
Now using SELECT * FROM cte WHERE rn = 1 (which selects the "unique" row from each duplicate group):
Results:
rn  KW1  KW2
1   7.4  5.0
1   3.0  1.0
1   4.0  2.0
1   5.3  3.1
I have not tested the above code on a record set as large as the ones you are processing, but by selecting the "unique" row you'll eliminate the need to insert into your "clean" table, and it might run in a short enough time to be adequate.
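If you want to remove the duplicates in place rather than copy the keepers out, you can also delete through the CTE. A sketch against the same #T sample (untested at anywhere near your volumes):
;with cte
as (select row_number() over(partition by Date_Stamp,KW1,KW2 order by Date_Stamp) as rn
from #T)
-- every row after the first in each duplicate group is removed
DELETE FROM cte WHERE rn > 1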
August 5, 2011 at 12:14 pm
bitbucket-25253 (8/5/2011)
Another approach is to "clean up" the duplicates by using a CTE. ....I have not tested the above code on a record set as large as the ones you are processing, but by selecting the "unique" row you'll eliminate the need to insert into your "clean" table, and it might run in a short enough time to be adequate.
Thanks bitbucket, but I am already using a CTE in my stored procedure... please see the attachment on my first post.
August 5, 2011 at 12:24 pm
James Tech (8/5/2011)
bitbucket-25253 (8/5/2011)
Another approach is to "clean up" the duplicates by using a CTE. ....I have not tested the above code on a record set as large as the ones you are processing, but by selecting the "unique" row you'll eliminate the need to insert into your "clean" table, and it might run in a short enough time to be adequate.
Thanks bitbucket, but I am already using a CTE in my stored procedure... please see the attachment on my first post.
Sorry, as a matter of general practice I do not like opening attachments without knowing what is in them (a phobia about malicious software which may or may not be present).
August 6, 2011 at 1:05 pm
I guess my first three recommendations would be...
1. Stop using a table variable for this. The data is way too big.
2. Partition the target tables by date.
3. Consider creating an archive plan where only the last 3 to 6 months of data is kept in the "in use" table and the rest is kept in a "mostly offline" table.
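A rough sketch of what that archive step could look like (the archive table name is hypothetical, it assumes DATED marks the row's date, and in practice you'd do this in batches):
-- move anything older than 6 months into the "mostly offline" table
INSERT INTO SampleTableDetails_Archive
SELECT *
FROM SampleTableDetails_Clean
WHERE DATED < DATEADD(mm, -6, GETDATE());
DELETE FROM SampleTableDetails_Clean
WHERE DATED < DATEADD(mm, -6, GETDATE());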
--Jeff Moden
Change is inevitable... Change for the better is not.
August 7, 2011 at 11:30 pm
Jeff Moden (8/6/2011)
I guess my first three recommendations would be...
1. Stop using a table variable for this. The data is way too big.
2. Partition the target tables by date.
3. Consider creating an archive plan where only the last 3 to 6 months of data is kept in the "in use" table and the rest is kept in a "mostly offline" table.
Thanks Jeff and everyone.
1.) What would be the alternative to a table variable for data this large?
2.) Can you elaborate further on partitioning the target tables by date? Is this still done with PARTITION BY, or by physically partitioning the table?
Also, I noticed something in the query that I provided... in the WHERE clause:
...WHERE DATED BETWEEN @startDate AND @endDate
If I supply a date range, e.g. WHERE DATED BETWEEN '2011/07/01 0:0:0' AND '2011/07/01 23:59:59.999', it retrieves the July 2nd 00:00:00.000 records as well... any thoughts? The same thing happens with other date ranges; it retrieves records from the day after the end date.
3.) Please enlighten me on this item.
Thank you for your help Jeff.
August 8, 2011 at 5:18 pm
James Tech (8/7/2011)
Jeff Moden (8/6/2011)
I guess my first three recommendations would be...
1. Stop using a table variable for this. The data is way too big.
2. Partition the target tables by date.
3. Consider creating an archive plan where only the last 3 to 6 months of data is kept in the "in use" table and the rest is kept in a "mostly offline" table.
Thanks Jeff and everyone.
1.) What would be the alternative to a table variable for data this large?
2.) Can you elaborate further on partitioning the target tables by date? Is this still done with PARTITION BY, or by physically partitioning the table?
Also, I noticed something in the query that I provided... in the WHERE clause:
...WHERE DATED BETWEEN @startDate AND @endDate
If I supply a date range, e.g. WHERE DATED BETWEEN '2011/07/01 0:0:0' AND '2011/07/01 23:59:59.999', it retrieves the July 2nd 00:00:00.000 records as well... any thoughts? The same thing happens with other date ranges; it retrieves records from the day after the end date.
3.) Please enlighten me on this item.
Thank you for your help Jeff.
Sure...
On 1), a Table Variable will be perceived to contain only 1 row (under most circumstances) when the execution plan is made. That can cause some pretty nasty performance problems because SQL Server may do something like picking a loop join (etc) for smaller tables instead of (possibly) a merge or hash join which usually works better for larger tables.
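The usual alternative at this size is a real #Temp table, which gets actual statistics and a row count the optimizer can trust. A sketch only (it assumes @UniqueIds currently holds just the SampleTableId key; substitute whatever logic actually populates it):
-- a temp table with a clustered key in place of the @UniqueIds table variable
CREATE TABLE #UniqueIds (SampleTableId INT NOT NULL PRIMARY KEY CLUSTERED);
INSERT INTO #UniqueIds (SampleTableId)
SELECT DISTINCT SampleTableId
FROM SampleTable; -- or whatever currently fills @UniqueIds
-- then join to #UniqueIds exactly as you would to the table variable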
On 2), I'm talking about partitioning as you would find in example "H" under "Create Table" in Books Online. Physical partitioning across multiple file groups to bring more read-write heads (if your disk system is setup as such) into play would probably also be a benefit.
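For reference, a minimal sketch of partitioning by date (the function/scheme names, boundary dates, and single-filegroup mapping are all assumptions; the real thing would spread partitions across multiple filegroups):
CREATE PARTITION FUNCTION pfSampleByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('20110601', '20110701', '20110801');
CREATE PARTITION SCHEME psSampleByMonth
AS PARTITION pfSampleByMonth ALL TO ([PRIMARY]);
-- the target table is then created (or rebuilt) on the partition scheme,
-- partitioned by its date column
CREATE TABLE SampleTableDetails_Clean_Partitioned
(
    SampleTableId INT NOT NULL,
    DATED         DATETIME NOT NULL
    -- ...remaining columns from SampleTableDetails_Clean...
)
ON psSampleByMonth (DATED);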
On 3), the DateTime datatype has a resolution of 3.3 milliseconds and will, of course, round up to the next day for times using 23:59:59.999. You should never use a time when you're trying to get a whole day. If you were trying to get all of the data for all of yesterday without regard to the time of day, then you should use code like the following (assuming that @DesiredDate is a "whole date" with a midnight time or no time element)...
WHERE somedatetimecolumn >= @DesiredDate AND somedatetimecolumn < DATEADD(dd,1,@DesiredDate)
Heh... and don't let anyone talk you into using 23:59:59.997 or any other time because it just makes a mess of things. 🙂
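Applied to the DATED filter from your procedure (assuming @startDate and @endDate are whole dates with no time portion), that pattern would look like:
-- half-open range: on or after @startDate, and strictly before the day after @endDate
WHERE DATED >= @startDate
AND DATED < DATEADD(dd, 1, @endDate)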
--Jeff Moden
Change is inevitable... Change for the better is not.