April 1, 2014 at 4:19 am
Hi there,
I have a table that has multiple duplicate records. I do not want to truncate the table; I just want to keep only the unique records. As you can see below, there is no primary key. The table has 400,000,000 records.
--===== If the test table already exists, drop it
IF OBJECT_ID('TempDB..#mytable','U') IS NOT NULL
DROP TABLE #mytable
--===== Create the test table with
CREATE TABLE #mytable
(
ID varchar(10) not null,
name varchar(100) not null,
dateinserted datetime
)
INSERT INTO #mytable
(ID, name, dateinserted)
SELECT 'TIX123','cocacola',getdate() UNION ALL
SELECT 'TIX123','cocacola',getdate() UNION ALL
SELECT 'TIX999','pepsi',getdate() UNION ALL
SELECT 'TIX523','nbc',getdate() UNION ALL
SELECT 'TIX999','pepsi',getdate()
I want to delete one of the duplicate records for cocacola and pepsi. If it were a small table, I could do this:
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
FROM #mytable
WHERE ID IN ('TIX123','TIX999')
)
DELETE FROM Ordered
WHERE RN > 1;
How can I improve this query, if there are million records with different names?
Thanks
Rash
April 1, 2014 at 7:05 am
The only way I've been able to do this before is to delete in batches. The problem is, this is a slow process. I had to do this for multiple tables with hundreds of millions of rows, and it required me to run it manually over a few weekends to complete.
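For reference, the batched approach I mean looks roughly like this. It is only a sketch against the OP's sample table; the batch size and the `ORDER BY (SELECT NULL)` tie-breaker are assumptions you would tune for your own data:

```sql
-- Sketch: remove duplicates in batches via a deletable CTE.
-- Batch size of 50,000 is an assumption; adjust to balance speed vs. log growth.
DECLARE @batch int = 50000;

WHILE 1 = 1
BEGIN
    ;WITH Ordered AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY ID, name, dateinserted
                                  ORDER BY (SELECT NULL)) AS RN
        FROM #mytable
    )
    DELETE TOP (@batch) FROM Ordered
    WHERE RN > 1;          -- keep one row per duplicate group

    IF @@ROWCOUNT < @batch BREAK;  -- last batch was partial, so we're done
END
```

Running it in a loop like this keeps each transaction small, which limits lock escalation and log pressure, but it still has to scan repeatedly, which is why it took me weekends to finish.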
April 1, 2014 at 7:59 am
So you have a temp table with 400 million rows??? Even if it is actually a persistent table why do you have a table that large with no primary key? Do you at least have a clustered index? Finding and removing duplicates from a 400 million row heap is going to be painfully slow no matter how you code it.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
April 1, 2014 at 8:13 am
@SSChampion, I do have a clustered index on ID. The table was created by somebody else and had not been maintained until it ran out of space. So I have been assigned to take care of this, and as a newbie I am trying to figure out the best way to do it.
April 1, 2014 at 8:16 am
So you have duplicate ID and name. Which dateinserted do you want to keep?
April 1, 2014 at 8:24 am
@SSChampion, many of these records have the same dateinserted, so we want one record per unique (ID, name, dateinserted) combination.
April 1, 2014 at 8:58 am
Maybe you can aggregate your data into another table, drop the current table and rename the newly created one?
Since you can't rename a temp table, I changed your sample code to use a persistent table as an example.
--===== If the test table already exists, drop it
IF OBJECT_ID('mytable','U') IS NOT NULL
DROP TABLE mytable
--===== Create the test table with
CREATE TABLE mytable
(
ID varchar(10) not null,
name varchar(100) not null,
dateinserted datetime
)
INSERT INTO mytable
(ID, name, dateinserted)
SELECT 'TIX123','cocacola',getdate() UNION ALL
SELECT 'TIX123','cocacola',getdate() - 10 UNION ALL
SELECT 'TIX999','pepsi',getdate() UNION ALL
SELECT 'TIX523','nbc',getdate() UNION ALL
SELECT 'TIX999','pepsi',getdate() - 10
select ID, name, MIN(dateinserted) as DateInserted
into SomeTable
from mytable
group by ID, name
drop table mytable
exec sp_rename 'SomeTable', 'mytable'
select * from mytable
drop table mytable
April 1, 2014 at 9:06 am
Since he has 400 million rows, should this be done with one statement or in batches? Moving hundreds of millions of records is probably going to take a while.
April 1, 2014 at 9:43 am
JoshDBGuy (4/1/2014)
Since he has 400 million rows, should this be done with one statement or should this be done in batches? Moving hundreds of millions of records is probably going to take awhile.
Yes indeed it will. You could do those inserts in batches to ease some pressure for sure.
I might even restore a db backup to another server and do this. Just need to be careful that you capture any data changes in between the backup date and the current time.
April 1, 2014 at 10:03 am
@SSChampion, by batches do you mean, say, based on certain dates?
May 28, 2014 at 1:19 pm
You could try
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
FROM #mytable
WHERE ID IN ('TIX123','TIX999')
)
SELECT * FROM Ordered
WHERE RN = 1
to move the data to a new table, with SSIS or SELECT ... INTO, then truncate the existing table. I think this would help with the processing time.
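If you take the SELECT ... INTO route, the INTO clause goes inside the query that reads from the CTE. A minimal sketch, assuming the OP's sample table and a hypothetical #deduped target:

```sql
-- Sketch: keep one row per name, writing survivors to a new table.
-- #deduped is a hypothetical target name, not from the original posts.
;WITH Ordered AS
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
    FROM #mytable
)
SELECT ID, name, dateinserted
INTO #deduped           -- creates the table from the SELECT's column list
FROM Ordered
WHERE RN = 1;
```

Part of the appeal here is that SELECT ... INTO can be minimally logged under the SIMPLE or BULK_LOGGED recovery model, which matters when you are moving hundreds of millions of rows.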
----------------------------------------------------
May 28, 2014 at 1:26 pm
MMartin1 (5/28/2014)
You could try
WITH Ordered AS
(
SELECT*, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY Id) AS RN
FROM #mytable where ID
in ('TIX123','TIX999')
)
select * from ordered
where RN=1
to move the data to a new table, with SSIS or select into, then truncate the existing table. I think this would help with the processing time.
This is pretty similar to what the OP tried before they posted. This is going to be horribly slow on 400 million rows.
May 28, 2014 at 1:41 pm
By batches, it is meant that you can try something like this:
WITH Ordered AS
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN,
        ROW_NUMBER() OVER (ORDER BY ID) AS absolute_RN
    FROM #mytable
    WHERE ID IN ('TIX123','TIX999')
)
SELECT *
INTO #MyWorkingTable
FROM Ordered
GO
DECLARE @low bigint;
DECLARE @high bigint;
DECLARE @max_rn bigint;

SELECT @max_rn = COUNT(*) FROM #mytable; /* assumes more than 10k records in the source table */
SET @low = 1;
SET @high = @low + 9999;

WHILE @max_rn >= @low
BEGIN
    DELETE FROM #MyWorkingTable
    WHERE absolute_RN >= @low AND absolute_RN < @high
      AND RN > 1;

    SET @low = @high;
    SET @high = @low + 9999;
END
This brings the data, with two new columns, into another table. You work on that table, then truncate the original and bring what is left back into the original table. Be careful with log size and disk space here though, since you will be making a copy of a large table. Since you have no keys, I assume there is also no issue with foreign keys? Another thing to consider.
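The final step described above (truncate the original and bring the survivors back) would look something like this sketch; the names follow the example code, and the column list deliberately leaves the two helper ROW_NUMBER columns behind:

```sql
-- Sketch: after the batched deletes, copy the surviving rows back.
TRUNCATE TABLE #mytable;

INSERT INTO #mytable (ID, name, dateinserted)
SELECT ID, name, dateinserted   -- RN and absolute_RN stay in the working table
FROM #MyWorkingTable;
```

On the real 400-million-row table you would likely want this insert done in batches too, for the same log-size reasons mentioned above.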
May 28, 2014 at 1:43 pm
MMartin1 (5/28/2014)
You could try
WITH Ordered AS
(
SELECT*, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY Id) AS RN
FROM #mytable where ID
in ('TIX123','TIX999')
)
select * from ordered
where RN=1
to move the data to a new table, with SSIS or select into, then truncate the existing table. I think this would help with the processing time.
This is pretty similar to what the OP tried before they posted. This is going to be horribly slow on 400 million rows.
I don't think we can get around the process being slow; it's just a question of which approach is the fastest of the slow options.