February 2, 2012 at 9:34 am
Hi I have a table with millions of records, some duplicates are needed in that they have the same customer details with different dates, how can I structure my query to select and remove only the exact duplicates? As you can probably tell I am a total noob with SQL!
Thanks
February 2, 2012 at 9:52 am
You can use a CTE (Common Table Expression) for example:
CREATE TABLE #T(Date_Stamp DATETIME,KW1 DECIMAL(5,1), KW2 DECIMAL(5,1))
INSERT INTO #T
SELECT '12/10/2010', 5.3, 3.1 UNION ALL
SELECT '12/10/2010', 5.3, 3.1 UNION ALL
SELECT '12/9/2010', 4, 2 UNION ALL
SELECT '12/8/2010', 3, 1 UNION ALL
SELECT '12/7/2010', 7.4, 5 UNION ALL
SELECT '12/7/2010', 7.4, 5 UNION ALL
SELECT '12/7/2010', 7.4, 5
;with cte
as (select row_number() over(partition by Date_Stamp,KW1,KW2 order by Date_Stamp) as rn,
KW1,KW2
from #T)
SELECT * FROM cte
Once you have tested, and tested this, change the SELECT to
SELECT * FROM cte WHERE rn > 1. This should return only the duplicate values, and once you are sure of that replace the SELECT statement with a
DELETE FROM cte WHERE rn > 1.
AGAIN let me emphasis the need to test and test again to be sure that it is returning those rows that you want to delete.
Please read:
http://www.simple-talk.com/sql/t-sql-programming/sql-server-cte-basics/
if you need to learn more about a CTE. The article contains sample code with which you can practice and understand how the CTE works, before using it in your situation
Edited after SQLRNNR posting immediately below this one.
By all means read SQLRNNR suggestion - it is excellent !!!
February 2, 2012 at 10:09 am
Here's another article to help understand the process.
http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
February 2, 2012 at 3:50 pm
Here's another way you may consider. This approach involves the following steps:
1) Fetch the duplicates into a temp table. (The customers with same date in this case)
2) Delete the duplicates from the main table.
3) Insert back single occurance of records that appear multiple times.
--The main table (#T)
CREATE TABLE #T(Date DATETIME,FirstName VARCHAR(15), LastName VARCHAR(15))
INSERT INTO #T
SELECT '12/10/2010', 'Thaddeus', 'Jones'UNION ALL
SELECT '12/10/2010', 'Thaddeus', 'Jones'UNION ALL
SELECT '12/9/2010', 'Thaddeus', 'Jones' UNION ALL
SELECT '12/8/2010', 'Hannibal', 'Hayes' UNION ALL
SELECT '12/7/2010', 'Rachel', 'Mccormick'UNION ALL
SELECT '12/7/2010', 'Dawn', 'Pinto'UNION ALL
SELECT '12/7/2010', 'Dawn', 'Pinto'
--This table (#DUPS) will hold the records that appear multiple times in the main table
CREATE TABLE #DUPS (Date DATETIME, FirstName VARCHAR(15), LastName VARCHAR(15))
INSERT INTO #DUPS
SELECT Date, FirstName, LastName
FROM #T
GROUP BY Date,FirstName, LastName
HAVING COUNT(*) > 1
/*Select * from #DUPS --will return 12/7/2010,Dawn Pinto and 12/10/2010,Thaddeus Jones as they appear more than once in the main table*/
Now, let us delete all the records from the main table that appear multiple times. If you have millions of records in your table you might want to do a batch delete to avoid log issues.
--Delete all the records from the main table that appear multiple times
DELETE a
FROM #T a
INNER JOIN #DUPS b
ON a.DATE = b.DATE
AND a.FirstName = b.FirstName
AND a.LastName = b.LastName
Once, we have deleted all the duplicates with this method, we have also deleted the records that we want to occur only a single time. So, we'll insert those records back into the main table from our #DUPS table.
INSERT INTO #T SELECT * FROM #DUPS
SELECT * FROM #T
DROP TABLE #T
DROP TABLE #DUPS
Finally, we have all the unique records in our main table including the same customers having different dates. 🙂
Date FirstName LastName
2010-12-07 00:00:00.000DawnPinto
2010-12-10 00:00:00.000ThaddeusJones
2010-12-09 00:00:00.000ThaddeusJones
2010-12-08 00:00:00.000HannibalHayes
2010-12-07 00:00:00.000RachelMccormick
Again, testing is the key for achieving smart solutions. 😉
--
"The greatest ignorance is being proud of your learning"
________________________________________________________________
"The greatest ignorance is being proud of your learning"
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply