October 23, 2002 at 6:56 pm
I periodically get 12 million records with names, addresses and phone numbers. In the data there are some duplicate phone numbers. Where there are duplicates, I only want one of the records - which one does not matter. Is there an efficient way to get rid of the dups without running a cursor?
Thanks,
Dan
October 23, 2002 at 7:56 pm
Depending on how you get the records and what fields are duplicated you have multiple options. How many columns must have the same data to be considered a duplicate. If all then SELECT DISTINCT will handle. If less then is there any existing field such as a date which determines which to keep or is it just keep one. If keep one then setup a processing table with an IDENTITY column to key off of.
Then query something like this.
SELECT * FROM tblProc oq WHERE IDX = (SELECT TOP 1 IDX FROM tblProc iq WHERE iq.Col1 = oq.Col1 AND iq.Col2 = oq.Col2 AND iq.Col3 = oq.Col3 ORDER BY IDX)
-- Basically where the duplicating fields are equal.
Now if you have a date or other unique column and want to keep the last record query like this on the data (again if coming from another server may not need processing table)
SELECT * FROM tblProc oq WHERE ColDate = (SELECT Max(ColDate) FROM tblProc iq WHERE iq.Col1 = oq.Col1 AND iq.Col2 = oq.Col2 AND iq.Col3 = oq.Col3)
It all boils down to how you decide what to keep and what to throw away.
"Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)
October 24, 2002 at 4:54 am
A quick and dirty alternative is to make a copy of the table structure and create a unique index with ignore dupes on your key field(s). Then append the data, applying an order by that makes sense (keep newest, keep oldest, etc).
Andy
October 25, 2002 at 10:42 am
Hi,
Maybe you can try this if you want to avoid creating any temp table or using any cursor -->
SET ROWCOUNT 1
SELECT 'Start Removing Duplicate Record...' -- Don't remove, seed record
WHILE (@@ROWCOUNT>=1)
BEGIN
DELETE TABLE_1 WHERE COL_1 in ( SELECT COL_1 FROM TABLE_1 GROUP BY COL_1 HAVING COUNT(*) > 1)
END
SET ROWCOUNT 0
SELECT * FROM TABLE_1
Basically it just a WHILE..LOOP process to keep scanning TABLE_1, until there is not more duplicate record, @@ROWCOUNT will return 0 and the process will stop.
Another alternative/dirty way which I used to do that before I came out with this script is :
SELECT DISTINCT * INTO TEMP_TABLE FROM TABLE_1
TRUNCATE TABLE TABLE_1
INSERT INTO TABLE_1 SELECT * FROM TEMP_TABLE
I find this still the simplest way, but it required high IO resources.
Hope this help!!!
or you can reach me at :
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply