August 11, 2004 at 9:34 am
I've been given a data file to load into my main database which has quite a few duplicates in it. The problem is that it's got no unique identifiers in it just address lines 1-6 and a postcode field.
I'm planning on doing a dedupe on address line 1 and postcode. The only problem now is I don't know how to write the dedupe. I've only ever deduped using an int field before. Could anyone please help me?
August 11, 2004 at 10:19 am
First, load into a staging table. Then you can use code like this:
-- Delete duplicates from mytable
alter table mytable
add id [int] IDENTITY (1, 1) NOT NULL
go
if exists (select * from dbo.sysobjects where id = object_id(N'[deleted_from_mytable]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [deleted_from_mytable]
GO
--save deleteds into temp table just in case
select c.* into deleted_from_mytable
From mytable c (nolock)
JOIN mytable c1 (nolock)
on c.[col1]+c.[col4] = c1.[col1]+c1.[col4]
where c.[id] > c1.[id]
go
--now delete the duplicates
delete c
From mytable c (nolock)
JOIN mytable c1 (nolock)
on c.[col1]+c.[col4] = c1.[col1]+c1.[col4]
where c.[id] > c1.[id]
go
ALTER TABLE mytable DROP COLUMN [id]
Bill
August 11, 2004 at 10:49 am
I'd agree with the above. Load into SQL and then dedup.
August 12, 2004 at 4:47 am
Thanks this is brilliant.
All my previous dedupes I've done have involved doing loops and therefore took a while, I knew a self join might be a way forward but never got it to work.
August 12, 2004 at 6:19 pm
If you have room for two copies of the data, I think this might run faster than the self-join and DELETE method.
First import the data into a staging table.
Then create another staging table with the same structure, and put a unique constraint on the dedupe fields using WITH IGNORE_DUP_KEY.
SELECT * INTO StageUnique FROM Staging WHERE 1=0
CREATE UNIQUE INDEX IX_StageUnique ON StageUnique (Addr1, PostCode)
WITH IGNORE_DUP_KEY
Now copy the data into the second table. All duplicates are rejected, which is faster than locating and deleting them later.
INSERT INTO StageUnique SELECT * FROM Staging ORDER BY Addr1, PostCode
You can add to the ORDER BY clause if you have a preference for which record among a set of duplicates is retained. For instance, you may want to keep records where Addr2 is defined if there are other records where it is missing.
INSERT INTO StageUnique SELECT * FROM Staging
ORDER BY Addr1, PostCode, CASE WHEN ISNULL(Addr2,'') = '' THEN 1 ELSE 0 END
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply