June 30, 2010 at 8:34 pm
I have a table of around 200k records – this table holds groups of duplicate records.
I have used ranking function (Dense rank) to identify each dupe grouping within this table.
Dupes are themselves identified by other processes which all work fine.
For example:-
FirstName , Surname , Dense_Rank
John , Smith , 1
J , Smith , 1
Liz , MacDough , 2
, MAcDough , 2
Data in table is derived from many different sources – for this example let’s call ‘em Source1, Source 2, Source 3 etc
My question is this: Where duplicates are identified across multiple Sources, how can they be fairly flagged as MATCH in this table. I mean fairly across all the sources - ensuring there is no discrimination against any Source.
This is where i am stuck and in need of solution that works?
As an example, if there are 100 duplicates found between the Source1 and Source 2, 50 are to be Flagged from the Source1 and 50 are to be Flagged from the Source 2.
If there are 90 duplicates across Sources 1 2 & 3, 30 are to be Flagged from each Source1, Source2, Source3. This is to ensure there is no discrimination against Source.
To further illustrate the problem in simple terms - Below we have found 2 sets of Duplicates.
For the first lot of Dupes (i.e Dense_Rank =1) there are 3 records grouped together 2 are from source1 and 1 record in this grouping is from Source2. We select only 1 record from Source1 to keep within this first grouping and Flag others in this first group as a MACTH.(These matched records will later be removed , to ensure every row is unique)
FirstName,Surname, Dense_Rank,Source, Match
John , Smith , 1, Source1,
J, Smith , 1, Source2, Y
Sam John, Smith , 1, Source1, Y
For the second lot of Dupes (Dense_Rank = 2) we have 4 records from Source1 and 2.
It would be unfair to select a Source1 record here to keep, So we need to choose a Source2 record to keep and flag everything else within the second lot of duplicate rows.
FirstName,Surname, Dense_Rank, Source, Match
Liz, MacDough, 2, Source1, Y
, MAcDough, 2, Source2,
Elisabeth,MaDoughie, 2, Source1, Y
EM, MAcDough, 2, Source2, Y
Finally we may have a grouping of Dupes from Source1,2 & 3.
Now if we select either a Source1 or Source2 record to keep here it would be unfair to Source3.
Thus to meet the fairness requirement we must select only one Source3 record within this grouping
And Flag all the other records in this group.
FirstName, Surname, Dense_Rank,Source, Match
Liz, Pappa, 3, Source1,Y
, Pappa, 3, Source2,Y
Elisabeth, Papadapolous, 3, Source3,
EM, Paps, 3,Source2, Y
Papa, Lizzie, 3, Source3, Y
Result so far – from the first 3 lot of duplicate groupings we have selected 1 record from each source to keep and flagged other dupes for removal later on.
Now how can I do this in Tsql?????
July 1, 2010 at 1:24 am
Probably there's a better solution, but this is what I could put together quick'n'dirty:
-- Test table
DECLARE @Names TABLE (
FirstName varchar(50),
Surname varchar(50),
Source int
)
-- 1st set of duplicates
INSERT INTO @Names values ('John', 'Smith', 1)
INSERT INTO @Names values ('J', 'Smith', 2)
INSERT INTO @Names values ('J.', 'Smith', 3)
-- 2nd set of duplicates
INSERT INTO @Names values ('Liz', 'MacDough', 1)
INSERT INTO @Names values ('', 'MACDOUGH', 2)
INSERT INTO @Names values ('L', 'MacDough', 3)
-- 3rd set of duplicates
INSERT INTO @Names values ('Gianluca', 'Sartori', 1)
INSERT INTO @Names values ('G', 'Sartori', 2)
INSERT INTO @Names values ('G.Luca', 'Sartori', 3)
-- 4th set of duplicates
INSERT INTO @Names values ('Bill', 'Gates', 1)
INSERT INTO @Names values ('B', 'Gates', 2)
INSERT INTO @Names values ('', 'Gates', 3)
-- 5th set of duplicates
INSERT INTO @Names values ('Steve', 'Jones', 1)
INSERT INTO @Names values ('', 'Jones', 2)
INSERT INTO @Names values ('S', 'Jones', 3)
-- Assign rank and find duplicates
;WITH Names AS (
SELECT *,
RK = DENSE_RANK() OVER (ORDER BY Surname ),
RN = ROW_NUMBER() OVER (PARTITION BY Surname ORDER BY Surname, Source)
FROM @Names
)
SELECT FirstName, Surname, Source,
Duplicate = CASE WHEN ISNULL(NULLIF(RK % 3,0),3) = RN THEN 1 ELSE 0 END
FROM Names
ORDER BY Surname, Source
Hope this helps
Gianluca
-- Gianluca Sartori
July 1, 2010 at 7:11 pm
Thankyou Gianluca.
Appreciate your reply.
I'll try this approach against the live data and see how it performs?
Would be very interested to see some different ways to (better) solve this problem.
Thanks again for taking time to answer.
Cheers
Viewing 3 posts - 1 through 2 (of 2 total)
You must be logged in to reply to this topic. Login to reply