May 10, 2013 at 5:34 am
I have found a bug in my duplicate macthing application and I believe it's with the way I group my records at the end of the matching process.
Consider the following example (details are not real!)
52941 matches 8388 and 15799 because mobile number and name is the same
80833 matches 8388 and 15799 because name and address match
so this is same as saying 80833 matches 52941, 8388 AND 15799
So based on the second result set, I would like to perform a grouping of the outcome to look like this:
NOTE, I picked the highest value ID as my Base or Group ID.
Any suggestions?
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
May 10, 2013 at 5:39 am
You've been around long enough to know to provide consumable sample data.
May 10, 2013 at 5:45 am
Here you go:
create table SCC (Master_ID INT, Duplicate_ID INT)
INSERT INTO SCC
SELECT 52941, 8388 UNION ALL
SELECT 52941, 15799 UNION ALL
SELECT 80833, 8388 UNION ALL
SELECT 80833, 15799
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
May 10, 2013 at 6:26 am
My solution so far:
select DISTINCt c.baseID, CASE WHEN Master_ID = c.BaseID THEN a.Duplicate_ID else Master_ID END
from SCC as a
cross apply (select MAX(master_ID) AS BaseID
from SCC as b
where b.duplicate_id = a.duplicate_id) as c
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
May 10, 2013 at 9:56 am
Unfortunately my solution isn't working as expected. Anyone else with suggestions?
Here is some sample data:
with SCC_CTE (Retained_ID, Dropped_ID)
AS (Select 15799, 8388 union all
select 52941,8388 union all
select 52941, 15799 union all
select 80833, 8388 union all
select 80833, 15799)
select * from SCC_CTE
And here is the expected out output:
;with Result_CTE (Retained_ID, Dropped_ID)
AS (Select 80833, 8388 union all
select 80833,15799 union all
select 80833, 52941)
select * from Result_CTE
Thanks in advance.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
May 13, 2013 at 3:19 am
Sorry, I'd have replied sooner but things have been going on at work that take priority.
What you need is a recursive check through the list. So, I've knocked up a really quick demonstration, but I'd suggest looking into it yourself as I'm convinced that this is not the optimal query. Unfortunately, I'm a bit busy to dig in to it any further.
IF object_id('tempdb..#SCC') IS NOT NULL
BEGIN
DROP TABLE #SCC;
END;
--SAMPLE DATA
SELECT Retained_ID, Dropped_ID
INTO #SCC
FROM (VALUES(15799, 8388),(52941, 8388),(52941, 15799),(80833, 8388),(80833, 15799)
)a(Retained_ID, Dropped_ID);
-- EXPECTED RESULTS
SELECT Retained_ID, Dropped_ID
FROM (VALUES(80833, 8388),(80833, 15799),(80833, 52941))a(Retained_ID, Dropped_ID);
-- RECURSIVE QUERY
WITH CTE AS (
SELECT Retained_ID, Dropped_ID
FROM #SCC
UNION ALL
SELECT a.Retained_ID, CASE WHEN a.Retained_ID > b.Retained_ID THEN b.Retained_ID ELSE a.Dropped_ID END
FROM #SCC a
INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)
SELECT DISTINCT ca.Retained_ID, ca.Dropped_ID
FROM CTE q
CROSS APPLY (SELECT TOP 1 a.Retained_ID, a.Dropped_ID
FROM CTE a
WHERE a.Dropped_ID = q.Dropped_ID
ORDER BY a.Retained_ID DESC
) ca;
-- RECURSIVE QUERY
WITH CTE AS (
SELECT Retained_ID, Dropped_ID
FROM #SCC
UNION ALL
SELECT a.Retained_ID, b.Retained_ID
FROM #SCC a
INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)
SELECT MAX(Retained_ID) AS Retained_ID, Dropped_ID
FROM CTE
GROUP BY Dropped_ID;
May 13, 2013 at 6:03 am
Craig - very promising, but try adding these to the sample data set:
UNION ALL
SELECT 2426, 2341 UNION ALL
SELECT 2341, 1200
Abu Dina - are these two additional rows valid data, and if so, can you extend the data set please? I think it will help folks if there are more groups and perhaps a few rows where there are no associations at all.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
May 13, 2013 at 6:30 am
Thanks for all your help so far.
This is one way I've tried and it works I think but I'm sure you clever folks can come up with a better way :
declare @pairs table (dropped_id int, retained_id int)
insert into @pairs
SELECT 34336, 14553 UNION ALL
SELECT 39352, 14553 UNION ALL
SELECT 39352, 34336 UNION ALL
SELECT 39776, 34123 UNION ALL
SELECT 47833, 14553 UNION ALL
SELECT 47833, 34336 UNION ALL
SELECT 47833, 39352 UNION ALL
SELECT 47969, 14553 UNION ALL
SELECT 47969, 34336 UNION ALL
SELECT 47969, 39352 UNION ALL
SELECT 47969, 47833 UNION ALL
SELECT 48460, 14553 UNION ALL
SELECT 48460, 34336 UNION ALL
SELECT 48460, 39352 UNION ALL
SELECT 48460, 47833 UNION ALL
SELECT 48460, 47969 UNION ALL
SELECT 48463, 14553 UNION ALL
SELECT 48463, 34336 UNION ALL
SELECT 48463, 39352 UNION ALL
SELECT 48463, 47833 UNION ALL
SELECT 48463, 47969 UNION ALL
SELECT 48463, 48460 UNION ALL
SELECT 52149, 14553 UNION ALL
SELECT 52149, 34336 UNION ALL
SELECT 52149, 39352 UNION ALL
SELECT 52149, 39776 UNION ALL
SELECT 52149, 47833 UNION ALL
SELECT 52149, 47969 UNION ALL
SELECT 52149, 48460 UNION ALL
SELECT 52149, 48463 UNION ALL
SELECT 54497, 14553 UNION ALL
SELECT 54497, 34336 UNION ALL
SELECT 54497, 39352 UNION ALL
SELECT 54497, 39776 UNION ALL
SELECT 54497, 47833 UNION ALL
SELECT 54497, 47969 UNION ALL
SELECT 54497, 48460 UNION ALL
SELECT 54497, 48463 UNION ALL
SELECT 54497, 52149 UNION ALL
SELECT 66316, 14553 UNION ALL
SELECT 66316, 34336 UNION ALL
SELECT 66316, 39352 UNION ALL
SELECT 66316, 47833 UNION ALL
SELECT 66316, 47969 UNION ALL
SELECT 66316, 48460 UNION ALL
SELECT 66316, 48463 UNION ALL
SELECT 66316, 48819 UNION ALL
SELECT 66316, 52149 UNION ALL
SELECT 66316, 54497 UNION ALL
SELECT 74054, 66316 UNION ALL
SELECT 77722, 53333
declare @sortedpairs table (retained_id int, dropped_id int)
insert into @sortedpairs (retained_id,dropped_id )
select retained_id ,dropped_id from @pairs where retained_id <dropped_id
insert into @sortedpairs (retained_id,dropped_id )
select dropped_id ,retained_id from @pairs where retained_id >dropped_id
insert into @sortedpairs (retained_ID,dropped_ID)
select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1
inner join @sortedpairs sp2
on sp1.dropped_ID=sp2.dropped_ID
and sp1.retained_ID<>sp2.retained_ID
and sp1.retained_ID<sp2.retained_ID
insert into @sortedpairs (dropped_ID,retained_ID)
select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1
inner join @sortedpairs sp2
on sp1.dropped_ID=sp2.dropped_ID
and sp1.retained_ID<>sp2.retained_ID
and sp1.retained_ID>sp2.retained_ID
while (@@rowcount<>0)
begin
update tab1 set tab1.retained_ID=tab2.retained_ID
from @sortedpairs tab1
inner join @sortedpairs tab2
on tab2.dropped_ID=tab1.retained_ID
end
select * from @sortedpairs
group by retained_ID,dropped_ID
order by 1
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
May 13, 2013 at 10:36 am
I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).
Just wondering, how many rows do you end up with in your working table? Taking a quick glance at your code, I'd imagine that as the number of rows increases in your source table, the number of rows in your working table must go up exponentially.
May 13, 2013 at 10:53 am
Join the club! I also have a number of deadlines looming so don't have the luxury to spend too much time on this.
Record linkage is a pain in the backside especially when you have changing spec from week to week!
Yes, the number of rows does increase considerably so although my multi-step method works, I know it's going to suffer performance issues as I try it with larger record sets.
For now though, I've tested it on a sample of 80k rows and my entire process from start to finish takes roughly 2 minute (this includes a dozen CLR TVFs) so I'm quite happy with it.
and yes, I can almost visualise the recursive CTE solution but as I said, I just don't have the time to mess around with it at the moment.
Thanks for your help so far, I do appreciate it.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
May 15, 2013 at 2:43 am
Cadavre (5/13/2013)
I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).
Dwain who?
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
May 15, 2013 at 6:22 am
Sorry to disappoint you guys but I wasn't able to come up with anything.
My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! π
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
May 15, 2013 at 6:26 am
dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! π
I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
May 15, 2013 at 8:44 am
ChrisM@Work (5/15/2013)
dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! π
I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.
I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.
Thanks for taking a look Dwain, be nice to see where your efforts took you.
May 15, 2013 at 6:16 pm
Cadavre (5/15/2013)
Thanks for taking a look Dwain, be nice to see where your efforts took you.
Very well. For what it's worth:
DECLARE @pairs TABLE (dropped_id INT, retained_id INT)
INSERT INTO @pairs
--select 15799, 8388 union all
--select 52941,8388 union all
--select 52941, 15799 union all
--select 80833, 8388 union all
--select 80833, 15799
SELECT 34336, 14553 UNION ALL
SELECT 39352, 14553 UNION ALL
SELECT 39352, 34336 UNION ALL
SELECT 39776, 34123 UNION ALL
SELECT 47833, 14553 UNION ALL
SELECT 47833, 34336 UNION ALL
SELECT 47833, 39352 UNION ALL
SELECT 47969, 14553 UNION ALL
SELECT 47969, 34336 UNION ALL
SELECT 47969, 39352 UNION ALL
SELECT 47969, 47833 UNION ALL
SELECT 48460, 14553 UNION ALL
SELECT 48460, 34336 UNION ALL
SELECT 48460, 39352 UNION ALL
SELECT 48460, 47833 UNION ALL
SELECT 48460, 47969 UNION ALL
SELECT 48463, 14553 UNION ALL
SELECT 48463, 34336 UNION ALL
SELECT 48463, 39352 UNION ALL
SELECT 48463, 47833 UNION ALL
SELECT 48463, 47969 UNION ALL
SELECT 48463, 48460 UNION ALL
SELECT 52149, 14553 UNION ALL
SELECT 52149, 34336 UNION ALL
SELECT 52149, 39352 UNION ALL
SELECT 52149, 39776 UNION ALL
SELECT 52149, 47833 UNION ALL
SELECT 52149, 47969 UNION ALL
SELECT 52149, 48460 UNION ALL
SELECT 52149, 48463 UNION ALL
SELECT 54497, 14553 UNION ALL
SELECT 54497, 34336 UNION ALL
SELECT 54497, 39352 UNION ALL
SELECT 54497, 39776 UNION ALL
SELECT 54497, 47833 UNION ALL
SELECT 54497, 47969 UNION ALL
SELECT 54497, 48460 UNION ALL
SELECT 54497, 48463 UNION ALL
SELECT 54497, 52149 UNION ALL
SELECT 66316, 14553 UNION ALL
SELECT 66316, 34336 UNION ALL
SELECT 66316, 39352 UNION ALL
SELECT 66316, 47833 UNION ALL
SELECT 66316, 47969 UNION ALL
SELECT 66316, 48460 UNION ALL
SELECT 66316, 48463 UNION ALL
SELECT 66316, 48819 UNION ALL
SELECT 66316, 52149 UNION ALL
SELECT 66316, 54497 UNION ALL
SELECT 74054, 66316 UNION ALL
SELECT 77722, 53333
DECLARE @pairs2 TABLE (dropped_id INT, retained_id INT)
DECLARE @rows INT
INSERT INTO @pairs2
SELECT dropped_id=CASE WHEN dropped_id < retained_id THEN dropped_id ELSE retained_id END
,retained_id=CASE WHEN dropped_id < retained_id THEN retained_id ELSE dropped_id END
FROM @pairs
SELECT @rows = @@ROWCOUNT
DELETE FROM @pairs
INSERT INTO @pairs
SELECT dropped_id, retained_id
FROM @pairs2
WHERE dropped_id NOT IN (SELECT retained_id FROM @pairs2)
SELECT @rows = @@ROWCOUNT
WHILE @rows <> 0
BEGIN
INSERT INTO @pairs
--OUTPUT INSERTED.dropped_id, INSERTED.retained_id
SELECT a.dropped_id, b.retained_id
FROM @pairs a
INNER JOIN @pairs2 b ON a.retained_id = b.dropped_id
EXCEPT
SELECT dropped_id, retained_id
FROM @pairs
SELECT @rows = @@ROWCOUNT
END
INSERT INTO @pairs
SELECT a=MIN(dropped_id), b=MAX(dropped_id)
FROM @pairs
GROUP BY retained_id
HAVING COUNT(*) > 1
;WITH Pairs AS (
SELECT dropped_id, a.retained_id
,n=ROW_NUMBER() OVER (PARTITION BY a.retained_id ORDER BY dropped_id)
,b
FROM @pairs a
INNER JOIN (
SELECT retained_id, a=MAX(dropped_id), b=MIN(dropped_id)
FROM @pairs
GROUP BY retained_id
HAVING COUNT(*) > 1) b ON a.dropped_id = b.a OR a.retained_id = b.retained_id)
UPDATE a
SET dropped_id=b
FROM Pairs a
WHERE n > 1
SELECT DISTINCT *
FROM @pairs
ORDER BY dropped_id, retained_id
I tweaked it this morning and I think it may now be producing the correct rows for this case but no guarantee it always will.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply