Another grouping problem

Question

Another grouping problem

Abu Dina

SSChampion

Points: 14155
More actions
May 10, 2013 at 5:34 am

#293770

I have found a bug in my duplicate macthing application and I believe it's with the way I group my records at the end of the matching process.
Consider the following example (details are not real!)
52941 matches 8388 and 15799 because mobile number and name is the same
80833 matches 8388 and 15799 because name and address match
so this is same as saying 80833 matches 52941, 8388 AND 15799
So based on the second result set, I would like to perform a grouping of the outcome to look like this:
NOTE, I picked the highest value ID as my Base or Group ID.
Any suggestions?
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Cadavre SSC-Forever Points: 41690 More actions · Answer 1

You've been around long enough to know to provide consumable sample data.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Abu Dina SSChampion Points: 14155 More actions · Answer 2

Here you go:

create table SCC (Master_ID INT, Duplicate_ID INT)

INSERT INTO SCC

SELECT 52941, 8388 UNION ALL

SELECT 52941, 15799 UNION ALL

SELECT 80833, 8388 UNION ALL

SELECT 80833, 15799

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Abu Dina SSChampion Points: 14155 More actions · Answer 3

My solution so far:

select DISTINCt c.baseID, CASE WHEN Master_ID = c.BaseID THEN a.Duplicate_ID else Master_ID END

from SCC as a

cross apply (select MAX(master_ID) AS BaseID

from SCC as b

where b.duplicate_id = a.duplicate_id) as c

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Abu Dina SSChampion Points: 14155 More actions · Answer 4

Unfortunately my solution isn't working as expected. Anyone else with suggestions?

Here is some sample data:

with SCC_CTE (Retained_ID, Dropped_ID)

AS (Select 15799, 8388 union all

select 52941,8388 union all

select 52941, 15799 union all

select 80833, 8388 union all

select 80833, 15799)

select * from SCC_CTE

And here is the expected out output:

;with Result_CTE (Retained_ID, Dropped_ID)

AS (Select 80833, 8388 union all

select 80833,15799 union all

select 80833, 52941)

select * from Result_CTE

Thanks in advance.

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Cadavre SSC-Forever Points: 41690 More actions · Answer 5

Sorry, I'd have replied sooner but things have been going on at work that take priority.

What you need is a recursive check through the list. So, I've knocked up a really quick demonstration, but I'd suggest looking into it yourself as I'm convinced that this is not the optimal query. Unfortunately, I'm a bit busy to dig in to it any further.

IF object_id('tempdb..#SCC') IS NOT NULL

BEGIN

DROP TABLE #SCC;

END;

--SAMPLE DATA

SELECT Retained_ID, Dropped_ID

INTO #SCC

FROM (VALUES(15799, 8388),(52941, 8388),(52941, 15799),(80833, 8388),(80833, 15799)

)a(Retained_ID, Dropped_ID);

-- EXPECTED RESULTS

SELECT Retained_ID, Dropped_ID

FROM (VALUES(80833, 8388),(80833, 15799),(80833, 52941))a(Retained_ID, Dropped_ID);

-- RECURSIVE QUERY

WITH CTE AS (

SELECT Retained_ID, Dropped_ID

FROM #SCC

UNION ALL

SELECT a.Retained_ID, CASE WHEN a.Retained_ID > b.Retained_ID THEN b.Retained_ID ELSE a.Dropped_ID END

FROM #SCC a

INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)

SELECT DISTINCT ca.Retained_ID, ca.Dropped_ID

FROM CTE q

CROSS APPLY (SELECT TOP 1 a.Retained_ID, a.Dropped_ID

FROM CTE a

WHERE a.Dropped_ID = q.Dropped_ID

ORDER BY a.Retained_ID DESC

) ca;

--EDIT--

Almost immediately after posting, I realised that the recursive query could just as easily be written like this: -

-- RECURSIVE QUERY

WITH CTE AS (

SELECT Retained_ID, Dropped_ID

FROM #SCC

UNION ALL

SELECT a.Retained_ID, b.Retained_ID

FROM #SCC a

INNER JOIN CTE b ON a.Dropped_ID = b.Dropped_ID AND a.Retained_ID > b.Retained_ID)

SELECT MAX(Retained_ID) AS Retained_ID, Dropped_ID

FROM CTE

GROUP BY Dropped_ID;

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 6

Craig - very promising, but try adding these to the sample data set:

UNION ALL

SELECT 2426, 2341 UNION ALL

SELECT 2341, 1200

Abu Dina - are these two additional rows valid data, and if so, can you extend the data set please? I think it will help folks if there are more groups and perhaps a few rows where there are no associations at all.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Abu Dina SSChampion Points: 14155 More actions · Answer 7

Thanks for all your help so far.

This is one way I've tried and it works I think but I'm sure you clever folks can come up with a better way :

declare @pairs table (dropped_id int, retained_id int)

insert into @pairs

SELECT 34336, 14553 UNION ALL

SELECT 39352, 14553 UNION ALL

SELECT 39352, 34336 UNION ALL

SELECT 39776, 34123 UNION ALL

SELECT 47833, 14553 UNION ALL

SELECT 47833, 34336 UNION ALL

SELECT 47833, 39352 UNION ALL

SELECT 47969, 14553 UNION ALL

SELECT 47969, 34336 UNION ALL

SELECT 47969, 39352 UNION ALL

SELECT 47969, 47833 UNION ALL

SELECT 48460, 14553 UNION ALL

SELECT 48460, 34336 UNION ALL

SELECT 48460, 39352 UNION ALL

SELECT 48460, 47833 UNION ALL

SELECT 48460, 47969 UNION ALL

SELECT 48463, 14553 UNION ALL

SELECT 48463, 34336 UNION ALL

SELECT 48463, 39352 UNION ALL

SELECT 48463, 47833 UNION ALL

SELECT 48463, 47969 UNION ALL

SELECT 48463, 48460 UNION ALL

SELECT 52149, 14553 UNION ALL

SELECT 52149, 34336 UNION ALL

SELECT 52149, 39352 UNION ALL

SELECT 52149, 39776 UNION ALL

SELECT 52149, 47833 UNION ALL

SELECT 52149, 47969 UNION ALL

SELECT 52149, 48460 UNION ALL

SELECT 52149, 48463 UNION ALL

SELECT 54497, 14553 UNION ALL

SELECT 54497, 34336 UNION ALL

SELECT 54497, 39352 UNION ALL

SELECT 54497, 39776 UNION ALL

SELECT 54497, 47833 UNION ALL

SELECT 54497, 47969 UNION ALL

SELECT 54497, 48460 UNION ALL

SELECT 54497, 48463 UNION ALL

SELECT 54497, 52149 UNION ALL

SELECT 66316, 14553 UNION ALL

SELECT 66316, 34336 UNION ALL

SELECT 66316, 39352 UNION ALL

SELECT 66316, 47833 UNION ALL

SELECT 66316, 47969 UNION ALL

SELECT 66316, 48460 UNION ALL

SELECT 66316, 48463 UNION ALL

SELECT 66316, 48819 UNION ALL

SELECT 66316, 52149 UNION ALL

SELECT 66316, 54497 UNION ALL

SELECT 74054, 66316 UNION ALL

SELECT 77722, 53333

declare @sortedpairs table (retained_id int, dropped_id int)

insert into @sortedpairs (retained_id,dropped_id )

select retained_id ,dropped_id from @pairs where retained_id <dropped_id

insert into @sortedpairs (retained_id,dropped_id )

select dropped_id ,retained_id from @pairs where retained_id >dropped_id

insert into @sortedpairs (retained_ID,dropped_ID)

select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1

inner join @sortedpairs sp2

on sp1.dropped_ID=sp2.dropped_ID

and sp1.retained_ID<>sp2.retained_ID

and sp1.retained_ID<sp2.retained_ID

insert into @sortedpairs (dropped_ID,retained_ID)

select sp1.retained_ID,sp2.retained_ID from @sortedpairs sp1

inner join @sortedpairs sp2

on sp1.dropped_ID=sp2.dropped_ID

and sp1.retained_ID<>sp2.retained_ID

and sp1.retained_ID>sp2.retained_ID

while (@@rowcount<>0)

begin

update tab1 set tab1.retained_ID=tab2.retained_ID

from @sortedpairs tab1

inner join @sortedpairs tab2

on tab2.dropped_ID=tab1.retained_ID

end

select * from @sortedpairs

group by retained_ID,dropped_ID

order by 1

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Cadavre SSC-Forever Points: 41690 More actions · Answer 8

I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).

Just wondering, how many rows do you end up with in your working table? Taking a quick glance at your code, I'd imagine that as the number of rows increases in your source table, the number of rows in your working table must go up exponentially.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Abu Dina SSChampion Points: 14155 More actions · Answer 9

Join the club! I also have a number of deadlines looming so don't have the luxury to spend too much time on this.

Record linkage is a pain in the backside especially when you have changing spec from week to week!

Yes, the number of rows does increase considerably so although my multi-step method works, I know it's going to suffer performance issues as I try it with larger record sets.

For now though, I've tested it on a sample of 80k rows and my entire process from start to finish takes roughly 2 minute (this includes a dozen CLR TVFs) so I'm quite happy with it.

and yes, I can almost visualise the recursive CTE solution but as I said, I just don't have the time to mess around with it at the moment.

Thanks for your help so far, I do appreciate it.

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

Dwain Camps SSC Guru Points: 86978 More actions · Answer 10

Cadavre (5/13/2013)
I really don't have time to dig into this, work is a bet hectic. When I've got some spare time, I'll try to remember to revisit it. I'm fairly certain that this is a recursive CTE problem, so Dwain may well be your man (he loves recursive CTEs).

Dwain who?

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 11

Sorry to disappoint you guys but I wasn't able to come up with anything.

My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.

Of course, I could be wrong as I'm a bit out of practice writing rCTEs! 🙂

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 12

dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.
My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! 🙂

I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Cadavre SSC-Forever Points: 41690 More actions · Answer 13

ChrisM@Work (5/15/2013)
dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.
My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! 🙂
I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.

I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.

Thanks for taking a look Dwain, be nice to see where your efforts took you.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Dwain Camps SSC Guru Points: 86978 More actions · Answer 14

Cadavre (5/15/2013)
Thanks for taking a look Dwain, be nice to see where your efforts took you.

Very well. For what it's worth:

DECLARE @pairs TABLE (dropped_id INT, retained_id INT)

INSERT INTO @pairs

--select 15799, 8388 union all

--select 52941,8388 union all

--select 52941, 15799 union all

--select 80833, 8388 union all

--select 80833, 15799

SELECT 34336, 14553 UNION ALL

SELECT 39352, 14553 UNION ALL

SELECT 39352, 34336 UNION ALL

SELECT 39776, 34123 UNION ALL

SELECT 47833, 14553 UNION ALL

SELECT 47833, 34336 UNION ALL

SELECT 47833, 39352 UNION ALL

SELECT 47969, 14553 UNION ALL

SELECT 47969, 34336 UNION ALL

SELECT 47969, 39352 UNION ALL

SELECT 47969, 47833 UNION ALL

SELECT 48460, 14553 UNION ALL

SELECT 48460, 34336 UNION ALL

SELECT 48460, 39352 UNION ALL

SELECT 48460, 47833 UNION ALL

SELECT 48460, 47969 UNION ALL

SELECT 48463, 14553 UNION ALL

SELECT 48463, 34336 UNION ALL

SELECT 48463, 39352 UNION ALL

SELECT 48463, 47833 UNION ALL

SELECT 48463, 47969 UNION ALL

SELECT 48463, 48460 UNION ALL

SELECT 52149, 14553 UNION ALL

SELECT 52149, 34336 UNION ALL

SELECT 52149, 39352 UNION ALL

SELECT 52149, 39776 UNION ALL

SELECT 52149, 47833 UNION ALL

SELECT 52149, 47969 UNION ALL

SELECT 52149, 48460 UNION ALL

SELECT 52149, 48463 UNION ALL

SELECT 54497, 14553 UNION ALL

SELECT 54497, 34336 UNION ALL

SELECT 54497, 39352 UNION ALL

SELECT 54497, 39776 UNION ALL

SELECT 54497, 47833 UNION ALL

SELECT 54497, 47969 UNION ALL

SELECT 54497, 48460 UNION ALL

SELECT 54497, 48463 UNION ALL

SELECT 54497, 52149 UNION ALL

SELECT 66316, 14553 UNION ALL

SELECT 66316, 34336 UNION ALL

SELECT 66316, 39352 UNION ALL

SELECT 66316, 47833 UNION ALL

SELECT 66316, 47969 UNION ALL

SELECT 66316, 48460 UNION ALL

SELECT 66316, 48463 UNION ALL

SELECT 66316, 48819 UNION ALL

SELECT 66316, 52149 UNION ALL

SELECT 66316, 54497 UNION ALL

SELECT 74054, 66316 UNION ALL

SELECT 77722, 53333

DECLARE @pairs2 TABLE (dropped_id INT, retained_id INT)

DECLARE @rows INT

INSERT INTO @pairs2

SELECT dropped_id=CASE WHEN dropped_id < retained_id THEN dropped_id ELSE retained_id END

,retained_id=CASE WHEN dropped_id < retained_id THEN retained_id ELSE dropped_id END

FROM @pairs

SELECT @rows = @@ROWCOUNT

DELETE FROM @pairs

INSERT INTO @pairs

SELECT dropped_id, retained_id

FROM @pairs2

WHERE dropped_id NOT IN (SELECT retained_id FROM @pairs2)

SELECT @rows = @@ROWCOUNT

WHILE @rows <> 0

BEGIN

INSERT INTO @pairs

--OUTPUT INSERTED.dropped_id, INSERTED.retained_id

SELECT a.dropped_id, b.retained_id

FROM @pairs a

INNER JOIN @pairs2 b ON a.retained_id = b.dropped_id

EXCEPT

SELECT dropped_id, retained_id

FROM @pairs

SELECT @rows = @@ROWCOUNT

END

INSERT INTO @pairs

SELECT a=MIN(dropped_id), b=MAX(dropped_id)

FROM @pairs

GROUP BY retained_id

HAVING COUNT(*) > 1

;WITH Pairs AS (

SELECT dropped_id, a.retained_id

,n=ROW_NUMBER() OVER (PARTITION BY a.retained_id ORDER BY dropped_id)

,b

FROM @pairs a

INNER JOIN (

SELECT retained_id, a=MAX(dropped_id), b=MIN(dropped_id)

FROM @pairs

GROUP BY retained_id

HAVING COUNT(*) > 1) b ON a.dropped_id = b.a OR a.retained_id = b.retained_id)

UPDATE a

SET dropped_id=b

FROM Pairs a

WHERE n > 1

SELECT DISTINCT *

FROM @pairs

ORDER BY dropped_id, retained_id

I tweaked it this morning and I think it may now be producing the correct rows for this case but no guarantee it always will.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St