Another grouping problem

Question

Another grouping problem

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

Cadavre (5/15/2013)
ChrisM@Work (5/15/2013)
dwain.c (5/15/2013)
Sorry to disappoint you guys but I wasn't able to come up with anything.
My gut was telling me an rCTE wouldn't do it and the set-based loop I tried wouldn't resolve it out properly either.
Of course, I could be wrong as I'm a bit out of practice writing rCTEs! 🙂
I'm not surprised, Dwain. It's quite tricky. The more obvious methods like MAX() OVER() won't work when you expect them to, in the recursive part.
I haven't tried, as I said I'm pretty busy, but I thought a combination of a MAX() OVER() and a recursive CTE was going to do the business. Ah well.
Thanks for taking a look Dwain, be nice to see where your efforts took you.

There are at least two ways of approaching this problem, number chains and sets. I think the methods reviewed so far have employed number chains, so I had a play with sets.

Allocate each number pair a set identity, so the first row (ordered by whatever) has set identity = 1. Wherever either of the two members of set 1 appear elsewhere in the table, change the set identity for the pair to 1. Run again: wherever any of the (now many) members of set 1 appear elsewhere in the table in a different set, change the set id to 1. Rinse and repeat.

Here's the first attempt:

--DROP TABLE #Temp;

-- preprocess the sample data: there isn't a dupe and a master,

-- they're simply members of a set of two.

WITH SequencedData AS (

SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)

FROM SCC

)

SELECT SetNo, Duplicate_ID

INTO #Temp

FROM SequencedData

UNION ALL

SELECT SetNo, Master_ID

FROM SequencedData;

-- (102 row(s) affected)

---------------------------------------------------------------------------

-- three different ways of performing the subquery for the update.

-- three updates have to be executed with the sample data to complete.

---------------------------------------------------------------------------

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT d.SetNo, newset = MIN(newset)

FROM (

SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

FROM #Temp

) d

GROUP BY d.SetNo

HAVING MIN(newset) <> d.SetNo

) c ON c.SetNo = t.SetNo;

-- (96 row(s) affected)

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT q.SetNo, q.newset

FROM (

SELECT

d.SetNo, d.newset, rn = ROW_NUMBER() OVER(PARTITION BY d.SetNo ORDER BY d.newset)

FROM (

SELECT SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

FROM #Temp

) d

) q

WHERE rn = 1

AND q.SetNo <> q.newset

) c ON c.SetNo = t.SetNo;

-- (66 row(s) affected)

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT newset = MIN(t2.SetNo), t1.SetNo

FROM #Temp t1

INNER JOIN #Temp t2

ON t2.Duplicate_ID = t1.Duplicate_ID

AND t2.SetNo < t1.SetNo

GROUP BY t1.SetNo

) c ON c.SetNo = t.SetNo;

-- (2 row(s) affected)

---------------------------------------------------------------------------

-- Final result set

---------------------------------------------------------------------------

SELECT t.Duplicate, x.Duplicate_ID

FROM (

SELECT SetNo, Duplicate = MIN(Duplicate_ID)

FROM #Temp

GROUP BY SetNo

) t

CROSS APPLY (

SELECT DISTINCT *

FROM #Temp ti

WHERE ti.SetNo = t.SetNo

AND ti.Duplicate_ID <> t.Duplicate

) x;

Note that the subquery code for the three updates shown is logically the same and generates the same results with the sample data - I was playing with the subqueries in an attempt to find something which might be rCTE-compatible.

Three updates have to be performed, a fourth does no work. The results are the same as Abu Dina's and it's pretty darned quick.

I haven't yet discovered a rCTE to do the same process because you can't perform aggregations on the recursive part of a rCTE, which this process requires. You can chain ordinary CTE's though, and get the whole caboodle working in a single query like this:

;WITH DataSets AS (

SELECT

SetID = ROW_NUMBER() OVER(ORDER BY Master_ID, Duplicate_ID),

Master_ID, Duplicate_ID

FROM scc

),

FirstPass AS (

SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

RowAffected = 0 + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END --

FROM DataSets d

OUTER APPLY ( -- find the lowest SetID that either number belongs to

SELECT NewSetID = MIN(di.SetID)

FROM DataSets di

WHERE di.SetID < d.SetID

AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

) x

),

SecondPass AS (

SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END

FROM FirstPass d

OUTER APPLY (

SELECT NewSetID = MIN(di.SetID)

FROM FirstPass di

WHERE di.SetID < d.SetID

AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

) x

),

ThirdPass AS (

SELECT SetID = ISNULL(x.NewSetID, d.SetID), d.Master_ID, d.Duplicate_ID,

RowAffected = RowAffected + CASE WHEN x.NewSetID IS NOT NULL THEN 1 ELSE 0 END

FROM SecondPass d

OUTER APPLY (

SELECT NewSetID = MIN(di.SetID)

FROM SecondPass di

WHERE di.SetID < d.SetID

AND (di.Master_ID IN (d.Master_ID, d.Duplicate_ID) OR di.Duplicate_ID IN (d.Master_ID, d.Duplicate_ID))

) x

),

TweakedResults AS (

SELECT

SetID, Duplicate_ID, rn = ROW_NUMBER() OVER (PARTITION BY SetID ORDER BY Duplicate_ID)

FROM (

SELECT SetID, Duplicate_ID

FROM ThirdPass

UNION -- eliminate dupes

SELECT SetID, Master_ID

FROM ThirdPass

) d

)

SELECT

retained_id = t.Duplicate_ID,

dropped_id = x.Duplicate_ID

FROM TweakedResults t

CROSS APPLY (SELECT Duplicate_ID FROM TweakedResults ti WHERE ti.SetID = t.SetID AND ti.rn>1) x

WHERE t.rn = 1

- but the performance doesn't exactly shine. Have a look at the plan and you see why - the source table is read something like 8 times 😉

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Dwain Camps SSC Guru Points: 86978 More actions · Answer 2

ChrisM@Work (5/16/2013)
... the source table is read something like 8 times 😉

Only 8 times? I'd be surprised if my contribution is less than that.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Abu Dina SSChampion Points: 14155 More actions · Answer 3

This is still giving me problems! :unsure:

DECLARE @pairs TABLE (dropped_id INT, retained_id INT)

INSERT INTO @pairs

SELECT 16492, 23011 UNION ALL

SELECT 23011,24014 UNION ALL

SELECT 23011,24951 UNION ALL

SELECT 16492,24951 UNION ALL

SELECT 16492,29381 UNION ALL

SELECT 24951,29381 UNION ALL

SELECT 23011,29381 UNION ALL

SELECT 24014,66665 UNION ALL

SELECT 3344,66665

gives back: (using my method, Dwain's and chriM's)

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 4

Abu Dina (6/24/2013)
This is still giving me problems! :unsure:
DECLARE @pairs TABLE (dropped_id INT, retained_id INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665
gives back: (using my method, Dwain's and chriM's)

The last solution I posted uses a series of updates. Choose one of them and run it four times, something like this:

DECLARE @pairs TABLE (Duplicate_ID INT, Master_ID INT)

INSERT INTO @pairs

SELECT 16492, 23011 UNION ALL

SELECT 23011,24014 UNION ALL

SELECT 23011,24951 UNION ALL

SELECT 16492,24951 UNION ALL

SELECT 16492,29381 UNION ALL

SELECT 24951,29381 UNION ALL

SELECT 23011,29381 UNION ALL

SELECT 24014,66665 UNION ALL

SELECT 3344,66665;

-- preprocess the sample data: there isn't a dupe and a master,

-- they're simply members of a set of two.

IF object_id('TempDB..#Temp') IS NOT NULL DROP TABLE #Temp;

WITH SequencedData AS (

SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)

FROM @pairs

)

SELECT SetNo, Duplicate_ID

INTO #Temp

FROM SequencedData

UNION ALL

SELECT SetNo, Master_ID

FROM SequencedData;

-- (102 row(s) affected)

---------------------------------------------------------------------------

-- FOUR updates have to be executed with the sample data to complete.

---------------------------------------------------------------------------

DECLARE @RowsUpdated INT

SET @RowsUpdated = 1

WHILE @RowsUpdated > 0

BEGIN

UPDATE t SET SetNo = c.newset

FROM #Temp t

INNER JOIN (

SELECT d.SetNo, newset = MIN(newset)

FROM (

SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)

FROM #Temp

) d

GROUP BY d.SetNo

HAVING MIN(newset) <> d.SetNo

) c ON c.SetNo = t.SetNo;

SET @RowsUpdated = @@rowcount

END

---------------------------------------------------------------------------

-- Final result set

---------------------------------------------------------------------------

SELECT t.Duplicate, x.Duplicate_ID

FROM (

SELECT SetNo, Duplicate = MIN(Duplicate_ID)

FROM #Temp

GROUP BY SetNo

) t

CROSS APPLY (

SELECT DISTINCT *

FROM #Temp ti

WHERE ti.SetNo = t.SetNo

AND ti.Duplicate_ID <> t.Duplicate

) x;

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Abu Dina SSChampion Points: 14155 More actions · Answer 5

ChrisM@Work (6/24/2013)
Abu Dina (6/24/2013)
This is still giving me problems! :unsure:
DECLARE @pairs TABLE (dropped_id INT, retained_id INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665
gives back: (using my method, Dwain's and chriM's)
The last solution I posted uses a series of updates. Choose one of them and run it four times, something like this:
DECLARE @pairs TABLE (Duplicate_ID INT, Master_ID INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665;
-- preprocess the sample data: there isn't a dupe and a master,
-- they're simply members of a set of two.
IF object_id('TempDB..#Temp') IS NOT NULL DROP TABLE #Temp;
WITH SequencedData AS (
SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)
FROM @pairs
)
SELECT SetNo, Duplicate_ID
INTO #Temp
FROM SequencedData
UNION ALL
SELECT SetNo, Master_ID
FROM SequencedData;
-- (102 row(s) affected)
---------------------------------------------------------------------------
-- FOUR updates have to be executed with the sample data to complete.
---------------------------------------------------------------------------
DECLARE @RowsUpdated INT
SET @RowsUpdated = 1
WHILE @RowsUpdated > 0
BEGIN
UPDATE t SET SetNo = c.newset
FROM #Temp t
INNER JOIN (
SELECT d.SetNo, newset = MIN(newset)
FROM (
SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)
FROM #Temp
) d
GROUP BY d.SetNo
HAVING MIN(newset) <> d.SetNo
) c ON c.SetNo = t.SetNo;
SET @RowsUpdated = @@rowcount
END
---------------------------------------------------------------------------
-- Final result set
---------------------------------------------------------------------------
SELECT t.Duplicate, x.Duplicate_ID
FROM (
SELECT SetNo, Duplicate = MIN(Duplicate_ID)
FROM #Temp
GROUP BY SetNo
) t
CROSS APPLY (
SELECT DISTINCT *
FROM #Temp ti
WHERE ti.SetNo = t.SetNo
AND ti.Duplicate_ID <> t.Duplicate
) x;

Huh... I must have screwed up the code when I added it to the sproc. I swear it wasn't working before!

Anyhoo, superb work Chris. Thanks so much! That's another drink I owe you 😛

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 6

Abu Dina (6/24/2013)
ChrisM@Work (6/24/2013)
Abu Dina (6/24/2013)
This is still giving me problems! :unsure:
DECLARE @pairs TABLE (dropped_id INT, retained_id INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665
gives back: (using my method, Dwain's and chriM's)
The last solution I posted uses a series of updates. Choose one of them and run it four times, something like this:
DECLARE @pairs TABLE (Duplicate_ID INT, Master_ID INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665;
-- preprocess the sample data: there isn't a dupe and a master,
-- they're simply members of a set of two.
IF object_id('TempDB..#Temp') IS NOT NULL DROP TABLE #Temp;
WITH SequencedData AS (
SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)
FROM @pairs
)
SELECT SetNo, Duplicate_ID
INTO #Temp
FROM SequencedData
UNION ALL
SELECT SetNo, Master_ID
FROM SequencedData;
-- (102 row(s) affected)
---------------------------------------------------------------------------
-- FOUR updates have to be executed with the sample data to complete.
---------------------------------------------------------------------------
DECLARE @RowsUpdated INT
SET @RowsUpdated = 1
WHILE @RowsUpdated > 0
BEGIN
UPDATE t SET SetNo = c.newset
FROM #Temp t
INNER JOIN (
SELECT d.SetNo, newset = MIN(newset)
FROM (
SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)
FROM #Temp
) d
GROUP BY d.SetNo
HAVING MIN(newset) <> d.SetNo
) c ON c.SetNo = t.SetNo;
SET @RowsUpdated = @@rowcount
END
---------------------------------------------------------------------------
-- Final result set
---------------------------------------------------------------------------
SELECT t.Duplicate, x.Duplicate_ID
FROM (
SELECT SetNo, Duplicate = MIN(Duplicate_ID)
FROM #Temp
GROUP BY SetNo
) t
CROSS APPLY (
SELECT DISTINCT *
FROM #Temp ti
WHERE ti.SetNo = t.SetNo
AND ti.Duplicate_ID <> t.Duplicate
) x;
Huh... I must have screwed up the code when I added it to the sproc. I swear it wasn't working before!
Anyhoo, superb work Chris. Thanks so much! That's another drink I owe you 😛

Thanks for the feedback, Abu Dina. Are there enough beers for a pi$$up yet? 😀

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Abu Dina SSChampion Points: 14155 More actions · Answer 7

ChrisM@Work (6/25/2013)
Abu Dina (6/24/2013)
ChrisM@Work (6/24/2013)
Abu Dina (6/24/2013)
This is still giving me problems! :unsure:
DECLARE @pairs TABLE (dropped_id INT, retained_id INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665
gives back: (using my method, Dwain's and chriM's)
The last solution I posted uses a series of updates. Choose one of them and run it four times, something like this:
DECLARE @pairs TABLE (Duplicate_ID INT, Master_ID INT)
INSERT INTO @pairs
SELECT 16492, 23011 UNION ALL
SELECT 23011,24014 UNION ALL
SELECT 23011,24951 UNION ALL
SELECT 16492,24951 UNION ALL
SELECT 16492,29381 UNION ALL
SELECT 24951,29381 UNION ALL
SELECT 23011,29381 UNION ALL
SELECT 24014,66665 UNION ALL
SELECT 3344,66665;
-- preprocess the sample data: there isn't a dupe and a master,
-- they're simply members of a set of two.
IF object_id('TempDB..#Temp') IS NOT NULL DROP TABLE #Temp;
WITH SequencedData AS (
SELECT *, SetNo = ROW_NUMBER() OVER (ORDER BY Duplicate_ID, Master_ID)
FROM @pairs
)
SELECT SetNo, Duplicate_ID
INTO #Temp
FROM SequencedData
UNION ALL
SELECT SetNo, Master_ID
FROM SequencedData;
-- (102 row(s) affected)
---------------------------------------------------------------------------
-- FOUR updates have to be executed with the sample data to complete.
---------------------------------------------------------------------------
DECLARE @RowsUpdated INT
SET @RowsUpdated = 1
WHILE @RowsUpdated > 0
BEGIN
UPDATE t SET SetNo = c.newset
FROM #Temp t
INNER JOIN (
SELECT d.SetNo, newset = MIN(newset)
FROM (
SELECT Duplicate_ID, SetNo, newset = MIN(SetNo) OVER (PARTITION BY Duplicate_ID)
FROM #Temp
) d
GROUP BY d.SetNo
HAVING MIN(newset) <> d.SetNo
) c ON c.SetNo = t.SetNo;
SET @RowsUpdated = @@rowcount
END
---------------------------------------------------------------------------
-- Final result set
---------------------------------------------------------------------------
SELECT t.Duplicate, x.Duplicate_ID
FROM (
SELECT SetNo, Duplicate = MIN(Duplicate_ID)
FROM #Temp
GROUP BY SetNo
) t
CROSS APPLY (
SELECT DISTINCT *
FROM #Temp ti
WHERE ti.SetNo = t.SetNo
AND ti.Duplicate_ID <> t.Duplicate
) x;
Huh... I must have screwed up the code when I added it to the sproc. I swear it wasn't working before!
Anyhoo, superb work Chris. Thanks so much! That's another drink I owe you 😛
Thanks for the feedback, Abu Dina. Are there enough beers for a pi$$up yet? 😀

Haha.... plenty! I've lost count of how many time you've helped me!

---------------------------------------------------------

It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn