July 26, 2015 at 6:28 am
Comments posted to this topic are about the item Introducing the Set-based Loop
July 27, 2015 at 1:46 am
Hi Luis,
Interesting technique, thanks for sharing.
I must admit I'm surprised the CTE results were so much larger.
I'm trying to understand the rules that exist in the sample data using the test script provided.
As I understand it the Groups table contains all clients, their current group ID and their associated current membership ID for that group, yes?
You say that a client can belong to multiple groups but does that mean that a client can belong to multiple groups at the same time or only at different times (the latter represented by multiple chained grouphistory records)?
Each client will have at least one entry in GroupsHistory, and if they have moved groups the PreviousGroupId and PreviousMemberId fields will be filled in. This forms a self join to the client's previous history record. When we locate the oldest history record (i.e. the previous group / member id is null) we want to take these oldest group/member ids and record them as the "originals" on the Groups table. Is that right?
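In other words (a rough sketch of how I'm reading the rules; not code from the article):
--The self join I'm picturing: each history row linked back to the client's previous record
SELECT cur.MemberId, cur.GroupId,
       prev.MemberId AS PrevMemberId, prev.GroupId AS PrevGroupId
FROM dbo.GroupsHistory cur
LEFT JOIN dbo.GroupsHistory prev
    ON prev.GroupId = cur.PreviousGroupId
   AND prev.MemberId = cur.PreviousMemberId;
--...and the "oldest" records, whose ids should end up as the originals on dbo.Groups
SELECT h.MemberId, h.GroupId
FROM dbo.GroupsHistory h
WHERE h.PreviousGroupId IS NULL
  AND h.PreviousMemberId IS NULL;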
If I am understanding correctly I would have expected the following assumptions to hold true:
SELECT COUNT(DISTINCT [ClientId]), COUNT(*)
FROM [dbo].[Groups]
--Expected result: both values match
--Actual result: both values match
SELECT COUNT(DISTINCT G.[ClientId]), COUNT(*)
FROM dbo.Groups G
JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId AND g.MemberId = h.MemberId
--Expected result: both values match
--Actual result: both values do not match. Indicates many-to-many.
SELECT MemberId,GroupId, COUNT(ClientId)
FROM [dbo].[Groups]
GROUP BY MemberId,GroupId
HAVING COUNT(*) > 1
--Expected result: no results
--Actual result: 42K results. Also indicates many-to-many
How can the original member/group ids be deterministically identified if these apparent many-to-many relationships exist in the data? Am I missing a rule or something?
(The answer may well be that I'm missing something due to my current state of illness / lack of sleep! 😉 )
Thanks,
Sam
July 27, 2015 at 8:14 am
Hi Sam,
Thank you for reading the article and commenting. You're absolutely right, there's an error in the sample data generation. There shouldn't be a many-to-many relationship in this example. Ironically, a very similar problem was present in the real data, but I didn't think it was worth including in the article.
Here's the corrected sample data generator, with a change to the loop that inserts levels 2 to 6.
IF OBJECT_ID( 'dbo.GroupsHistory') IS NOT NULL --Prevent errors on reruns
DROP TABLE dbo.GroupsHistory;
CREATE TABLE dbo.GroupsHistory( --Create the table with relevant columns
MemberId int NOT NULL,
GroupId int NOT NULL,
PreviousMemberId int NULL,
PreviousGroupId int NULL,
StartDate date NOT NULL,
Level int NOT NULL --This column wasn't available on the original problem. It's here to help with test data generation.
);
IF OBJECT_ID( 'dbo.Groups') IS NOT NULL --Prevent errors on reruns
DROP TABLE dbo.Groups;
CREATE TABLE dbo.Groups( --Create the table with relevant columns
ClientId int IDENTITY NOT NULL,
MemberId int NOT NULL,
GroupId int NOT NULL,
OriginalGroupId int NULL,
OriginalMemberId int NULL
);
--Insert first level of test data
WITH E(n) AS(
SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
TestData AS(
SELECT ABS( CHECKSUM(NEWID())) % 100000 CurrentGroupId,
DATEADD( dd, ABS( CHECKSUM(NEWID())) % 365, '20100101') StartDate
FROM E a, E b, E c, E d, E f, E g
)
INSERT INTO GroupsHistory(MemberId, GroupId, StartDate, Level)
SELECT ROW_NUMBER() OVER(PARTITION BY CurrentGroupId ORDER BY (SELECT NULL)),
CurrentGroupId,
StartDate,
1
FROM TestData;
--Generate the next levels for test data
DECLARE @Level int = 1;
WHILE @Level < 6
BEGIN
WITH SubTestData AS(
SELECT TOP 50 percent
(ABS( CHECKSUM(NEWID())) % 100000) + ( @Level * 100000) AS GroupId,
MemberId AS PreviousMemberId,
GroupId AS PreviousGroupId,
DATEADD( dd, ABS( CHECKSUM(NEWID())) % 365, StartDate) AS StartDate,
@Level + 1 AS Level
FROM dbo.GroupsHistory
WHERE Level = @Level
ORDER BY NEWID()
)
INSERT INTO GroupsHistory(MemberId, GroupId, PreviousMemberId, PreviousGroupId, StartDate, Level)
SELECT TOP 50 percent
ROW_NUMBER() OVER( PARTITION BY GroupId ORDER BY (SELECT NULL)),
GroupId,
PreviousMemberId,
PreviousGroupId,
StartDate,
Level
FROM SubTestData;
SET @Level += 1;
END;
INSERT INTO dbo.Groups(GroupId, MemberId)
SELECT GroupId, MemberId
FROM dbo.GroupsHistory g
WHERE NOT EXISTS(
SELECT 1
FROM dbo.GroupsHistory h
WHERE g.GroupId = h.PreviousGroupId
AND g.MemberId = h.PreviousMemberId);
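For reference, with this corrected generator a quick sanity check along these lines should come back empty, since each (MemberId, GroupId) pair should now appear only once in the history and the join to dbo.Groups can no longer multiply rows:
--Sanity check: expected to return no rows with the corrected generator
SELECT MemberId, GroupId, COUNT(*) AS Occurrences
FROM dbo.GroupsHistory
GROUP BY MemberId, GroupId
HAVING COUNT(*) > 1;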
This changes the performance of the rCTE, making it faster than the RBAR approach when indexes are used. However, I'm curious about why the performance of the RBAR solution was affected negatively.
These timings were made on a different computer and shouldn't be compared directly to the ones posted on the article.
Iteration With_Index RBAR_Seconds rCTE_seconds While_seconds
-------------------- ---------- ------------ ------------ -------------
1 0 NULL 86 11
2 0 NULL 75 10
3 0 NULL 77 11
4 0 NULL 76 11
5 0 NULL 74 12
6 0 NULL 77 11
7 0 NULL 80 11
8 0 NULL 73 12
9 0 NULL 79 11
10 0 NULL 80 12
NULL 0 NULL 77 11
1 1 71 40 11
2 1 71 42 9
3 1 76 42 10
4 1 70 42 10
5 1 69 44 9
6 1 74 41 10
7 1 69 43 10
8 1 70 43 12
9 1 74 43 10
10 1 75 42 10
NULL 1 71 42 10
July 27, 2015 at 9:28 am
Terrific article, and thanks for it! It nicely shows how plan optimisation influences run-times and efficiency.
July 27, 2015 at 9:45 am
GoofyGuy (7/27/2015)
Terrific article, and thanks for it! It nicely shows how plan optimisation influences run-times and efficiency.
Thank you for the feedback. 🙂
July 27, 2015 at 10:11 am
One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations. Maybe it should be called SBAS. 😀
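To make the distinction concrete, here's a minimal sketch (not the article's code; the work inside each loop body is just a placeholder) of one row per iteration versus one set per iteration:
--RBAR: one client per iteration (a cursor, or a WHILE loop over single rows)
DECLARE @ClientId int;
DECLARE client_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT ClientId FROM dbo.Groups;
OPEN client_cursor;
FETCH NEXT FROM client_cursor INTO @ClientId;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Groups SET OriginalGroupId = GroupId --placeholder per-row work
    WHERE ClientId = @ClientId;
    FETCH NEXT FROM client_cursor INTO @ClientId;
END;
CLOSE client_cursor;
DEALLOCATE client_cursor;
--Set-based loop (SBAS): one whole set per iteration; each pass moves every client
--that still has a previous record one more level up the chain.
DECLARE @Rows int = 1;
WHILE @Rows > 0
BEGIN
    UPDATE g SET
        OriginalGroupId = h.PreviousGroupId,
        OriginalMemberId = h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h
        ON h.GroupId = COALESCE(g.OriginalGroupId, g.GroupId)
       AND h.MemberId = COALESCE(g.OriginalMemberId, g.MemberId)
    WHERE h.PreviousGroupId IS NOT NULL;
    SET @Rows = @@ROWCOUNT;
END;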
I have used this method many a time to perform various tasks - and it is very useful in those cases. Even the use of a cursor is not always EVIL. It's a matter of having tested and documented as you instructed.
Nice job!
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
July 27, 2015 at 10:17 am
SQLRNNR (7/27/2015)
One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations.
That was exactly the point of this article, to differentiate RBAR from set-based loops.
SQLRNNR (7/27/2015)
Nice job!
Thank you.
July 27, 2015 at 10:24 am
Luis Cazares (7/27/2015)
SQLRNNR (7/27/2015)
One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations.
That was exactly the point of this article, to differentiate RBAR from set-based loops.
Somehow I thought you were referring to the last three solutions as RBAR as well. Maybe it was the wording. Anywho...
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
July 27, 2015 at 2:11 pm
Good work indeed Luis, nicely done!
😎
July 27, 2015 at 3:50 pm
I also use this technique quite a bit. Besides being faster than the other methods I think it's easier to understand.
July 27, 2015 at 11:16 pm
Hi Luis,
It was an interesting article, highlighting how RBAR and set-based data traversal are perceived in SQL. The main thing that stood out to me is the indexing on the table, which makes traversing the data quicker because each lookup can be satisfied from the nonclustered indexes. So, ideally speaking, would any large, indexed data set support quick data access even when RBAR is implemented?
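By indexing I mean something along these lines, just as an illustration of the kind of index that turns each lookup into a seek (not necessarily the exact index used in the article):
CREATE NONCLUSTERED INDEX IX_GroupsHistory_Group_Member --illustrative name
    ON dbo.GroupsHistory (GroupId, MemberId)
    INCLUDE (PreviousGroupId, PreviousMemberId);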
Correct me if I am wrong.
Thanks
With warm Regards
Aswin Sankar M
July 28, 2015 at 3:37 am
Luis Cazares (7/27/2015)
Hi Sam, Thank you for reading the article and commenting. You're absolutely right, there's an error in the sample data generation. There shouldn't be a many-to-many relationship in this example. Ironically, a very similar problem was present in the real data, but I didn't think it was worth including in the article.
...
This changes the performance of the rCTE, making it faster than the RBAR approach when indexes are used. However, I'm curious about why the performance of the RBAR solution was affected negatively.
These timings were made on a different computer and shouldn't be compared directly to the ones posted on the article.
Hi Luis,
Thanks for the quick update to the test data generator!
I couldn't say why the RBAR solution was negatively affected (though I wonder if some index fragmentation might have crept in), but I have made an attempt at improving on the CTE approach.
I've run out of spare time to be more thorough, but the best I could come up with (using a CTE to do the recursion) took 18 seconds to execute.
(This is on a modest VM I have handy, which achieved a similar 17-second result to your own tests for the SBAS approach.)
But I did resort to some trickery:
I split the process into two statements within a transaction. Using the rCTE for the recursion and followed with a simple update for the remainder.
I also used a different pair of indexes and a join hint to bring the execution time down by a couple more seconds.
(Note that I managed ~30 second execution using the rCTE for the full data set)
I note that the test data set this time around resulted in 75% of clients having no history, so this approach will slow down (relative to the other approaches) depending on the proportion of clients that actually have history. This is because the simple update is faster than the recursive step.
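(A quick way to check that skew, assuming "no history" simply means the client's current history row has no previous link:)
SELECT SUM(CASE WHEN gh.PreviousMemberId IS NULL THEN 1 ELSE 0 END) AS ClientsWithoutHistory,
       COUNT(*) AS TotalClients
FROM dbo.Groups g
JOIN dbo.GroupsHistory gh
    ON gh.MemberId = g.MemberId
   AND gh.GroupId = g.GroupId;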
I do agree that your SBAS technique is going to be the most efficient/appropriate for the requirements at hand because it avoids some otherwise unavoidable costs, I just wanted to give the CTE a fighting chance.
Here's my code:
CREATE UNIQUE CLUSTERED INDEX UCI_GroupsHistory ON dbo.GroupsHistory(MemberId ASC, GroupId ASC);
CREATE UNIQUE CLUSTERED INDEX CI_Groups ON dbo.Groups(MemberId ASC, GroupId ASC);
DECLARE
@SecondsrCTE int,
@Time datetime;
--Set the timer
SET @Time = GETDATE();
BEGIN TRAN;
WITH rCTE_TD AS (
SELECT
GH_c.MemberId AS JoinMemberId,
GH_c.GroupId AS JoinGroupId,
GH_c.MemberId AS LevelMemberId,
GH_c.GroupId AS LevelGroupId,
GH_c.PreviousMemberId,
GH_c.PreviousGroupId
FROM
dbo.GroupsHistory GH_c
WHERE EXISTS (
SELECT 1
FROM dbo.Groups G
WHERE G.MemberId = GH_c.MemberId
AND G.GroupId = GH_c.GroupId
)
AND GH_c.PreviousMemberId IS NOT NULL
UNION ALL
SELECT
R.JoinMemberId,
R.JoinGroupId,
GH_p.MemberId,
GH_p.GroupId,
GH_p.PreviousMemberId,
GH_p.PreviousGroupId
FROM
rCTE_TD R
INNER JOIN
dbo.GroupsHistory GH_p
ON R.PreviousMemberId = GH_p.MemberId
AND R.PreviousGroupId = GH_p.GroupId
)
UPDATE G
SET
G.OriginalGroupId = rCTE_TD.LevelGroupId,
G.OriginalMemberId = rCTE_TD.LevelMemberId
FROM Groups G
INNER HASH JOIN rCTE_TD
ON rCTE_TD.JoinMemberId = G.MemberId
AND rCTE_TD.JoinGroupId = G.GroupId
WHERE rCTE_TD.PreviousMemberId IS NULL --Assume these are the originals
UPDATE G
SET
G.OriginalGroupId = GH.GroupId,
G.OriginalMemberId = GH.MemberId
FROM
dbo.GroupsHistory GH
INNER JOIN
dbo.Groups G
ON G.MemberId = GH.MemberId
AND G.GroupId = GH.GroupId
AND GH.PreviousMemberId IS NULL
COMMIT TRAN;
--Get the time in seconds
SET @SecondsrCTE = DATEDIFF(ss, @Time, GETDATE());
SELECT @SecondsrCTE
If I get some more time I'll come back and post comparison results using the test harness.
Thanks,
Sam
July 28, 2015 at 3:46 am
Luis,
Why doesn't the GroupsHistory table have a ClientId column?
Why this weird hierarchy, which really isn't one? It's more like a series.
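To illustrate what I mean, a sketch that assumes a hypothetical ClientId column on GroupsHistory (the article's schema doesn't have one): finding each client's original group/member would then be a single set-based update.
UPDATE g SET
    OriginalGroupId = h.GroupId,
    OriginalMemberId = h.MemberId
FROM dbo.Groups g
JOIN dbo.GroupsHistory h
    ON h.ClientId = g.ClientId --hypothetical column, not in the article's schema
WHERE h.PreviousGroupId IS NULL
  AND h.PreviousMemberId IS NULL;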
July 28, 2015 at 5:35 am
Thanks for your great article. :-)
I had a problem with three "While" loops and was searching for a better solution.
Instead of using the "problematic solution", I now use your "Set-based Loop". Thanks for bringing this idea back to my mind. :-)
But I have some trouble with your article.
First point (it relates to the comment from sam.dahl above):
Code:
select COUNT ([Level]), MemberId, GroupId from GroupsHistory
group by MemberId, GroupId
order by 1 desc
I got some (MemberId, GroupId) combinations where the count was greater than 1. I selected one of them to understand your code.
select * from GroupsHistory where GroupId=248498 and memberid=2
MemberId GroupId PreviousMemberId PreviousGroupId StartDate  Level
2        248498  15               134193          2012-05-04 3
2        248498  1                135215          2010-12-25 3
2        248498  2                181500          2012-01-13 3
2        248498  5                138108          2011-08-21 3
2        248498  8                136476          2011-09-20 3
2        248498  6                162972          2011-08-11 3
2        248498  3                182860          2011-11-26 3
2        248498  11               172996          2011-12-25 3
select * from GroupsHistory where PreviousGroupId=248498 and PreviousMemberId=2
MemberId GroupId PreviousMemberId PreviousGroupId StartDate  Level
6        324965  2                248498          2012-03-27 4
4        324781  2                248498          2011-12-29 4
10       309789  2                248498          2011-12-16 4
3        314445  2                248498          2012-11-08 4
8        323875  2                248498          2012-08-07 4
9        308775  2                248498          2012-07-03 4
If I have a client with MemberId = 6 and GroupId = 324965, the next level down is MemberId = 2 and GroupId = 248498.
But which previous-previous MemberId and previous-previous GroupId do "the problematic solution" and the "Set-based Loop" end up with?
Second point: instead of using the table "Groups" in the "Set-based Loop", I created a table Groups2. Both solutions should give the same result set. But when I try this
select * from groups2 g2 inner join groups g1 on g1.clientid =g2.clientId where g1.OriginalMemberId<>g2.OriginalMemberId
I got 116,000 records, so there are some problems in the data. I can pin down two of them (sbl = Set-based Loop; ps = problematic solution):
1. Result set for client 9015:
         ClientId MemberId GroupId OriginalGroupId OriginalMemberId
Groups2: 9015     5        215869  56615           5
Groups:  9015     5        215869  129218          12
select * from GroupsHistory where GroupId = 215869 and memberid=5 -- 2 solutions level 3
select * from GroupsHistory where GroupId = 132273 and memberid=5 -- level 2; sbl
select * from GroupsHistory where GroupId = 129218 and memberid=12 -- level 2; ps finished
select * from GroupsHistory where GroupId = 56615 and memberid=5 -- level 1; sbl finished
select * from GroupsHistory where GroupId = 98116 and memberid=12 -- level 1; the ps never gets here (why?)
The two solutions use different branches in the history table. The problematic solution goes only one step down.
2. Result set for client 530537:
         ClientId MemberId GroupId OriginalGroupId OriginalMemberId
Groups2: 530537   8        310674  47018           5
Groups:  530537   8        310674  249062          1
select * from GroupsHistory where GroupId = 310674 and memberid=8 -- level 4
select * from GroupsHistory where GroupId = 249062 and memberid=1 -- level 3 ps finished
select * from GroupsHistory where GroupId = 148545 and memberid=5 -- 2 at level 2
select * from GroupsHistory where GroupId = 47018 and memberid=5 -- level 1 sbl finished
select * from GroupsHistory where GroupId = 75661 and memberid=5 -- level 1
The problematic solution goes only one step down; the Set-based Loop finishes in both examples at level 1.
In the problematic solution, the code that sets the variables to NULL is in the wrong position, so the loop exits too early.
Your Code:
WHILE EXISTS(
SELECT h.GroupId, h.MemberId
FROM dbo.GroupsHistory h
WHERE h.GroupId = @GroupId
AND h.MemberId = @MemberId)
BEGIN
SELECT @GroupId = PreviousGroupId,
@MemberId = PreviousMemberId
FROM GroupsHistory h
WHERE h.GroupId = @GroupId
AND h.MemberId = @MemberId;
IF @GroupId IS NULL
BREAK;
UPDATE g SET
OriginalGroupId = @GroupId,
OriginalMemberId = @MemberId
FROM Groups g
WHERE ClientId = @ClientId;
SET @GroupId = NULL; --wrong position
SET @MemberId = NULL; --wrong position
END
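One way to fix it (a sketch based only on the snippet above; the surrounding per-client code from the article isn't repeated here) is to move those resets out of the loop body, so the walk continues all the way up the chain:
WHILE EXISTS(
    SELECT h.GroupId, h.MemberId
    FROM dbo.GroupsHistory h
    WHERE h.GroupId = @GroupId
    AND h.MemberId = @MemberId)
BEGIN
    SELECT @GroupId = PreviousGroupId,
           @MemberId = PreviousMemberId
    FROM dbo.GroupsHistory h
    WHERE h.GroupId = @GroupId
    AND h.MemberId = @MemberId;
    IF @GroupId IS NULL
        BREAK;
    UPDATE g SET
        OriginalGroupId = @GroupId,
        OriginalMemberId = @MemberId
    FROM dbo.Groups g
    WHERE ClientId = @ClientId;
END
SET @GroupId = NULL; --reset only after the whole chain for this client has been walked
SET @MemberId = NULL;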
July 28, 2015 at 5:56 am
Also, the new test data gives 60,000 records for:
select * from groups2 g2 inner join groups g1 on g1.clientid =g2.clientId where g1.OriginalMemberId<>g2.OriginalMemberId