It’s a common assumption that WHILE loops in SQL (often used to traverse a cursor) are bad for performance. While this often seems to be true, the problem isn’t the WHILE loop itself, but rather the operations performed inside it. The performance problem comes from iterative processing, or RBAR (“Row-By-Agonizing-Row”, a “Modenism” coined by Jeff Moden). As it turns out, WHILE loops are often used by programmers with a procedural programming background and little grounding in set-based methodologies, who are also more likely to use them in conjunction with single-row, RBAR operations. Such programmers fail to understand that SQL is a declarative language and that SQL Server is a relational database engine optimized for non-sequential, set-based queries.
In this article, I’ll show the difference between a RBAR loop, a recursive CTE and a set-based loop. When the latter was applied to a real-life problem, it took a query that had run for over 25 hours without completing down to two minutes. The table had over 15 million rows and was constantly growing.
The original problem
Let me give some background information. A money-lending business requires that clients belong to a group in order to obtain a loan. Clients can belong to different groups over time, but each will be assigned to an employee based on their original group. The information available consists of a table with clients and their current group affiliation, and a table showing group movement history. There’s no way to directly identify the first group for each client, and therefore no way to obtain the correct employee for that client without traversing the hierarchy. Each solution presented here will traverse the hierarchy to determine the original group for each client, which allows us to assign the correct employee.
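To make the shape of the data concrete, here’s a hypothetical chain for a single client (the values are invented for illustration; the real tables are created in the next section):

--One client's history, read from the current group back to the original.
--The original group is the row whose Previous* columns are NULL.
SELECT *
FROM (VALUES
    (2, 300, 5,    200 ), --Current:  member 2 of group 300 came from group 200
    (5, 200, 1,    100 ), --Earlier:  member 5 of group 200 came from group 100
    (1, 100, NULL, NULL)  --Original: member 1 of group 100 has no previous group
) AS h(MemberId, GroupId, PreviousMemberId, PreviousGroupId);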
The following code will generate sample data to give you an idea of how the data was presented.
USE Test; --Be sure that you're on a safe database. You could use tempdb as well.
GO
IF OBJECT_ID( 'dbo.GroupsHistory') IS NOT NULL --Prevent errors on reruns
    DROP TABLE dbo.GroupsHistory;
CREATE TABLE dbo.GroupsHistory( --Create the table with relevant columns
    MemberId int NOT NULL,
    GroupId int NOT NULL,
    PreviousMemberId int NULL,
    PreviousGroupId int NULL,
    StartDate date NOT NULL,
    Level int NOT NULL --This column wasn't available on the original problem. It's here to help with test data generation.
);
IF OBJECT_ID( 'dbo.Groups') IS NOT NULL --Prevent errors on reruns
    DROP TABLE dbo.Groups;
CREATE TABLE dbo.Groups( --Create the table with relevant columns
    ClientId int IDENTITY NOT NULL,
    MemberId int NOT NULL,
    GroupId int NOT NULL,
    OriginalGroupId int NULL,
    OriginalMemberId int NULL
);
--Insert first level of test data
WITH E(n) AS(
    SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
TestData AS(
    SELECT ABS( CHECKSUM(NEWID())) % 100000 CurrentGroupId,
           DATEADD( dd, ABS( CHECKSUM(NEWID())) % 365, '20100101') StartDate
    FROM E a, E b, E c, E d, E f, E g
)
INSERT INTO GroupsHistory(MemberId, GroupId, StartDate, Level)
SELECT ROW_NUMBER() OVER(PARTITION BY CurrentGroupId ORDER BY (SELECT NULL)),
       CurrentGroupId,
       StartDate,
       1
FROM TestData;
--Generate the next levels for test data
DECLARE @Level int = 1;
WHILE @Level < 6
BEGIN
    INSERT INTO GroupsHistory(MemberId, GroupId, PreviousMemberId, PreviousGroupId, StartDate, Level)
    SELECT TOP 50 PERCENT
           ROW_NUMBER() OVER( PARTITION BY GroupId ORDER BY (SELECT NULL)),
           ABS( CHECKSUM(NEWID())) % (100000 / @Level) + ( @Level * 100000),
           MemberId,
           GroupId,
           DATEADD( dd, ABS( CHECKSUM(NEWID())) % 365, StartDate) StartDate,
           @Level + 1
    FROM dbo.GroupsHistory
    WHERE Level = @Level
    ORDER BY NEWID();
    SET @Level += 1;
END;
INSERT INTO dbo.Groups(GroupId, MemberId)
SELECT GroupId, MemberId
FROM dbo.GroupsHistory g
WHERE NOT EXISTS( SELECT 1 FROM GroupsHistory h
                  WHERE g.GroupId = h.PreviousGroupId
                    AND g.MemberId = h.PreviousMemberId);
This will generate between 800,000 and 900,000 clients (according to my tests) with their corresponding history, which will contain almost 2 million rows. This volume of data will allow us to evaluate performance in a noticeable way, even if it doesn’t completely imitate the original data.
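If you want to confirm what the generator produced on your machine (the numbers above are from my own runs), a couple of quick counts will do:

--Sanity check: how many clients and history rows did we actually get?
SELECT COUNT(*) AS Clients FROM dbo.Groups;
SELECT COUNT(*) AS HistoryRows, MAX(Level) AS DeepestLevel FROM dbo.GroupsHistory;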
The problematic solution
The initial solution had a cursor with a nested loop that would update each client with the previous group until there wasn’t a previous group to update. This wonderful piece of procedural code ran for 25 hours and didn’t even update half of the clients’ original groups. After it had run for over 12 hours through the night and into the following day, someone decided to ask for help.
I’ve rebuilt the code from memory, as the original code was destroyed, shredded and completely erased so it would never return again. It is a classic example of RBAR on steroids that can only crawl instead of run.
DECLARE @ClientId int, @CurrentGroupId int, @CurrentMemberId int, @GroupId int, @MemberId int;
DECLARE GroupMembers CURSOR FOR
    SELECT g.ClientId, g.GroupId, g.MemberId
    FROM dbo.Groups g
    WHERE OriginalGroupId IS NULL
       OR OriginalMemberId IS NULL;
OPEN GroupMembers;
FETCH NEXT FROM GroupMembers INTO @ClientId, @CurrentGroupId, @CurrentMemberId;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @GroupId = @CurrentGroupId;
    SET @MemberId = @CurrentMemberId;
    WHILE EXISTS( SELECT h.GroupId, h.MemberId
                  FROM dbo.GroupsHistory h
                  WHERE h.GroupId = @GroupId
                    AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;
        IF @GroupId IS NULL
            BREAK;
        UPDATE g
        SET OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM Groups g
        WHERE ClientId = @ClientId;
    END;
    FETCH NEXT FROM GroupMembers INTO @ClientId, @CurrentGroupId, @CurrentMemberId;
END;
CLOSE GroupMembers;
DEALLOCATE GroupMembers;
I’d love to tell you how long this would take to run, but I’m not sure I have enough patience to let it run to completion. I can tell you that it updates about 128 clients per minute on my laptop. Extrapolating, with over 800,000 clients at 128 per minute, it would take more than 100 hours of execution time for the code to complete. Considering that the sample data was generated in less than one minute, this approach is simply ridiculous.
So, what’s wrong with this approach?
If you say that the problem is that there’s a nested loop within the cursor, you’d be correct, but the problem goes beyond that. This approach updates the table row by row, once for every row in the history table. This means that it doesn’t matter if there were only one hundred clients; if the history table has 15 million rows, the Groups table will be updated 15 million times.
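You can estimate the damage directly from the data (a quick sanity check, not part of the original solution). The inner loop issues one single-row UPDATE per hop it walks back through the hierarchy, so the number of updates grows with the size of the history table, not with the number of clients:

--Rough count of the single-row updates the nested loop will issue:
--one per history row that has a previous group to walk back through.
SELECT COUNT(*) AS ApproxSingleRowUpdates
FROM dbo.GroupsHistory
WHERE PreviousGroupId IS NOT NULL;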
The “set-based” solution
When seeing this kind of recursive structure, one might be tempted to use a recursive CTE (rCTE) instead – and that was my first choice as well. Recursive CTEs are a powerful tool when you need to traverse a hierarchy in a single statement, so it seemed logical to use one here, because it would only update the table once. My original recursive CTE approach looked something like this:
WITH rCTE AS(
    SELECT g.ClientId, g.GroupId, g.MemberId, h.PreviousGroupId, h.PreviousMemberId
    FROM Groups g
    JOIN GroupsHistory h ON g.GroupId = h.GroupId
                        AND g.MemberId = h.MemberId
    UNION ALL
    SELECT g.ClientId, g.GroupId, g.MemberId, h.PreviousGroupId, h.PreviousMemberId
    FROM rCTE g
    JOIN GroupsHistory h ON g.PreviousGroupId = h.GroupId
                        AND g.PreviousMemberId = h.MemberId
)
UPDATE g
SET OriginalGroupId = r.GroupId,
    OriginalMemberId = r.MemberId
FROM dbo.Groups g
JOIN rCTE r ON g.ClientId = r.ClientId
WHERE r.PreviousGroupId IS NULL;
It looks simpler than the previous RBAR option and it’s a single statement. With the sample data presented, it ran in 1 minute and 30 seconds, not bad for an improvement. There was one small problem: when I ran this update against the real data, it was still running after four hours and didn’t complete. Since it was a single statement, not a single row had been updated. Even if it had completed in five hours, that would still be a long time, and something had to be done to improve the process.
The plot twist
I might have been too dramatic with the timings presented earlier for the RBAR solution. Of course, we could improve the query by adding some indexes, improving the cursor declaration instead of using the default values, and using an explicit transaction. First, I defined the cursor as static, forward-only and read-only. Then I created the following indexes (the nonclustered index covers the inner loop’s lookups against the history table):
CREATE CLUSTERED INDEX CI_Groups ON dbo.Groups(ClientId ASC);
CREATE NONCLUSTERED INDEX IX_GroupsHistory ON dbo.GroupsHistory( MemberId ASC, GroupId ASC)
    INCLUDE ( PreviousMemberId, PreviousGroupId);
I ran my tests again and got good news and bad news. The bad news was that half of the rows didn’t have the OriginalGroupId and OriginalMemberId columns updated. The fix was easy: I realized that those were the clients without history, so I added an update at the end to correct the issue, and I finally had a complete solution. The code ended up looking like this:
DECLARE @ClientId int, @CurrentGroupId int, @CurrentMemberId int, @GroupId int, @MemberId int;
BEGIN TRAN;
DECLARE GroupMembers CURSOR STATIC FORWARD_ONLY READ_ONLY FOR
    SELECT g.ClientId, g.GroupId, g.MemberId
    FROM dbo.Groups g
    WHERE OriginalGroupId IS NULL
       OR OriginalMemberId IS NULL;
OPEN GroupMembers;
FETCH NEXT FROM GroupMembers INTO @ClientId, @CurrentGroupId, @CurrentMemberId;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @GroupId = @CurrentGroupId;
    SET @MemberId = @CurrentMemberId;
    WHILE EXISTS( SELECT h.GroupId, h.MemberId
                  FROM dbo.GroupsHistory h
                  WHERE h.GroupId = @GroupId
                    AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;
        IF @GroupId IS NULL
            BREAK;
        UPDATE g
        SET OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM Groups g
        WHERE ClientId = @ClientId;
        SET @GroupId = NULL;
        SET @MemberId = NULL;
    END;
    FETCH NEXT FROM GroupMembers INTO @ClientId, @CurrentGroupId, @CurrentMemberId;
END;
CLOSE GroupMembers;
DEALLOCATE GroupMembers;
UPDATE g
SET OriginalGroupId = GroupId,
    OriginalMemberId = MemberId
FROM Groups g
WHERE OriginalGroupId IS NULL;
COMMIT TRAN;
The good news is that the query completed in 50 seconds. That’s simply awesome considering how the code used to crawl in despair, and now it runs faster than the rCTE. How is it possible that a RBAR approach runs faster than a supposedly set-based solution? There’s a problem when using recursive CTEs: the optimizer can’t correctly estimate the number of rows coming out of the recursion, so it won’t create optimal plans. This problem might not occur every time, depending on different factors, but it is beyond the scope of this article.
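If you want to see the misestimate for yourself, here’s one quick way (a sketch, not part of the original solution): run the rCTE update with the statistics profile on, and compare the estimated and actual rows on the recursive branch of the plan.

--Compare estimated vs. actual rows for the recursive part of the plan.
SET STATISTICS PROFILE ON;
--...run the rCTE UPDATE from above here...
SET STATISTICS PROFILE OFF;
--In the resulting output, the EstimateRows column for the operators under
--the recursive branch is typically far below the Rows column, which is
--what pushes the optimizer toward a poor plan.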
There’s also heavy usage of tempdb to create the worktable needed for this operation. The indexes didn’t have a major impact on the rCTE, as it only went down from 90 seconds to 70 seconds, still slower than the improved RBAR approach.
You could say that 50 seconds is fast enough for a process that ran for hours without completing, but I can assure you that fast enough isn’t fast enough when there’s still room for improvement.
The Set-based Loop
I was able to identify the problems with the previous approaches, so I just needed to give the engine a chance to get the estimates right and to avoid RBAR. The solution was to go back to a loop, but thinking in sets. Basically, instead of doing one update per row, it would do one update per level of the hierarchy. Even if this approach seemed more resource intensive, it would let the optimizer estimate the rows correctly. The updates turned out like this:
UPDATE g
SET OriginalGroupId = ISNULL( h.PreviousGroupId, h.GroupId), --Uses ISNULL for first time clients
    OriginalMemberId = ISNULL( h.PreviousMemberId, h.MemberId)
FROM dbo.Groups g
JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId
                        AND g.MemberId = h.MemberId;
WHILE @@ROWCOUNT > 0 --Update until there's no previous group for any Client
    UPDATE g
    SET OriginalGroupId = h.PreviousGroupId,
        OriginalMemberId = h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.OriginalGroupId = h.GroupId
                            AND g.OriginalMemberId = h.MemberId
    WHERE h.PreviousGroupId IS NOT NULL;
Once more, the code is simpler than the original option. The performance was even better, as it took only 17 seconds to run on my laptop. This means that this query completed over four times faster than the rCTE and almost three times faster than the improved RBAR approach. On real data, it took two minutes to run, and up to five minutes after a few validations were added for some business rules that weren’t included before. I don’t want to make any assumptions, but it also seems to scale without problems, although differences in data distribution might affect the performance as well.
Testing
Of course, a single run wouldn’t prove much, because some factors (such as the current server workload) could affect the timing. That’s why I had to be sure that the results would be consistent and not just mere luck. Creating a test harness to prove that an improvement is real and not mere perception is as important as the improvement itself. That’s why I created the following test (the full script is available in the Resources):
USE Test; --Be sure that you're on a safe database. You could use tempdb as well.
GO
SET NOCOUNT ON;
IF OBJECT_ID( 'dbo.GroupsHistory') IS NOT NULL --Prevent errors on reruns
    DROP TABLE dbo.GroupsHistory;
CREATE TABLE dbo.GroupsHistory( --Create the table with relevant columns
    MemberId int NOT NULL,
    GroupId int NOT NULL,
    PreviousMemberId int NULL,
    PreviousGroupId int NULL,
    StartDate date NOT NULL,
    Level int NOT NULL --This column wasn't available on the original problem. It's here to help with test data generation.
);
IF OBJECT_ID( 'dbo.Groups') IS NOT NULL --Prevent errors on reruns
    DROP TABLE dbo.Groups;
CREATE TABLE dbo.Groups( --Create the table with relevant columns
    ClientId int IDENTITY NOT NULL,
    MemberId int NOT NULL,
    GroupId int NOT NULL,
    OriginalGroupId int NULL,
    OriginalMemberId int NULL
);
--Insert first level of test data
WITH E(n) AS(
    SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
),
TestData AS(
    SELECT ABS( CHECKSUM(NEWID())) % 100000 CurrentGroupId,
           DATEADD( dd, ABS( CHECKSUM(NEWID())) % 365, '20100101') StartDate
    FROM E a, E b, E c, E d, E f, E g
)
INSERT INTO GroupsHistory(MemberId, GroupId, StartDate, Level)
SELECT ROW_NUMBER() OVER(PARTITION BY CurrentGroupId ORDER BY (SELECT NULL)),
       CurrentGroupId,
       StartDate,
       1
FROM TestData;
--Generate the next levels for test data
DECLARE @Level int = 1;
WHILE @Level < 6
BEGIN
    INSERT INTO GroupsHistory(MemberId, GroupId, PreviousMemberId, PreviousGroupId, StartDate, Level)
    SELECT TOP 50 PERCENT
           ROW_NUMBER() OVER( PARTITION BY GroupId ORDER BY (SELECT NULL)),
           ABS( CHECKSUM(NEWID())) % (100000 / @Level) + ( @Level * 100000),
           MemberId,
           GroupId,
           DATEADD( dd, ABS( CHECKSUM(NEWID())) % 365, StartDate) StartDate,
           @Level + 1
    FROM dbo.GroupsHistory
    WHERE Level = @Level
    ORDER BY NEWID();
    SET @Level += 1;
END;
INSERT INTO dbo.Groups(GroupId, MemberId)
SELECT GroupId, MemberId
FROM dbo.GroupsHistory g
WHERE NOT EXISTS( SELECT 1 FROM dbo.GroupsHistory h
                  WHERE g.GroupId = h.PreviousGroupId
                    AND g.MemberId = h.PreviousMemberId);
--Create a timings table
IF OBJECT_ID('dbo.Timings') IS NOT NULL
    DROP TABLE dbo.Timings;
CREATE TABLE dbo.Timings(
    With_Index bit,
    rbar_seconds int,
    rCTE_seconds int,
    While_seconds int);
GO
DECLARE @SecondsRBAR int, @SecondsrCTE int, @SecondsWhile int, @Time datetime;
--Clean the procedure cache
DBCC FREEPROCCACHE WITH NO_INFOMSGS;
--Initialize the columns
UPDATE g
SET OriginalGroupId = NULL,
    OriginalMemberId = NULL
FROM dbo.Groups g
WHERE OriginalGroupId IS NOT NULL;
--Clean the buffer to read from disk and avoid possible misinterpretations
CHECKPOINT; --Ensure that the dirty pages are written out to disk.
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
--Set the timer
SET @Time = GETDATE();
--RCTE Update
WITH rCTE AS(
    SELECT g.ClientId, g.GroupId, g.MemberId, h.PreviousGroupId, h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId
                            AND g.MemberId = h.MemberId
    UNION ALL
    SELECT g.ClientId, g.GroupId, g.MemberId, h.PreviousGroupId, h.PreviousMemberId
    FROM rCTE g
    JOIN dbo.GroupsHistory h ON g.PreviousGroupId = h.GroupId
                            AND g.PreviousMemberId = h.MemberId
)
UPDATE g
SET OriginalGroupId = r.GroupId,
    OriginalMemberId = r.MemberId
FROM dbo.Groups g
JOIN rCTE r ON g.ClientId = r.ClientId
WHERE r.PreviousGroupId IS NULL;
--Get the time in seconds
SET @SecondsrCTE = DATEDIFF(ss, @Time, GETDATE());
--Initialize the columns
UPDATE g
SET OriginalGroupId = NULL,
    OriginalMemberId = NULL
FROM dbo.Groups g
WHERE OriginalGroupId IS NOT NULL;
--Clean the buffer to read from disk and avoid possible misinterpretations
CHECKPOINT; --Ensure that the dirty pages are written out to disk.
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
--Set the timer
SET @Time = GETDATE();
--Set-based loop Update
BEGIN TRAN;
UPDATE g
SET OriginalGroupId = ISNULL( h.PreviousGroupId, h.GroupId), --Uses ISNULL for first time clients
    OriginalMemberId = ISNULL( h.PreviousMemberId, h.MemberId)
FROM dbo.Groups g
JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId
                        AND g.MemberId = h.MemberId;
WHILE @@ROWCOUNT > 0 --Update until there's no previous group for any Client
    UPDATE g
    SET OriginalGroupId = h.PreviousGroupId,
        OriginalMemberId = h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.OriginalGroupId = h.GroupId
                            AND g.OriginalMemberId = h.MemberId
    WHERE h.PreviousGroupId IS NOT NULL;
COMMIT TRAN;
--Get the time in seconds
SET @SecondsWhile = DATEDIFF(ss, @Time, GETDATE());
--Insert the times into our timings table
INSERT INTO dbo.Timings (with_index, rbar_seconds, rCTE_seconds, while_seconds)
VALUES(0, @SecondsRBAR, @SecondsrCTE, @SecondsWhile);
--Run the test 10 times
GO 10
--Create indexes
CREATE CLUSTERED INDEX CI_Groups ON dbo.Groups(ClientId ASC);
CREATE NONCLUSTERED INDEX IX_GroupsHistory ON [dbo].[GroupsHistory] ([MemberId],[GroupId])
    INCLUDE ([PreviousMemberId],[PreviousGroupId]);
GO
DECLARE @SecondsRBAR int, @SecondsrCTE int, @SecondsWhile int, @Time datetime;
--Clean the procedure cache
DBCC FREEPROCCACHE WITH NO_INFOMSGS;
--Initialize the columns
UPDATE g
SET OriginalGroupId = NULL,
    OriginalMemberId = NULL
FROM dbo.Groups g
WHERE OriginalGroupId IS NOT NULL;
--Clean the buffer to read from disk and avoid possible misinterpretations
CHECKPOINT; --Ensure that the dirty pages are written out to disk.
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
--Set the timer
SET @Time = GETDATE();
--RBAR Update
DECLARE @ClientId int, @CurrentGroupId int, @CurrentMemberId int, @GroupId int, @MemberId int;
BEGIN TRAN;
DECLARE GroupMembers CURSOR STATIC FORWARD_ONLY READ_ONLY FOR
    SELECT g.ClientId, g.GroupId, g.MemberId
    FROM dbo.Groups g
    WHERE OriginalGroupId IS NULL
       OR OriginalMemberId IS NULL;
OPEN GroupMembers;
FETCH NEXT FROM GroupMembers INTO @ClientId, @CurrentGroupId, @CurrentMemberId;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @GroupId = @CurrentGroupId;
    SET @MemberId = @CurrentMemberId;
    WHILE EXISTS( SELECT h.GroupId, h.MemberId
                  FROM dbo.GroupsHistory h
                  WHERE h.GroupId = @GroupId
                    AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;
        IF @GroupId IS NULL
            BREAK;
        UPDATE g
        SET OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM Groups g
        WHERE ClientId = @ClientId;
        SET @GroupId = NULL;
        SET @MemberId = NULL;
    END;
    FETCH NEXT FROM GroupMembers INTO @ClientId, @CurrentGroupId, @CurrentMemberId;
END;
CLOSE GroupMembers;
DEALLOCATE GroupMembers;
UPDATE g
SET OriginalGroupId = GroupId,
    OriginalMemberId = MemberId
FROM Groups g
WHERE OriginalGroupId IS NULL;
COMMIT TRAN;
--Get the time in seconds
SET @SecondsRBAR = DATEDIFF(ss, @Time, GETDATE());
--Initialize the columns
UPDATE g
SET OriginalGroupId = NULL,
    OriginalMemberId = NULL
FROM dbo.Groups g
WHERE OriginalGroupId IS NOT NULL;
--Clean the buffer to read from disk and avoid possible misinterpretations
CHECKPOINT; --Ensure that the dirty pages are written out to disk.
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
--Set the timer
SET @Time = GETDATE();
--RCTE Update
WITH rCTE AS(
    SELECT g.ClientId, g.GroupId, g.MemberId, h.PreviousGroupId, h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId
                            AND g.MemberId = h.MemberId
    UNION ALL
    SELECT g.ClientId, g.GroupId, g.MemberId, h.PreviousGroupId, h.PreviousMemberId
    FROM rCTE g
    JOIN dbo.GroupsHistory h ON g.PreviousGroupId = h.GroupId
                            AND g.PreviousMemberId = h.MemberId
)
UPDATE g
SET OriginalGroupId = r.GroupId,
    OriginalMemberId = r.MemberId
FROM dbo.Groups g
JOIN rCTE r ON g.ClientId = r.ClientId
WHERE r.PreviousGroupId IS NULL;
--Get the time in seconds
SET @SecondsrCTE = DATEDIFF(ss, @Time, GETDATE());
--Initialize the columns
UPDATE g
SET OriginalGroupId = NULL,
    OriginalMemberId = NULL
FROM dbo.Groups g
WHERE OriginalGroupId IS NOT NULL;
--Clean the buffer to read from disk and avoid possible misinterpretations
CHECKPOINT; --Ensure that the dirty pages are written out to disk.
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
--Set the timer
SET @Time = GETDATE();
BEGIN TRAN;
--Set-based loop Update
UPDATE g
SET OriginalGroupId = ISNULL( h.PreviousGroupId, h.GroupId), --Uses ISNULL for first time clients
    OriginalMemberId = ISNULL( h.PreviousMemberId, h.MemberId)
FROM dbo.Groups g
JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId
                        AND g.MemberId = h.MemberId;
WHILE @@ROWCOUNT > 0 --Update until there's no previous group for any Client
    UPDATE g
    SET OriginalGroupId = h.PreviousGroupId,
        OriginalMemberId = h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.OriginalGroupId = h.GroupId
                            AND g.OriginalMemberId = h.MemberId
    WHERE h.PreviousGroupId IS NOT NULL;
COMMIT TRAN;
--Get the time in seconds
SET @SecondsWhile = DATEDIFF(ss, @Time, GETDATE());
--Insert the times into our timings table
INSERT INTO dbo.Timings (with_index, rbar_seconds, rCTE_seconds, while_seconds)
VALUES(1, @SecondsRBAR, @SecondsrCTE, @SecondsWhile);
--Run the test 10 times
GO 10
--Query the Test Timings
WITH CTE AS(
    SELECT ROW_NUMBER() OVER(PARTITION BY with_index ORDER BY(SELECT NULL)) Iteration,
           With_Index, RBAR_Seconds, rCTE_seconds, While_seconds
    FROM Timings
)
SELECT Iteration, With_Index,
       AVG(rbar_seconds) AS RBAR_Seconds,
       AVG(rcte_seconds) AS rCTE_seconds,
       AVG(while_seconds) AS While_seconds
FROM CTE
GROUP BY GROUPING SETS(Iteration, with_index), (with_index);
GO
--Uncomment the following lines to clean the database after our tests.
--DROP TABLE dbo.Timings;
--DROP TABLE dbo.Groups;
--DROP TABLE dbo.GroupsHistory;
The results were clear and consistent, with the set-based loop the fastest every time. The timings for each run are shown below.
Iteration With_Index RBAR_Seconds rCTE_seconds While_seconds
--------- ---------- ------------ ------------ -------------
1         0          NULL         90           19
2         0          NULL         86           17
3         0          NULL         88           17
4         0          NULL         90           19
5         0          NULL         90           16
6         0          NULL         90           18
7         0          NULL         88           17
8         0          NULL         87           18
9         0          NULL         87           17
10        0          NULL         90           18
AVG       0          NULL         88           17
1         1          65           69           17
2         1          48           68           18
3         1          48           70           18
4         1          52           68           19
5         1          48           65           18
6         1          48           68           18
7         1          49           73           17
8         1          48           66           17
9         1          48           69           17
10        1          49           68           17
AVG       1          50           68           17
Disclaimer: I didn’t include the RBAR option without indexes, for obvious reasons. Nor did I include all the tests I made with different variations of the code, as none of them affected performance in a significant way.
Conclusion
There’s nothing inherently wrong with a WHILE loop; it can be faster than other methods that people assume to be “pure set-based programming”, such as recursive CTEs and triangular joins. The problem is when people use loops to process one row at a time instead of thinking in sets.
I won’t say that a WHILE loop is the best option every single time. My intention is to make clear that not all loops are RBAR, the same way that RBAR is not always a WHILE loop, and that RBAR is not always the evilest thing around.
There are further problems that can appear when working with these kinds of hierarchies, such as infinite loops (if the data contains cycles) and the Halloween problem, which aren’t addressed in this article but might be worth considering when implementing a similar solution.
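As a defensive measure against cycles, the set-based loop can be capped at a maximum number of iterations. This is only a sketch of the idea; the @MaxLevels limit is invented for illustration and wasn’t part of the original solution:

--A defensive variant of the set-based loop, capped at @MaxLevels iterations.
DECLARE @Rows int = 1,
        @Levels int = 0,
        @MaxLevels int = 100; --Invented safety limit; size it above your deepest expected hierarchy
--Seed pass: same first UPDATE as before (uses ISNULL for first time clients)
UPDATE g
SET OriginalGroupId = ISNULL( h.PreviousGroupId, h.GroupId),
    OriginalMemberId = ISNULL( h.PreviousMemberId, h.MemberId)
FROM dbo.Groups g
JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId
                        AND g.MemberId = h.MemberId;
WHILE @Rows > 0 AND @Levels < @MaxLevels
BEGIN
    UPDATE g
    SET OriginalGroupId = h.PreviousGroupId,
        OriginalMemberId = h.PreviousMemberId
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.OriginalGroupId = h.GroupId
                            AND g.OriginalMemberId = h.MemberId
    WHERE h.PreviousGroupId IS NOT NULL;
    SET @Rows = @@ROWCOUNT; --Capture immediately; later statements reset it
    SET @Levels += 1;
END;
IF @Rows > 0 --The cap stopped the loop while rows were still changing
    RAISERROR('Possible cycle detected in dbo.GroupsHistory.', 16, 1);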
I want to thank Jeff Moden for inviting me to write an article as it has been a great experience. I also want to thank Eirikur Eiriksson, Wayne Sheffield and Alan Burstein for their inputs on the original draft and their suggestions on including the indexes.
Thanks for reading, any feedback would be greatly appreciated.