Rank duplicates, but only rows involved in duplicates

  • Try again:

    Like Jeff I set up a scaled-up dataset to test against but ran out of time over the weekend to post it up. After reading Jeff's post I've added the missing output column to my two queries and a placeholder for it in sqldriver's modified IBG query.

    Here's the scale-up code, which results in a 999999-row table where only the dupes in the OP's original data exist - in other words, it's very heavily skewed:

    -- adjust the type and size of the address column. 80 leaves room for "spareletters" below.

    ALTER TABLE dbo.TestBusiness ALTER COLUMN [Address] VARCHAR(80) NOT NULL

    ;WITH

    E1 AS (SELECT * FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)),

    ThousandRows AS (SELECT

    SpareLetters = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    NEWID(),

    '-',''),'0',''),'1',''),'2',''),'3',''),'4',''),'5',''),'6',''),'7',''),'8',''),'9','') + ' - '

    FROM E1 a, E1 b, E1 c)

    INSERT INTO dbo.TestBusiness ([BusinessName], [Address], [Phone])

    SELECT

    BusinessName = SpareLetters + BusinessName ,

    [Address] = SpareLetters + [Address],

    Phone

    FROM [dbo].[TestBusiness] bus

    CROSS JOIN ThousandRows t

    -- create a useful index

    DROP INDEX ix_Helper ON dbo.TestBusiness

    CREATE INDEX ix_Helper ON dbo.TestBusiness ([BusinessName], [Address], [Phone]) INCLUDE ([Business_pk])

    Here are the queries:

    PRINT '-----------------------------------------------------------------------------------------------------------------'

    DROP TABLE [dbo].[PossibleDuplicateBusinesses]

    SET STATISTICS IO, TIME ON

    PRINT 'sqldriver01';

    WITH Dups AS (

    SELECT

    bus.[Business_pk]

    ,[BusinessName], [Address], [Phone]

    ,ROW_NUMBER() OVER (PARTITION BY [BusinessName], [Address], [Phone] ORDER BY bus.[Business_pk] DESC) AS MatchRank,

    NewGroupName = 0

    FROM [dbo].[TestBusiness] bus

    )

    SELECT *

    INTO [dbo].[PossibleDuplicateBusinesses]

    FROM Dups a

    WHERE EXISTS (

    SELECT 1

    FROM Dups b

    WHERE a.BusinessName = b.BusinessName

    AND a.Address = b.Address

    AND a.Phone = b.Phone

    AND b.MatchRank > 1

    );

    SET STATISTICS IO, TIME OFF

    PRINT '-----------------------------------------------------------------------------------------------------------------'

    DROP TABLE [dbo].[PossibleDuplicateBusinesses]

    SET STATISTICS IO, TIME ON

    PRINT 'ChrisM 01';

    WITH Dups AS (

    SELECT BusinessName, [Address], Phone

    FROM dbo.TestBusiness

    GROUP BY BusinessName, [Address], Phone

    HAVING COUNT(*) > 1

    )

    SELECT a.Business_pk, a.BusinessName, a.[Address], a.Phone,

    MatchRank = ROW_NUMBER() OVER (PARTITION BY a.BusinessName, a.[Address], a.Phone ORDER BY a.Business_pk DESC),

    NewGroupName = DENSE_RANK() OVER (ORDER BY a.BusinessName, a.[Address], a.Phone)

    INTO [dbo].[PossibleDuplicateBusinesses]

    FROM Dups b

    INNER hash JOIN dbo.TestBusiness a

    ON a.BusinessName = b.BusinessName

    AND a.Address = b.Address

    AND a.Phone = b.Phone;

    SET STATISTICS IO, TIME OFF

    PRINT '-----------------------------------------------------------------------------------------------------------------'

    DROP TABLE [dbo].[PossibleDuplicateBusinesses]

    SET STATISTICS IO, TIME ON

    PRINT 'ChrisM 02';

    WITH Dups AS (

    SELECT BusinessName, [Address], Phone

    FROM dbo.TestBusiness

    GROUP BY BusinessName, [Address], Phone

    HAVING COUNT(*) > 1

    )

    SELECT a.Business_pk, a.BusinessName, a.[Address], a.Phone,

    MatchRank = ROW_NUMBER() OVER (PARTITION BY a.BusinessName, a.[Address], a.Phone ORDER BY a.Business_pk DESC),

    NewGroupName = DENSE_RANK() OVER (ORDER BY a.BusinessName, a.[Address], a.Phone)

    INTO [dbo].[PossibleDuplicateBusinesses]

    FROM Dups b

    INNER loop JOIN dbo.TestBusiness a -- Changes MERGE JOIN to NESTED LOOPS, loses sort required for ROW_NUMBER()

    ON a.BusinessName = b.BusinessName

    AND a.[Address] = b.[Address]

    AND a.Phone = b.Phone;

    SET STATISTICS IO, TIME OFF

    PRINT '-----------------------------------------------------------------------------------------------------------------';

    DROP TABLE [dbo].[PossibleDuplicateBusinesses]

    SET STATISTICS IO, TIME ON

    PRINT 'sqldriver 02';

    WITH dupe1

    AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN ,

    a.BusinessName, a.[Address], a.Phone

    FROM TestBusiness a

    ),

    dupe2

    AS ( SELECT d.BusinessName, d.[Address], d.Phone, NewGroupName = NEWID()

    FROM dupe1 d

    GROUP BY d.BusinessName, d.[Address], d.Phone

    HAVING COUNT(*) > 1

    )

    SELECT d1.*, d2.NewGroupName

    INTO [dbo].[PossibleDuplicateBusinesses]

    FROM dupe1 d1

    INNER JOIN dupe2 d2

    ON d1.BusinessName = d2.BusinessName

    AND d1.[Address] = d2.[Address]

    AND d1.Phone = d2.Phone

    WHERE d1.RN > 1;

    SET STATISTICS IO, TIME OFF

    PRINT '-----------------------------------------------------------------------------------------------------------------';

    DROP TABLE [dbo].[PossibleDuplicateBusinesses]

    SET STATISTICS IO, TIME ON

    PRINT 'JBM 01';

    WITH

    cteUnique AS

    (

    SELECT DupeGroupNumber = ROW_NUMBER() OVER(ORDER BY BusinessName,Address,Phone)

    ,BusinessName, Address, Phone

    FROM dbo.TestBusiness

    GROUP BY BusinessName, Address, Phone

    HAVING COUNT(*) > 1

    )

    SELECT u.DupeGroupNumber

    ,DupeGroupSequence = ROW_NUMBER() OVER (PARTITION BY u.DupeGroupNumber ORDER BY b.Business_PK)

    ,b.*

    INTO [dbo].[PossibleDuplicateBusinesses]

    FROM dbo.TestBusiness b

    JOIN cteUnique u

    ON b.BusinessName = u.BusinessName

    AND b.Address = u.Address

    AND b.Phone = u.Phone

    ;

    SET STATISTICS IO, TIME OFF

    PRINT '-----------------------------------------------------------------------------------------------------------------';

    The batch was run a few times, until the execution times of each batch was stable and reproducible. Here are the results, with noise removed:

    -----------------------------------------------------------------------------------------------------------------

    sqldriver01

    Table 'TestBusiness'. Scan count 10, logical reads 32644, physical reads 0

    SQL Server Execution Times:

    CPU time = 13996 ms, elapsed time = 22828 ms.

    -----------------------------------------------------------------------------------------------------------------

    ChrisM 01

    Table 'TestBusiness'. Scan count 10, logical reads 32644, physical reads 0

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0

    SQL Server Execution Times:

    CPU time = 12044 ms, elapsed time = 4593 ms.

    -----------------------------------------------------------------------------------------------------------------

    ChrisM 02

    Table 'TestBusiness'. Scan count 33, logical reads 16507, physical reads 0

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0

    SQL Server Execution Times:

    CPU time = 7550 ms, elapsed time = 3032 ms.

    -----------------------------------------------------------------------------------------------------------------

    sqldriver 02

    Table 'TestBusiness'. Scan count 2, logical reads 24683, physical reads 0

    SQL Server Execution Times:

    CPU time = 3073 ms, elapsed time = 3225 ms.

    -----------------------------------------------------------------------------------------------------------------

    JBM 01

    Table 'TestBusiness'. Scan count 10, logical reads 33135, physical reads 0

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0

    SQL Server Execution Times:

    CPU time = 12261 ms, elapsed time = 4855 ms.

    -----------------------------------------------------------------------------------------------------------------

    From queries posted so far, some sort of pre-aggregation to get the rowcount down quickly and efficiently wins over partitioning the whole set, and using window functions on the final SELECT seems to give an edge over placing them anywhere else in the query.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing post 31 (of 30 total)

You must be logged in to reply to this topic. Login to reply