October 27, 2014 at 8:51 am
Try again:
Like Jeff I set up a scaled-up dataset to test against but ran out of time over the weekend to post it up. After reading Jeff's post I've added the missing output column to my two queries and a placeholder for it in sqldriver's modified IBG query.
Here's the scale-up code, which results in a 999999-row table where only the dupes in the OP's original data exist - in other words, it's very heavily skewed:
-- adjust the type and size of the address column. 80 leaves room for "spareletters" below.
ALTER TABLE dbo.TestBusiness ALTER COLUMN [Address] VARCHAR(80) NOT NULL
;WITH
E1 AS (SELECT * FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)),
ThousandRows AS (SELECT
SpareLetters = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
NEWID(),
'-',''),'0',''),'1',''),'2',''),'3',''),'4',''),'5',''),'6',''),'7',''),'8',''),'9','') + ' - '
FROM E1 a, E1 b, E1 c)
INSERT INTO dbo.TestBusiness ([BusinessName], [Address], [Phone])
SELECT
BusinessName = SpareLetters + BusinessName ,
[Address] = SpareLetters + [Address],
Phone
FROM [dbo].[TestBusiness] bus
CROSS JOIN ThousandRows t
-- create a useful index
DROP INDEX ix_Helper ON dbo.TestBusiness
CREATE INDEX ix_Helper ON dbo.TestBusiness ([BusinessName], [Address], [Phone]) INCLUDE ([Business_pk])
Here are the queries:
PRINT '-----------------------------------------------------------------------------------------------------------------'
DROP TABLE [dbo].[PossibleDuplicateBusinesses]
SET STATISTICS IO, TIME ON
PRINT 'sqldriver01';
WITH Dups AS (
SELECT
bus.[Business_pk]
,[BusinessName], [Address], [Phone]
,ROW_NUMBER() OVER (PARTITION BY [BusinessName], [Address], [Phone] ORDER BY bus.[Business_pk] DESC) AS MatchRank,
NewGroupName = 0
FROM [dbo].[TestBusiness] bus
)
SELECT *
INTO [dbo].[PossibleDuplicateBusinesses]
FROM Dups a
WHERE EXISTS (
SELECT 1
FROM Dups b
WHERE a.BusinessName = b.BusinessName
AND a.Address = b.Address
AND a.Phone = b.Phone
AND b.MatchRank > 1
);
SET STATISTICS IO, TIME OFF
PRINT '-----------------------------------------------------------------------------------------------------------------'
DROP TABLE [dbo].[PossibleDuplicateBusinesses]
SET STATISTICS IO, TIME ON
PRINT 'ChrisM 01';
WITH Dups AS (
SELECT BusinessName, [Address], Phone
FROM dbo.TestBusiness
GROUP BY BusinessName, [Address], Phone
HAVING COUNT(*) > 1
)
SELECT a.Business_pk, a.BusinessName, a.[Address], a.Phone,
MatchRank = ROW_NUMBER() OVER (PARTITION BY a.BusinessName, a.[Address], a.Phone ORDER BY a.Business_pk DESC),
NewGroupName = DENSE_RANK() OVER (ORDER BY a.BusinessName, a.[Address], a.Phone)
INTO [dbo].[PossibleDuplicateBusinesses]
FROM Dups b
INNER hash JOIN dbo.TestBusiness a
ON a.BusinessName = b.BusinessName
AND a.Address = b.Address
AND a.Phone = b.Phone;
SET STATISTICS IO, TIME OFF
PRINT '-----------------------------------------------------------------------------------------------------------------'
DROP TABLE [dbo].[PossibleDuplicateBusinesses]
SET STATISTICS IO, TIME ON
PRINT 'ChrisM 02';
WITH Dups AS (
SELECT BusinessName, [Address], Phone
FROM dbo.TestBusiness
GROUP BY BusinessName, [Address], Phone
HAVING COUNT(*) > 1
)
SELECT a.Business_pk, a.BusinessName, a.[Address], a.Phone,
MatchRank = ROW_NUMBER() OVER (PARTITION BY a.BusinessName, a.[Address], a.Phone ORDER BY a.Business_pk DESC),
NewGroupName = DENSE_RANK() OVER (ORDER BY a.BusinessName, a.[Address], a.Phone)
INTO [dbo].[PossibleDuplicateBusinesses]
FROM Dups b
INNER loop JOIN dbo.TestBusiness a -- Changes MERGE JOIN to NESTED LOOPS, loses sort required for ROW_NUMBER()
ON a.BusinessName = b.BusinessName
AND a.[Address] = b.[Address]
AND a.Phone = b.Phone;
SET STATISTICS IO, TIME OFF
PRINT '-----------------------------------------------------------------------------------------------------------------';
DROP TABLE [dbo].[PossibleDuplicateBusinesses]
SET STATISTICS IO, TIME ON
PRINT 'sqldriver 02';
WITH dupe1
AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN ,
a.BusinessName, a.[Address], a.Phone
FROM TestBusiness a
),
dupe2
AS ( SELECT d.BusinessName, d.[Address], d.Phone, NewGroupName = NEWID()
FROM dupe1 d
GROUP BY d.BusinessName, d.[Address], d.Phone
HAVING COUNT(*) > 1
)
SELECT d1.*, d2.NewGroupName
INTO [dbo].[PossibleDuplicateBusinesses]
FROM dupe1 d1
INNER JOIN dupe2 d2
ON d1.BusinessName = d2.BusinessName
AND d1.[Address] = d2.[Address]
AND d1.Phone = d2.Phone
WHERE d1.RN > 1;
SET STATISTICS IO, TIME OFF
PRINT '-----------------------------------------------------------------------------------------------------------------';
DROP TABLE [dbo].[PossibleDuplicateBusinesses]
SET STATISTICS IO, TIME ON
PRINT 'JBM 01';
WITH
cteUnique AS
(
SELECT DupeGroupNumber = ROW_NUMBER() OVER(ORDER BY BusinessName,Address,Phone)
,BusinessName, Address, Phone
FROM dbo.TestBusiness
GROUP BY BusinessName, Address, Phone
HAVING COUNT(*) > 1
)
SELECT u.DupeGroupNumber
,DupeGroupSequence = ROW_NUMBER() OVER (PARTITION BY u.DupeGroupNumber ORDER BY b.Business_PK)
,b.*
INTO [dbo].[PossibleDuplicateBusinesses]
FROM dbo.TestBusiness b
JOIN cteUnique u
ON b.BusinessName = u.BusinessName
AND b.Address = u.Address
AND b.Phone = u.Phone
;
SET STATISTICS IO, TIME OFF
PRINT '-----------------------------------------------------------------------------------------------------------------';
The batch was run a few times, until the execution times of each batch was stable and reproducible. Here are the results, with noise removed:
-----------------------------------------------------------------------------------------------------------------
sqldriver01
Table 'TestBusiness'. Scan count 10, logical reads 32644, physical reads 0
SQL Server Execution Times:
CPU time = 13996 ms, elapsed time = 22828 ms.
-----------------------------------------------------------------------------------------------------------------
ChrisM 01
Table 'TestBusiness'. Scan count 10, logical reads 32644, physical reads 0
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
SQL Server Execution Times:
CPU time = 12044 ms, elapsed time = 4593 ms.
-----------------------------------------------------------------------------------------------------------------
ChrisM 02
Table 'TestBusiness'. Scan count 33, logical reads 16507, physical reads 0
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
SQL Server Execution Times:
CPU time = 7550 ms, elapsed time = 3032 ms.
-----------------------------------------------------------------------------------------------------------------
sqldriver 02
Table 'TestBusiness'. Scan count 2, logical reads 24683, physical reads 0
SQL Server Execution Times:
CPU time = 3073 ms, elapsed time = 3225 ms.
-----------------------------------------------------------------------------------------------------------------
JBM 01
Table 'TestBusiness'. Scan count 10, logical reads 33135, physical reads 0
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
SQL Server Execution Times:
CPU time = 12261 ms, elapsed time = 4855 ms.
-----------------------------------------------------------------------------------------------------------------
From queries posted so far, some sort of pre-aggregation to get the rowcount down quickly and efficiently wins over partitioning the whole set, and using window functions on the final SELECT seems to give an edge over placing them anywhere else in the query.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing post 31 (of 30 total)
You must be logged in to reply to this topic. Login to reply