February 17, 2017 at 9:19 am
I have a subset of customers:
CREATE TABLE #temptable ( [ID] int, [Mobile] nvarchar(20), [DOB] date, [Email] nvarchar(50), [PostCode] nvarchar(10) )
INSERT INTO #temptable
VALUES
( 1, N'078', N'1970-01-01T00:00:00', N'bob@bob.com', N'L1 3RT' ),
( 2, N'079', N'1981-01-02T00:00:00', N'bob@bob.com', N'L1 3RT' ),
( 3, N'034', N'1986-05-03T00:00:00', N'bert@aol.com', N'CH1 3ER' ),
( 4, N'078', N'1970-01-01T00:00:00', N'bob2@bob.com', N'L1 3RT1' )
I need to detect associations between customers - this is done by matching customers where any 2 fields of the 4 (Mobile, DOB, Email, PostCode) match.
The data set is approx 2.5m records.
Everything I've tried to date has been too slow, for instance self-joins on pairs of fields. I've not tried cursors as I think they will be a no-no too.
The result I'm after is a table with 2 columns: ID, RelatedID (each ID can have one or many RelatedIDs).
For instance the result on the sample data would be :
ID RelatedID
1 1
1 2
1 4
2 1
3 3
4 1
4 4
I wondered what the optimal way of doing this is. To be honest I'm beginning to think it's not even up SQL's street.
I also wondered if there's anything in SQL Server 2016's T-SQL that comes in handy in this respect?
February 17, 2017 at 10:58 am
Well, I can get you down to:
Scan count 12, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Here's the query:
CREATE TABLE #temptable (
ID int,
Mobile nvarchar(20),
DOB date,
Email nvarchar(50),
PostCode nvarchar(10)
);
INSERT INTO #temptable (ID, Mobile, DOB, Email, PostCode)
VALUES
(1, N'078', N'1970-01-01T00:00:00', N'bob@bob.com', N'L1 3RT'),
(2, N'079', N'1981-01-02T00:00:00', N'bob@bob.com', N'L1 3RT'),
(3, N'034', N'1986-05-03T00:00:00', N'bert@aol.com', N'CH1 3ER'),
(4, N'078', N'1970-01-01T00:00:00', N'bob2@bob.com', N'L1 3RT1');
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT T.ID, T2.ID AS RelatedID, 'Mobile / DOB' AS RelatedFields
FROM #temptable AS T
INNER JOIN #temptable AS T2
ON T.ID <> T2.ID
AND T2.Mobile = T.Mobile
AND T2.DOB = T.DOB
UNION
SELECT T.ID, T2.ID AS RelatedID, 'Mobile / Email' AS RelatedFields
FROM #temptable AS T
INNER JOIN #temptable AS T2
ON T.ID <> T2.ID
AND T2.Mobile = T.Mobile
AND T2.Email = T.Email
UNION
SELECT T.ID, T2.ID AS RelatedID, 'Mobile / PostCode' AS RelatedFields
FROM #temptable AS T
INNER JOIN #temptable AS T2
ON T.ID <> T2.ID
AND T2.Mobile = T.Mobile
AND T2.PostCode = T.PostCode
UNION
SELECT T.ID, T2.ID AS RelatedID, 'DOB / Email' AS RelatedFields
FROM #temptable AS T
INNER JOIN #temptable AS T2
ON T.ID <> T2.ID
AND T2.DOB = T.DOB
AND T2.Email = T.Email
UNION
SELECT T.ID, T2.ID AS RelatedID, 'DOB / PostCode' AS RelatedFields
FROM #temptable AS T
INNER JOIN #temptable AS T2
ON T.ID <> T2.ID
AND T2.DOB = T.DOB
AND T2.PostCode = T.PostCode
UNION
SELECT T.ID, T2.ID AS RelatedID, 'Email / PostCode' AS RelatedFields
FROM #temptable AS T
INNER JOIN #temptable AS T2
ON T.ID <> T2.ID
AND T2.Email = T.Email
AND T2.PostCode = T.PostCode
ORDER BY ID, RelatedID;
DROP TABLE #temptable;
It covers all 6 possible pairs of the 4 fields (4 choose 2 = 6), which would appear to be the only necessary combinations; the more fields you compare, the more pairs there are and the worse things get. Let me know if that's any improvement over what you have.
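For what it's worth, here's a throwaway sketch (illustrative only, not part of the solution) that enumerates those C(4,2) = 6 pairs, handy as a sanity check if more fields ever get added:
WITH Cols AS
(
    SELECT V.ColID, V.ColName
    FROM (VALUES (1, 'Mobile'), (2, 'DOB'), (3, 'Email'), (4, 'PostCode')) AS V(ColID, ColName)
)
SELECT C1.ColName + ' / ' + C2.ColName AS ColumnPair
FROM Cols AS C1
INNER JOIN Cols AS C2
    ON C2.ColID > C1.ColID -- each unordered pair exactly once
ORDER BY C1.ColID, C2.ColID;
-- 6 rows: Mobile / DOB, Mobile / Email, Mobile / PostCode, DOB / Email, DOB / PostCode, Email / PostCode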
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
February 17, 2017 at 12:10 pm
Just another idea: unpivot the columns you want to compare and count the matches per pair of IDs. At least it makes it easier to add more columns to compare.
WITH ColumnValues AS
(
SELECT
ID, ColumnID, ColumnValue
FROM
#temptable
CROSS APPLY
(
VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
) CA(ColumnID, ColumnValue)
)
SELECT
CV1.ID, CV2.ID
FROM
ColumnValues CV1
JOIN
ColumnValues CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
GROUP BY
CV1.ID, CV2.ID
HAVING
COUNT(*) > 1
ORDER BY
CV1.ID, CV2.ID;
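A side note on why this works: each column on which two IDs agree contributes one row to the join, so HAVING COUNT(*) > 1 is exactly the "any 2 of the 4" rule, and self-pairs like (1,1) fall out because every row matches itself on all four columns. Since the relationship is symmetric, one possible trim (my own speculation, untested at scale) is to build each unordered pair once and mirror it afterwards, roughly halving the aggregation work:
WITH ColumnValues AS
(
    -- same unpivot as above
    SELECT ID, ColumnID, ColumnValue
    FROM #temptable
    CROSS APPLY
    (
        VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
    ) CA(ColumnID, ColumnValue)
),
Pairs AS
(
    SELECT CV1.ID AS ID1, CV2.ID AS ID2
    FROM ColumnValues CV1
    JOIN ColumnValues CV2 ON CV2.ColumnID = CV1.ColumnID
        AND CV2.ColumnValue = CV1.ColumnValue
        AND CV2.ID >= CV1.ID -- each unordered pair once; >= keeps the self-pairs
    GROUP BY CV1.ID, CV2.ID
    HAVING COUNT(*) > 1
)
SELECT ID1, ID2 FROM Pairs
UNION -- UNION rather than UNION ALL, so the mirrored self-pairs collapse
SELECT ID2, ID1 FROM Pairs
ORDER BY ID1, ID2;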
It's probably a good idea to persist the outcome of the CTE ColumnValues in a temporary table and put an index on it.
SELECT
ID, ColumnID, ColumnValue
INTO
#t
FROM
#temptable
CROSS APPLY
(
VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
) CA(ColumnID, ColumnValue);
CREATE CLUSTERED INDEX PK_T ON #t(ColumnID, ColumnValue);
SELECT
CV1.ID, CV2.ID
FROM
#t CV1
JOIN
#t CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
GROUP BY
CV1.ID, CV2.ID
HAVING
COUNT(*) > 1
ORDER BY
CV1.ID, CV2.ID;
P.S. Since (1,1), (3,3) and (4,4) are part of the expected output, I guess (2,2) is missing in your sample output.
February 17, 2017 at 12:39 pm
Nice query, Peter. Only 2 scans and 2 logical reads. Here's a slightly modified version that can also tell you what field pairs were involved:
WITH ColumnValues AS (
SELECT ID, ColumnID, ColumnName, ColumnValue
FROM #temptable
CROSS APPLY
(
VALUES (1, 'Mobile', CAST(Mobile AS nvarchar(50))),
(2, 'DOB', CAST(DOB AS nvarchar(50))),
(3, 'Email', Email),
(4, 'PostCode', CAST(PostCode AS nvarchar(50)))
) AS CA(ColumnID, ColumnName, ColumnValue)
)
SELECT CV1.ID, CV2.ID AS RelatedID, MIN(CV1.ColumnName) + ' / ' + MAX(CV2.ColumnName) AS RelatedFields
FROM ColumnValues AS CV1
JOIN ColumnValues AS CV2
ON CV2.ColumnID = CV1.ColumnID
AND CV2.ColumnValue = CV1.ColumnValue
AND CV2.ID <> CV1.ID
GROUP BY CV1.ID, CV2.ID
HAVING COUNT(*) > 1
ORDER BY CV1.ID, CV2.ID;
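One caveat on the MIN/MAX trick: when three or more fields match, only the alphabetically first and last column names get reported. If the full list matters, here's a sketch using the old FOR XML PATH concatenation (fine on 2016; STRING_AGG would be neater from SQL Server 2017 onwards) - same ColumnValues unpivot, different final SELECT:
WITH ColumnValues AS (
    SELECT ID, ColumnID, ColumnName, ColumnValue
    FROM #temptable
    CROSS APPLY
    (
        VALUES (1, 'Mobile', CAST(Mobile AS nvarchar(50))),
               (2, 'DOB', CAST(DOB AS nvarchar(50))),
               (3, 'Email', Email),
               (4, 'PostCode', CAST(PostCode AS nvarchar(50)))
    ) AS CA(ColumnID, ColumnName, ColumnValue)
)
SELECT CV1.ID, CV2.ID AS RelatedID,
       STUFF((SELECT N' / ' + X.ColumnName
              FROM ColumnValues AS X
              JOIN ColumnValues AS Y
                ON Y.ColumnID = X.ColumnID
               AND Y.ColumnValue = X.ColumnValue
              WHERE X.ID = CV1.ID
                AND Y.ID = CV2.ID
              ORDER BY X.ColumnID
              FOR XML PATH('')), 1, 3, N'') AS RelatedFields -- strip the leading ' / '
FROM ColumnValues AS CV1
JOIN ColumnValues AS CV2
  ON CV2.ColumnID = CV1.ColumnID
 AND CV2.ColumnValue = CV1.ColumnValue
 AND CV2.ID <> CV1.ID
GROUP BY CV1.ID, CV2.ID
HAVING COUNT(*) > 1
ORDER BY CV1.ID, CV2.ID;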
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
February 17, 2017 at 1:08 pm
With 4 rows as test input, the scan and logical read counts are a little bit meaningless. The version with the persisted CTE ColumnValues resulted in 17 scans (1 table scan on #t and 16 index lookups). With 2.5m rows as the real input (resulting in 10m rows in #t), I expect two clustered index scans on #t and a merge join followed by a sort for the aggregation. Since the row size of #t is very small, I guess this might be a relatively fast solution. But again, just guessing.
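If anyone wants to check that guess before committing to a long run against the 2.5m rows, the estimated plan can be pulled without executing the query at all - plain T-SQL, nothing specific to this thread:
SET SHOWPLAN_TEXT ON;
GO
-- compiled and the estimated plan returned, but not executed
SELECT CV1.ID, CV2.ID
FROM #t CV1
JOIN #t CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
GROUP BY CV1.ID, CV2.ID
HAVING COUNT(*) > 1;
GO
SET SHOWPLAN_TEXT OFF;
GO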
February 20, 2017 at 2:38 am
I'll test the queries and let you know the stats.
Lee
February 20, 2017 at 6:57 am
Thanks for providing the solutions. It's interesting to see how the unions can be removed with the CROSS APPLY unpivot and a single self-join (good learning for the future). sgmunson, nice way to deliver the match type - thanks 🙂
I persisted the CTE and added the clustered index (a nice optimisation I'll note for the future). With 2.5m rows the query was still too expensive - it ran for over 2 hours before I killed it, which matches my earlier experience.
There's been a change in requirements which let me reduce the data set to 65K rows; after un-pivoting this grows to 260K.
Running the equi-join was very fast with the smaller dataset.
SQL Server parse and compile time:
CPU time = 10 ms, elapsed time = 10 ms.
Table '#t_'. Scan count 10, logical reads 4160, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 22076, logical reads 429364, physical reads 0, read-ahead reads 553, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 20, logical reads 4984, physical reads 423, read-ahead reads 4561, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6155 ms, elapsed time = 2956 ms.
Cheers.
February 21, 2017 at 2:26 pm
leehbi - Friday, February 17, 2017 9:19 AM
Remember the first week of your database class? The definition of a primary key? It has to be made up of attributes in the table and not a physical count of things in the storage of the table. We also have a DATE data type in SQL that you failed to use in the insertion statement. Here is an attempt to correct what you did:
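(The Customers DDL never appears in the post; the following is my minimal guess, consistent with the insertion statement below and with the email address as the key.)
CREATE TABLE Customers
(
    mobile      NVARCHAR(20) NOT NULL,
    birth_date  DATE NOT NULL,
    email       NVARCHAR(50) NOT NULL PRIMARY KEY,
    postal_code NVARCHAR(10) NOT NULL
);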
INSERT INTO Customers
VALUES
('078','1970-01-01', 'bob@bob.com', 'L1 3RT'),
('079','1981-01-02', 'bob@bob.com', 'L1 3RT'), -- error: duplicate primary key (email)
('034','1986-05-03', 'bert@aol.com', 'CH1 3ER'),
('078','1970-01-01', 'bob2@bob.com', 'L1 3RT1');
The first two lines of the insertion statement are wrong; they have a duplicate primary key, which is the email address. This should have been handled with an update of some kind that could change the birth date. We really cannot fix the fact that you have a bad system with improper performance.
>>I need to detect associations between customers - this is done by matching customers where any 2 fields of [sic] the 4 (Mobile, birth_date, Email, PostCode) match. <<
Do you understand the fundamental concept of an RDBMS? Rows in the table have a key which makes them unique. You should have been scrubbing your data before you put it in the table to guarantee that you have actual real records. What you are doing is the way we used to do it with mag tapes and punch cards: throw a lot of bad data in a file, sort it, and then scan for pairs as possible duplicates. We do not do that anymore.
>>The record [sic: rows are not records!] size is approx 2.5m. <<
Please learn the difference between rows and records; it is one of your major problems. Also, what do you mean, the size of a row is 2.5m?
>>Everything I've tried to date is too slow, for instance self-joins on pairs of fields [sic]. <<
Do you understand the problem? You are trying to clean up the mess after you make it. Do not let the bad data get in in the first place.
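To make that concrete (my illustration, using the guessed Customers DDL above): with the email address as the primary key, the duplicate is rejected at insert time instead of being mined out of millions of rows afterwards.
-- assuming the first bob@bob.com row is already in Customers,
-- this second one fails immediately with a PRIMARY KEY violation (Msg 2627)
INSERT INTO Customers (mobile, birth_date, email, postal_code)
VALUES ('079', '1981-01-02', 'bob@bob.com', 'L1 3RT');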
>>I've not tried cursors as I think they will be a no-no too. <<
Oh yes! This is like walking onto a vegan website and saying that you want to eat fried babies. :crying: :crying:
Please post DDL and follow ANSI/ISO standards when asking for help.