January 9, 2016 at 6:32 pm
Hello,
I want to join 2 tables, #test_table_1 with #test_table_2. Then I want to compare EmailAddress field on #test_table_1 with the following fields on #test_table_2, EmailAddress, Email1, and Email2.
The join condition from #test_table_1 should compare against these 3 fields, and if any of the fields match, return only the matching row from #test_table_1
So the result should be:
EmailAddress FirstName LastName
c@b.com Jack Quick
Below is some sample data to assist. Naturally there will be many rows in each table.
create table #test_table_1
(EmailAddress varchar(100),FirstName varchar(100),LastName varchar(100))
create table #test_table_2
(EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100))
insert into #test_table_1
select 'c@b.com' as EmailAddress, 'Jack' as FirstName, 'Quick' as LastName
insert into #test_table_2
select 'a@b.com' as EmailAddress, 'b@b.com' as Email1, 'c@b.com' as Email2
HTH. Thanks in advance..
January 9, 2016 at 6:50 pm
create this query:
SELECT EmailAddress, EMail1 AS Mail1
FROM #Test_table_2
UNION ALL
SELECT EmailAddress, EMail2
FROM #Test_table_2
then join that to whatever other table you need.
Normalization is a beautiful thing.
January 9, 2016 at 8:12 pm
Sorry that doesn't really help me. Can you put the complete query please?
Data won't be identical in both tables, that's why I'm trying to compare and identify when a match occurs.
January 9, 2016 at 8:25 pm
Scratch all that.
You have a basic normalization problem. If you had a single column/field in your table for e-mail address, then it would probably be a trivial question. So you have two options: either fix the table structure, or create a UNION query to return something that looks like a normal table structure. Then you can join your two tables on a single column and you're home free.
Querying tables that aren't normalized is a serious nightmare. I would fix that first.
January 9, 2016 at 8:43 pm
Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.
January 9, 2016 at 10:24 pm
Quick suggestion, unpivot the second table and join the results to the first table.
😎
create table #test_table_2
(EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100))
INSERT INTO #test_table_1
SELECT 'c@b.com' as EmailAddress, 'Jack' as FirstName, 'Quick' as LastName
insert into #test_table_2
select 'a@b.com' as EmailAddress, 'b@b.com' as Email1, 'c@b.com' as Email2
SELECT
X.CID
,X.EMAIL
,T1.FirstName
,T1.LastName
FROM #test_table_2 T2
CROSS APPLY
(
SELECT 1, T2.EmailAddress UNION ALL
SELECT 2, T2.Email1 UNION ALL
SELECT 3, T2.Email2
)AS X(CID,EMAIL)
INNER JOIN #test_table_1 T1
ON X.EMAIL = T1.EmailAddress;
Output
CID EMAIL FirstName LastName
----- --------- ----------- ---------
3 c@b.com Jack Quick
January 10, 2016 at 5:11 am
VegasL (1/9/2016)
Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.
If this is coming from some impoprted source, then you can normalize it. Just load the data as is in a staging table, then validate and massage until it is in the right shape. (Preferably load SSIS for this, that tool is much better equipped for the job).
If the database design is fixed and you cannot influence it, then I pity you. For this specific problem you can probably use something liek the below (though I would not expect too much in the performance department);
SELECT t1.EmailAdress, t1.FirstName, t1.LastName
FROM #test_table1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table2 AS t2
WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);
But with tables designed this way, you can expect to keep running into problems that require clumsy workarounds in query, and that wreck performance.
January 10, 2016 at 6:09 am
Hugo Kornelis (1/10/2016)
VegasL (1/9/2016)
Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.If this is coming from some impoprted source, then you can normalize it. Just load the data as is in a staging table, then validate and massage until it is in the right shape. (Preferably load SSIS for this, that tool is much better equipped for the job).
If the database design is fixed and you cannot influence it, then I pity you. For this specific problem you can probably use something liek the below (though I would not expect too much in the performance department);
SELECT t1.EmailAdress, t1.FirstName, t1.LastName
FROM #test_table1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table2 AS t2
WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);
But with tables designed this way, you can expect to keep running into problems that require clumsy workarounds in query, and that wreck performance.
Be very careful here as this will perform very badly, in fact it will scan the second table for every row in the first table. As it works fine with very small number of rows this has the potentials of being a database time bomb!
😎
To demonstrate the difference, here is a simple test harness
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID(N'tempdb..#test_table_1') IS NOT NULL DROP TABLE #test_table_1;
create table #test_table_1
(X_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,EmailAddress varchar(100),FirstName varchar(100),LastName varchar(100));
IF OBJECT_ID(N'tempdb..#test_table_2') IS NOT NULL DROP TABLE #test_table_2;
create table #test_table_2
(X_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100));
DECLARE @SAMPLE_SIZE INT = 10000;
DECLARE @VARIANCE INT = 5000;
;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6)
INSERT INTO #test_table_1 (EmailAddress,FirstName,LastName)
SELECT
CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'
,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0)
,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0)
FROM NUMS NM
;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6)
insert into #test_table_2 (EmailAddress,Email1,Email2)
SELECT
CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'
,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'
,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'
FROM NUMS NM;
CREATE NONCLUSTERED INDEX NCLIDX_TMP_TEST_1_EMAIL_INCL_FIRST_LAST ON #test_table_1 (EmailAddress ASC) INCLUDE (FirstName,LastName);
CREATE NONCLUSTERED INDEX NCLIDX_TMP_TEST_1_EMAIL_1 ON #test_table_2 (EmailAddress ASC) INCLUDE (Email1,Email2);
DECLARE @timer TABLE (T_TEXT VARCHAR(20) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));
DECLARE @VARCHAR_BUCKET_01 VARCHAR(100) = '';
DECLARE @VARCHAR_BUCKET_02 VARCHAR(100) = '';
DECLARE @VARCHAR_BUCKET_03 VARCHAR(100) = '';
INSERT INTO @timer(T_TEXT) VALUES('UNPIVOT');
SELECT
@VARCHAR_BUCKET_01 = X.EMAIL
,@VARCHAR_BUCKET_02 = T1.FirstName
,@VARCHAR_BUCKET_03 = T1.LastName
FROM #test_table_2 T2
CROSS APPLY
(
SELECT T2.EmailAddress UNION ALL
SELECT T2.Email1 UNION ALL
SELECT T2.Email2
)AS X(EMAIL)
INNER JOIN #test_table_1 T1
ON X.EMAIL = T1.EmailAddress;
INSERT INTO @timer(T_TEXT) VALUES('UNPIVOT');
INSERT INTO @timer(T_TEXT) VALUES('WHERE EXISTS');
SELECT
@VARCHAR_BUCKET_01 = t1.EmailAddress
,@VARCHAR_BUCKET_02 = t1.FirstName
,@VARCHAR_BUCKET_03 = t1.LastName
FROM #test_table_1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table_2 AS t2
WHERE t1.EmailAddress IN (t2.EmailAddress, t2.Email1, t2.Email2));
INSERT INTO @timer(T_TEXT) VALUES('WHERE EXISTS');
SELECT
T.T_TEXT
,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION
FROM @timer T
GROUP BY T.T_TEXT
ORDER BY DURATION;
Results (2nd gen i5 laptop)
T_TEXT DURATION
-------------------- -----------
UNPIVOT 53003
WHERE EXISTS 4352242
January 10, 2016 at 6:27 am
I'm pretty sure I already mentioned that the performance will be dramatic.
Also, if the table grows significantly I would index all three columns. I expect a plan with three seeks per row of the first table in that case. Still not well-performing, but that is almost always impossible to achieve when you have to work with unnormalized data.
January 10, 2016 at 6:46 am
Thank Eirikur Eiriksson & Hugo & everyone else..appreciate you're help. Performance at this stage is a non issue.
Hugo, I got Incorrect syntax near ';' when I tried:
SELECT t1.EmailAdress, t1.FirstName, t1.LastName
FROM #test_table1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table2 AS t2
WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);
January 10, 2016 at 6:49 am
Hugo Kornelis (1/10/2016)
I'm pretty sure I already mentioned that the performance will be dramatic.Also, if the table grows significantly I would index all three columns. I expect a plan with three seeks per row of the first table in that case. Still not well-performing, but that is almost always impossible to achieve when you have to work with unnormalized data.
You did mention the poor performance but I felt the warning wasn't strong enough.;-)
😎
The unpivot method on a set of 1,000,000 rows with a match ratio of 1:2 returned in less than 5 seconds, I killed the WHERE EXISTS after it had been running for 15 minutes.
Adding a separate index on all three columns is unlikely to help as the server will most likely use a full scan on the second table and a lazy spool into the outer part of a nested loop join to the first table, as single compound index of all three columns is more likely to be used but doesn't help the performance though.
January 10, 2016 at 6:53 am
VegasL (1/10/2016)
Thank Hugo & everyone else..appreciate you're help. Performance at this stage is a non issue.Hugo, I got Incorrect syntax near ';' when I tried:
SELECT t1.EmailAdress, t1.FirstName, t1.LastName
FROM #test_table1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table2 AS t2
WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);
DON'T USE THIS METHOD!
😎
Why use code that can and potentially will cause problems when you have the opportunity to do things properly in the first place? Just does not make sense.
January 10, 2016 at 7:05 am
VegasL (1/10/2016)
Hugo, I got Incorrect syntax near ';' when I tried:
The syntax error is a missing closing parenthesis.
However, Eirikur is right: his method performs better (or probably I should say less bad) then the one I posted. I just did a few tests based on his feedback and test code and even adding the indexes I hoped would help did not help. At all.
I posted my version because I think that that code is clearer and easier to understand and maintain, but I must now ask you to please only use it if you are very sure that the amount of data will never be more than a few handful of rows. If there is even the remotest chance that the tables will grow to a significant size, then please use the code Eirikur posted.
January 10, 2016 at 7:10 am
Piling on, if for some reasons you don't want to use the unpivoting method then using three separate WHERE IN clauses is a much better option, it will use indices on the three columns and it will only scan them once. Although it's slower than the unpivot by the factor of 2, it's still much much faster than the WHERE EXISTS method.
😎
SELECT
t1.EmailAddress
,t1.FirstName
,t1.LastName
FROM #test_table_1 AS t1
WHERE EmailAddress IN
(SELECT t2.EmailAddress
FROM #test_table_2 AS t2)
OR EmailAddress IN
(SELECT t2.Email1
FROM #test_table_2 AS t2)
OR EmailAddress IN
(SELECT t2.Email2
FROM #test_table_2 AS t2);
January 10, 2016 at 9:29 pm
Thanks Hugo & Eirikur Eiriksson again. The query below seems a bit easier to understand. Is there anyway to join tables T1 with T2, where EmailAddress from T1 is joined to either t2.emailaddress,t2.email1, or t2.email3 ? I want to be able to at same time put values in initial select query, what has values from t2, using the where in /or subquery logic you have below.
SELECT
t1.EmailAddress
,t1.FirstName
,t1.LastName
FROM #test_table_1 AS t1
WHERE EmailAddress IN
(SELECT t2.EmailAddress
FROM #test_table_2 AS t2)
OR EmailAddress IN
(SELECT t2.Email1
FROM #test_table_2 AS t2)
OR EmailAddress IN
(SELECT t2.Email2
FROM #test_table_2 AS t2);
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply