June 3, 2010 at 3:03 am
Hi All,
I am trying to find a way to add into a table a flattened (comma seperated list) of email addresses based on the multiple columns of nformation in another table (joined by customer_full_name and postcode.
This is to highlight duplicate email addresses for people under the same customer_full_name and Postcode.
I have done this using a loop which loops through concatenating the email addresses but it takes 1minute to do 1000. The table is 19,000 so this isn't really acceptable. I have tried temp tables, table variables and none of this seems to make any difference. I think that it is becuase i am joining on text columns?
I know you guys are MASSIVE fans of the tally table, but i have not quite got my head around how i can adapt it to make it work for this.
any hints or tips would be massively appreciated.
Please find some sample data below.
MASSIVE thanks,
Matt
Create table #tempa
(
customer_Full_Name varchar(100),
Customer_Email varchar(100),
Postcode varchar(100),
AlternateEmail varchar(max)NULL
)
insert into #tempa (customer_full_name,customer_email,postcode)
select 'mark davis','mdavis@yahoo.co.uk','me18 7ed'
insert into #tempa (customer_full_name,customer_email,postcode)
select 'mark travis','mtravis@yahoo.co.uk','SE18 5et'
insert into #tempa (customer_full_name,customer_email,postcode)
select 'mike pervis','mpervis@hotmail.co.uk','SW1 5et'
insert into #tempa (customer_full_name,customer_email,postcode)
select 'Frank line','fline@Btconnect.co.uk','YR3 9RT'
Create table #tempb
(
Idcol int identity (1,1),
Customer_Full_Name varchar(100),
Customer_Email varchar(100),
Postcode varchar(100)
)
insert into #tempb (customer_full_name,customer_email,postcode)
select 'mark davis','mdavis@yahoo.co.uk','me18 7ed'
insert into #tempb (customer_full_name,customer_email,postcode)
select 'mark davis','mdavis@hotmail..co.uk','me18 7ed'
insert into #tempb (customer_full_name,customer_email,postcode)
select 'mark davis','mark.davis@Gmail..co.uk','me18 7ed'
insert into #tempb (customer_full_name,customer_email,postcode)
select 'mark davis','bigboymark@Gmail..co.uk','me18 7ed'
insert into #tempb (customer_full_name,customer_email,postcode)
select 'mike pervis','mark.pervis@hotmail.co.uk','SW1 5et'
insert into #tempb (customer_full_name,customer_email,postcode)
select 'mike pervis','markyp@Gmail.co.uk','SW1 5et'
June 3, 2010 at 4:19 am
Hey Matt,
So that's reasonably clear, but what output do you want? Do you want a list of matches between the two tables on email address AND postcode, where there is more than one match? Do you want to update the alternate_email column with the comma-separated list of email addresses...or something else?
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
June 3, 2010 at 4:24 am
Hi Paul,
Thanks for the response.
what i am hoping for is to update the alternate email culumn with all of the email addresses that we have listed for that customer i.e.
Mark.davis@gmail.co.uk, Mdavis @hotmail.co.uk, mdavis@yahoo.co.uk
I have this at the moment which is grouping the customers together and ranking them, but i cant figure out (wihout a loop!) how to get this flattened down into the comma seperated list!
with cte as (
select
ROW_NUMBER() over(PARTITION BY b.customer_full_name order by b.customer_full_name)idcol,
b.customer_full_name,b.customer_email
from #tempb b join #tempa a on a.customer_full_name = b.customer_full_name
group by b.customer_full_name,b.customer_email
)
ANY help is very much appreciated.
Thanks
Matt
June 3, 2010 at 5:02 am
Ok, I need to do some stuff with a hash and an index to make this fly for larger sets, but as a demonstration only:
Setup script (slightly modified)
CREATE TABLE #tempa
(
customer_full_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(100) NOT NULL,
postcode VARCHAR(100) NOT NULL,
alternateemail VARCHAR(MAX) NULL,
);
GO
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('mark travis', 'mtravis@yahoo.co.uk', 'SE18 5et');
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('mike pervis', 'mpervis@hotmail.co.uk', 'SW1 5et');
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('Frank line', 'fline@Btconnect.co.uk', 'YR3 9RT');
GO
CREATE TABLE #tempb
(
idcol INTEGER IDENTITY(1,1) PRIMARY KEY,
customer_full_name VARCHAR(100),
customer_email VARCHAR(100),
postcode VARCHAR(100),
);
GO
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mdavis@hotmail..co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mark.davis@Gmail..co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'bigboymark@Gmail..co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mike pervis', 'mark.pervis@hotmail.co.uk', 'SW1 5et');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mike pervis', 'markyp@Gmail.co.uk', 'SW1 5et');
GO
UPDATE A
SET alternateemail = ''
FROM #tempa A
Proposed solution:
UPDATE UpdateSet
SET alternateemail = csv
FROM (
SELECT A.alternateemail,
csv = STUFF(EmailList.csv, 1, 1, SPACE(0))
FROM #tempa A
CROSS
APPLY (
SELECT ',' + B.customer_email
FROM #tempb B
WHERE B.postcode = A.postcode
AND B.customer_full_name = A.customer_full_name
ORDER BY B.idcol
FOR XML PATH(''), TYPE
) Concatenated (xml_emails)
CROSS
APPLY (
SELECT Concatenated.xml_emails.value('./text()[1]', 'VARCHAR(MAX)')
) EmailList (csv)
) UpdateSet
WHERE csv IS NOT NULL;
GO
DROP TABLE #tempa, #tempb
Could you verify that it produces the results you want?
Paul
edit: added an ORDER BY to the FOR XML PATH, just cos I can 🙂
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
June 3, 2010 at 5:11 am
Alternative Solution
UPDATE #TempA
SET AlternateEmail=
LEFT(B.ALT,LEN(B.ALT)-1)
FROM
(
SELECT Customer_Full_Name,
CASE WHEN ISNULL([1],'')='' THEN ''
ELSE [1]+',' END
+CASE WHEN ISNULL([2],'')='' THEN ''
ELSE [2]+',' END
+CASE WHEN ISNULL([3],'')='' THEN ''
ELSE [3]+',' END
+CASE WHEN ISNULL([4],'')='' THEN ''
ELSE [4]+',' END
+CASE WHEN ISNULL([5],'')='' THEN ''
ELSE [5]+',' END
AS ALT, PostCode
FROM
( SELECT ROW_NUMBER() OVER(PARTITION BY Customer_Full_Name ORDER BY Customer_Full_Name)Idcol
,Customer_Full_Name,Customer_Email,PostCode FROM #tempb) P
PIVOT
(
MAX(Customer_Email)
FOR Idcol
IN ([1],[2],[3],[4],[5])
) PVT
) B
INNER JOIN #Tempa
ON B.Customer_Full_Name=#Tempa.Customer_Full_Name
AND B.PostCode=#Tempa.PostCode
SELECT * FROM #TempA
June 3, 2010 at 5:39 am
Hey guys,
Thanks to you both for replying and helping with the solution to this. They both work PERFECTLY :-D:-D
This is MASIVELY appreciated. Thank you SO much.
Matt
June 3, 2010 at 5:44 am
Gopi,
That's good, but I don't think we can safely hard-code the IdCol values like that. I'm pretty sure there are more than 5 records...
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
June 3, 2010 at 5:57 am
Here's the slightly more optimised version I promised.
It uses an integer hash of the customer_full_name and postcode to quickly find potential matches without huge string matching.
Only potential matches are then compared string-to-string.
Setup: (hash computed column and indexes added)
CREATE TABLE #tempa
(
customer_full_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(100) NOT NULL,
postcode VARCHAR(100) NOT NULL,
alternateemail VARCHAR(MAX) NULL,
hash_pc_cfn AS CHECKSUM(postcode, customer_full_name),
);
GO
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('mark travis', 'mtravis@yahoo.co.uk', 'SE18 5et');
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('mike pervis', 'mpervis@hotmail.co.uk', 'SW1 5et');
INSERT #tempa (customer_full_name,customer_email,postcode)
VALUES ('Frank line', 'fline@Btconnect.co.uk', 'YR3 9RT');
GO
CREATE TABLE #tempb
(
idcol INTEGER IDENTITY(1,1) PRIMARY KEY,
customer_full_name VARCHAR(100),
customer_email VARCHAR(100),
postcode VARCHAR(100),
hash_pc_cfn AS CHECKSUM(postcode, customer_full_name),
);
GO
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mdavis@hotmail..co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'mark.davis@Gmail..co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mark davis', 'bigboymark@Gmail..co.uk', 'me18 7ed');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mike pervis', 'mark.pervis@hotmail.co.uk', 'SW1 5et');
INSERT #tempb (customer_full_name,customer_email,postcode)
VALUES ('mike pervis', 'markyp@Gmail.co.uk', 'SW1 5et');
GO
-- Hash indexes to speed string matching
CREATE INDEX [IX #tempa hash] ON #tempa (hash_pc_cfn) INCLUDE (postcode, customer_full_name);
CREATE INDEX [IX #tempb hash] ON #tempb (hash_pc_cfn) INCLUDE (postcode, customer_full_name, customer_email);;
GO
Solution:
UPDATE UpdateSet
SET alternateemail = csv
FROM (
SELECT A.alternateemail,
csv = STUFF(EmailList.csv, 1, 1, SPACE(0))
FROM #tempa A
CROSS
APPLY (
SELECT ',' + B.customer_email
FROM #tempb B
WHERE B.hash_pc_cfn = A.hash_pc_cfn
AND B.postcode = A.postcode
AND B.customer_full_name = A.customer_full_name
ORDER BY B.idcol
FOR XML PATH(''), TYPE
) Concatenated (xml_emails)
CROSS
APPLY (
SELECT Concatenated.xml_emails.value('./text()[1]', 'VARCHAR(MAX)')
) EmailList (csv)
WHERE Concatenated.xml_emails IS NOT NULL
) UpdateSet;
GO
SELECT customer_full_name, customer_email, postcode, alternateemail
FROM #tempa;
GO
DROP TABLE #tempa, #tempb
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
June 3, 2010 at 6:28 am
That's good, but I don't think we can safely hard-code the IdCol values like that. I'm pretty sure there are more than 5 records...
Agreed, Alternatively we can find maximum Emails for any customer and we can use that value, I think CROSS JOIN on huge volumes may degrade performace
June 3, 2010 at 8:47 am
[font="Courier New"]UPDATE #tempa
SET alternateemail=emails.AlternateEmails
FROM #tempa
INNER JOIN (SELECT
customer_full_name, postcode,
SUBSTRING(COALESCE((SELECT ','+customer_email
FROM
#tempb b
WHERE
a.customer_full_name=b.customer_full_name
AND a.postcode=b.postcode
FOR XML PATH('')
), ','), 2, 8000) AS alternateEmails
FROM
#tempa a
) emails
ON #tempa.customer_full_name=emails.customer_full_name
AND #tempa.postcode=emails.postcode[/font]
Best wishes,
Phil Factor
June 4, 2010 at 8:02 am
Thanks again for these guys.
After lots of testing against the data that i have (around 19,000 rows) it turned out that the best way to do it was using the PIVOT. This was by far the quickest.
It is still massively appreciated to have all your solutions though. They all worked perfectly for what i wanted to do. The volume of data that i was using made the difference though.
Thanks again,
Best regards,
Matt
June 4, 2010 at 8:11 am
Thank you for proving my argument:-)
June 4, 2010 at 7:46 pm
Gopi Muluka (6/3/2010)
I think CROSS JOIN on huge volumes may degrade performace
:unsure: What cross join?
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
June 4, 2010 at 9:17 pm
My bad, Its CROSS APPLY... on a side note even CROSS JOIN will fare better than CROSS APPLY
June 4, 2010 at 9:52 pm
Gopi Muluka (6/4/2010)
My bad, Its CROSS APPLY... on a side note even CROSS JOIN will fare better than CROSS APPLY
Why do you say that? Another mistake?
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply