Creating Comma seperated list of details from multiple columns

  • Hi All,

    I am trying to find a way to add into a table a flattened (comma seperated list) of email addresses based on the multiple columns of nformation in another table (joined by customer_full_name and postcode.

    This is to highlight duplicate email addresses for people under the same customer_full_name and Postcode.

    I have done this using a loop which loops through concatenating the email addresses but it takes 1minute to do 1000. The table is 19,000 so this isn't really acceptable. I have tried temp tables, table variables and none of this seems to make any difference. I think that it is becuase i am joining on text columns?

    I know you guys are MASSIVE fans of the tally table, but i have not quite got my head around how i can adapt it to make it work for this.

    any hints or tips would be massively appreciated.

    Please find some sample data below.

    MASSIVE thanks,

    Matt

    Create table #tempa

    (

    customer_Full_Name varchar(100),

    Customer_Email varchar(100),

    Postcode varchar(100),

    AlternateEmail varchar(max)NULL

    )

    insert into #tempa (customer_full_name,customer_email,postcode)

    select 'mark davis','mdavis@yahoo.co.uk','me18 7ed'

    insert into #tempa (customer_full_name,customer_email,postcode)

    select 'mark travis','mtravis@yahoo.co.uk','SE18 5et'

    insert into #tempa (customer_full_name,customer_email,postcode)

    select 'mike pervis','mpervis@hotmail.co.uk','SW1 5et'

    insert into #tempa (customer_full_name,customer_email,postcode)

    select 'Frank line','fline@Btconnect.co.uk','YR3 9RT'

    Create table #tempb

    (

    Idcol int identity (1,1),

    Customer_Full_Name varchar(100),

    Customer_Email varchar(100),

    Postcode varchar(100)

    )

    insert into #tempb (customer_full_name,customer_email,postcode)

    select 'mark davis','mdavis@yahoo.co.uk','me18 7ed'

    insert into #tempb (customer_full_name,customer_email,postcode)

    select 'mark davis','mdavis@hotmail..co.uk','me18 7ed'

    insert into #tempb (customer_full_name,customer_email,postcode)

    select 'mark davis','mark.davis@Gmail..co.uk','me18 7ed'

    insert into #tempb (customer_full_name,customer_email,postcode)

    select 'mark davis','bigboymark@Gmail..co.uk','me18 7ed'

    insert into #tempb (customer_full_name,customer_email,postcode)

    select 'mike pervis','mark.pervis@hotmail.co.uk','SW1 5et'

    insert into #tempb (customer_full_name,customer_email,postcode)

    select 'mike pervis','markyp@Gmail.co.uk','SW1 5et'

  • Hey Matt,

    So that's reasonably clear, but what output do you want? Do you want a list of matches between the two tables on email address AND postcode, where there is more than one match? Do you want to update the alternate_email column with the comma-separated list of email addresses...or something else?

    Paul

  • Hi Paul,

    Thanks for the response.

    what i am hoping for is to update the alternate email culumn with all of the email addresses that we have listed for that customer i.e.

    Mark.davis@gmail.co.uk, Mdavis @hotmail.co.uk, mdavis@yahoo.co.uk

    I have this at the moment which is grouping the customers together and ranking them, but i cant figure out (wihout a loop!) how to get this flattened down into the comma seperated list!

    with cte as (

    select

    ROW_NUMBER() over(PARTITION BY b.customer_full_name order by b.customer_full_name)idcol,

    b.customer_full_name,b.customer_email

    from #tempb b join #tempa a on a.customer_full_name = b.customer_full_name

    group by b.customer_full_name,b.customer_email

    )

    ANY help is very much appreciated.

    Thanks

    Matt

  • Ok, I need to do some stuff with a hash and an index to make this fly for larger sets, but as a demonstration only:

    Setup script (slightly modified)

    CREATE TABLE #tempa

    (

    customer_full_name VARCHAR(100) NOT NULL,

    customer_email VARCHAR(100) NOT NULL,

    postcode VARCHAR(100) NOT NULL,

    alternateemail VARCHAR(MAX) NULL,

    );

    GO

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('mark travis', 'mtravis@yahoo.co.uk', 'SE18 5et');

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('mike pervis', 'mpervis@hotmail.co.uk', 'SW1 5et');

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('Frank line', 'fline@Btconnect.co.uk', 'YR3 9RT');

    GO

    CREATE TABLE #tempb

    (

    idcol INTEGER IDENTITY(1,1) PRIMARY KEY,

    customer_full_name VARCHAR(100),

    customer_email VARCHAR(100),

    postcode VARCHAR(100),

    );

    GO

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mdavis@hotmail..co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mark.davis@Gmail..co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'bigboymark@Gmail..co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mike pervis', 'mark.pervis@hotmail.co.uk', 'SW1 5et');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mike pervis', 'markyp@Gmail.co.uk', 'SW1 5et');

    GO

    UPDATE A

    SET alternateemail = ''

    FROM #tempa A

    Proposed solution:

    UPDATE UpdateSet

    SET alternateemail = csv

    FROM (

    SELECT A.alternateemail,

    csv = STUFF(EmailList.csv, 1, 1, SPACE(0))

    FROM #tempa A

    CROSS

    APPLY (

    SELECT ',' + B.customer_email

    FROM #tempb B

    WHERE B.postcode = A.postcode

    AND B.customer_full_name = A.customer_full_name

    ORDER BY B.idcol

    FOR XML PATH(''), TYPE

    ) Concatenated (xml_emails)

    CROSS

    APPLY (

    SELECT Concatenated.xml_emails.value('./text()[1]', 'VARCHAR(MAX)')

    ) EmailList (csv)

    ) UpdateSet

    WHERE csv IS NOT NULL;

    GO

    DROP TABLE #tempa, #tempb

    Could you verify that it produces the results you want?

    Paul

    edit: added an ORDER BY to the FOR XML PATH, just cos I can 🙂

  • Alternative Solution

    UPDATE #TempA

    SET AlternateEmail=

    LEFT(B.ALT,LEN(B.ALT)-1)

    FROM

    (

    SELECT Customer_Full_Name,

    CASE WHEN ISNULL([1],'')='' THEN ''

    ELSE [1]+',' END

    +CASE WHEN ISNULL([2],'')='' THEN ''

    ELSE [2]+',' END

    +CASE WHEN ISNULL([3],'')='' THEN ''

    ELSE [3]+',' END

    +CASE WHEN ISNULL([4],'')='' THEN ''

    ELSE [4]+',' END

    +CASE WHEN ISNULL([5],'')='' THEN ''

    ELSE [5]+',' END

    AS ALT, PostCode

    FROM

    ( SELECT ROW_NUMBER() OVER(PARTITION BY Customer_Full_Name ORDER BY Customer_Full_Name)Idcol

    ,Customer_Full_Name,Customer_Email,PostCode FROM #tempb) P

    PIVOT

    (

    MAX(Customer_Email)

    FOR Idcol

    IN ([1],[2],[3],[4],[5])

    ) PVT

    ) B

    INNER JOIN #Tempa

    ON B.Customer_Full_Name=#Tempa.Customer_Full_Name

    AND B.PostCode=#Tempa.PostCode

    SELECT * FROM #TempA

  • Hey guys,

    Thanks to you both for replying and helping with the solution to this. They both work PERFECTLY :-D:-D

    This is MASIVELY appreciated. Thank you SO much.

    Matt

  • Gopi,

    That's good, but I don't think we can safely hard-code the IdCol values like that. I'm pretty sure there are more than 5 records...

    Paul

  • Here's the slightly more optimised version I promised.

    It uses an integer hash of the customer_full_name and postcode to quickly find potential matches without huge string matching.

    Only potential matches are then compared string-to-string.

    Setup: (hash computed column and indexes added)

    CREATE TABLE #tempa

    (

    customer_full_name VARCHAR(100) NOT NULL,

    customer_email VARCHAR(100) NOT NULL,

    postcode VARCHAR(100) NOT NULL,

    alternateemail VARCHAR(MAX) NULL,

    hash_pc_cfn AS CHECKSUM(postcode, customer_full_name),

    );

    GO

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('mark travis', 'mtravis@yahoo.co.uk', 'SE18 5et');

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('mike pervis', 'mpervis@hotmail.co.uk', 'SW1 5et');

    INSERT #tempa (customer_full_name,customer_email,postcode)

    VALUES ('Frank line', 'fline@Btconnect.co.uk', 'YR3 9RT');

    GO

    CREATE TABLE #tempb

    (

    idcol INTEGER IDENTITY(1,1) PRIMARY KEY,

    customer_full_name VARCHAR(100),

    customer_email VARCHAR(100),

    postcode VARCHAR(100),

    hash_pc_cfn AS CHECKSUM(postcode, customer_full_name),

    );

    GO

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mdavis@yahoo.co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mdavis@hotmail..co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'mark.davis@Gmail..co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mark davis', 'bigboymark@Gmail..co.uk', 'me18 7ed');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mike pervis', 'mark.pervis@hotmail.co.uk', 'SW1 5et');

    INSERT #tempb (customer_full_name,customer_email,postcode)

    VALUES ('mike pervis', 'markyp@Gmail.co.uk', 'SW1 5et');

    GO

    -- Hash indexes to speed string matching

    CREATE INDEX [IX #tempa hash] ON #tempa (hash_pc_cfn) INCLUDE (postcode, customer_full_name);

    CREATE INDEX [IX #tempb hash] ON #tempb (hash_pc_cfn) INCLUDE (postcode, customer_full_name, customer_email);;

    GO

    Solution:

    UPDATE UpdateSet

    SET alternateemail = csv

    FROM (

    SELECT A.alternateemail,

    csv = STUFF(EmailList.csv, 1, 1, SPACE(0))

    FROM #tempa A

    CROSS

    APPLY (

    SELECT ',' + B.customer_email

    FROM #tempb B

    WHERE B.hash_pc_cfn = A.hash_pc_cfn

    AND B.postcode = A.postcode

    AND B.customer_full_name = A.customer_full_name

    ORDER BY B.idcol

    FOR XML PATH(''), TYPE

    ) Concatenated (xml_emails)

    CROSS

    APPLY (

    SELECT Concatenated.xml_emails.value('./text()[1]', 'VARCHAR(MAX)')

    ) EmailList (csv)

    WHERE Concatenated.xml_emails IS NOT NULL

    ) UpdateSet;

    GO

    SELECT customer_full_name, customer_email, postcode, alternateemail

    FROM #tempa;

    GO

    DROP TABLE #tempa, #tempb

  • That's good, but I don't think we can safely hard-code the IdCol values like that. I'm pretty sure there are more than 5 records...

    Agreed, Alternatively we can find maximum Emails for any customer and we can use that value, I think CROSS JOIN on huge volumes may degrade performace

  • [font="Courier New"]UPDATE  #tempa

    SET  alternateemail=emails.AlternateEmails

    FROM  #tempa

      INNER JOIN (SELECT

                    customer_full_name, postcode,

                    SUBSTRING(COALESCE((SELECT ','+customer_email

                                        FROM

                                          #tempb b

                                        WHERE

                                          a.customer_full_name=b.customer_full_name

                                          AND a.postcode=b.postcode

                                        FOR XML PATH('')

                                       ), ','), 2, 8000) AS alternateEmails

                  FROM

                    #tempa a

                 ) emails

        ON #tempa.customer_full_name=emails.customer_full_name

           AND #tempa.postcode=emails.postcode[/font]

    Best wishes,
    Phil Factor

  • Thanks again for these guys.

    After lots of testing against the data that i have (around 19,000 rows) it turned out that the best way to do it was using the PIVOT. This was by far the quickest.

    It is still massively appreciated to have all your solutions though. They all worked perfectly for what i wanted to do. The volume of data that i was using made the difference though.

    Thanks again,

    Best regards,

    Matt

  • Thank you for proving my argument:-)

  • Gopi Muluka (6/3/2010)


    I think CROSS JOIN on huge volumes may degrade performace

    :unsure: What cross join?

  • My bad, Its CROSS APPLY... on a side note even CROSS JOIN will fare better than CROSS APPLY

  • Gopi Muluka (6/4/2010)


    My bad, Its CROSS APPLY... on a side note even CROSS JOIN will fare better than CROSS APPLY

    Why do you say that? Another mistake?

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply