Complicated Deduping

  • Hi all - I have a tricky problem that I'm trying to solve, and not really getting anywhere.  I'm hoping someone might have some ideas.

    Basically we receive data from a call centre that is quite messy and completely unnormalised (as I'm sure is usual) and I'm trying to make some use out of it.  Here is a simplified form of the data:

    ID CustID PhoneID

    1    1        10

    2    2        10

    3    3        11

    4    1        12

    5    2        12

    6    4        13

    7    5        11

    8    5        15

    9    5        18

    Some Customers are given multiple CustIDs, and multiple PhoneIDs, so I'm trying to collapse the data on both these fields to identify individual customers.

    In the above example, row IDs (1, 2, 4, 5), (3, 7, 8, 9) and (6) would constitute 3 individual customers after grouping by both CustID and PhoneID

    Now my question is does anyone know a way to do this with a sql script?  I've tried doing an inner join on CustID OR PhoneID, but that obviously doesn't work.  I can't think of any other apporaches other than going through it line by line, which I'd rather avoid given that the table has nearly a million records.

    Any help or pointers or even just thoughts would be much appreciated, thanks.

    Cheers,

    Matt

  • There must be a smart way of doing this but I just cannot think of it

    I tried this

    SELECT A,MIN(B) AS FROM (

    SELECT a.[ID] AS [A],b.[ID] AS

    FROM

    a

    INNER JOIN

    b on b.[CustID] = a.[CustID]

    UNION

    SELECT a.[ID],b.[ID]

    FROM

    a

    INNER JOIN

    b on b.[PhoneID] = a.[PhoneID]

    UNION

    SELECT a.[ID],c.[ID]

    FROM

    a

    INNER JOIN

    b on b.[CustID] = a.[CustID]

    INNER JOIN

    c on c.[PhoneID] = b.[PhoneID]

    UNION

    SELECT a.[ID],c.[ID]

    FROM

    a

    INNER JOIN

    b on b.[PhoneID] = a.[PhoneID]

    INNER JOIN

    c on c.[CustID] = b.[CustID]

    ) x GROUP BY A

    ORDER BY B

    which gave the grouping you wanted but I suspect performance will be attrocious

    Far away is close at hand in the images of elsewhere.
    Anon.

  • What is the end result you seek? A single line with counts for each customer? All the detail captured in the database, ordered by customer (= customer + phone)?

    I've had to dabble in call center data, but it's been a couple of years, and I have no clue what you are trying to accomplish with the solution to your problem. Perhaps a little more context or sample output could give us a hint.

    (I'm thinking nested GROUP BY in selection, if it's permissible to summarize)

  • Hey guys

    Thanks for your input so far. Sorry i thought I made it clearer in my initial post, but I'll explain it a bit further.

    Basically in the output I want to know which customers are actually the same people. Obviously every record with CustID 1 is the same person, but we're also counting everyone with the same PhoneID as the same person too. Thus in many cases, the same customer will have multiple CustIDs, and often multiple PhoneIDs also.

    So in this case, there should be 3 unique customers as follows:

    ID CustID PhoneID ActualCust

    1    1        10        A

    2    2        10        A

    3    3        11        B

    4    1        12        A

    5    2        12        A

    6    4        13        C

    7    5        11        B

    8    5        15        B

    9    5        18        B

    I'm not to fussed about the actual output (more IDs or whatever), I just need ot be able to uniquely identify which rows belong to the same Customer.

    I hope things makes things clearer. Thanks again for your help on this. Please let me know if you have any further questions.

    Cheers,

    Matt

  • Also asked here http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=87234

     


    N 56°04'39.16"
    E 12°55'05.25"

  • One suggestion - I'd start by building a reference table that identifies custid, phoneid combinations that refer to a specific customer, then use that in a join to locate calls by customer. Basically, it's faster to look something up than code a complicated rule! 😉

    Steve G.

  • One solution: build a table that will map customer 'A' to all the appropriate phone or cust ids. Then use that relationship coding to summarize in the way you want.

    For instance, taking your initial data:

    ID CustID PhoneID ActualCust

    1 1 10 A

    2 2 10 A

    3 3 11 B

    4 1 12 A

    5 2 12 A

    6 4 13 C

    7 5 11 B

    8 5 15 B

    9 5 18 B

    The ActualCust column has to be derived by the appropriate business rules. As I understand it, you have framed the rules as follows:

    1. If phoneid = previous phoneid in same import table, then customer is same.

    2. If custid = previous custid in same import table, then customer is same.

    3. First customer record becomes 'anchor' for identifying customer. Whatever unique identifier is assigned to first record found is then assigned to all subsequent records for this customer.

    So far, everything i've envisioned requires a (horrors) cursor. I'm still working on how to build a set-based solution.

  • You can create a subquery to pair the lowest number CustID to all CustID values from 2 to N where there is more than one row for any PhoneID. I called this OrigCustID. It will give you a single value to identify each person.

    For the sample data it would look like this:

    OrigCustID CustID

    1 2

    3 5

    CREATE TABLE #CustomerCalls (

    ID int,

    CustID INT,

    PhoneID int,

    ActualCust char(1))

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (1,1,10,'A')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (2,2,10,'A')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (3,3,11,'B')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (4,1,12,'A')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (5,2,12,'A')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (6,4,13,'C')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (7,5,11,'B')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (8,5,15,'B')

    insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (9,5,18,'B')

    SELECT

    c.ID, COALESCE(orig.OrigCustID,c.CustID) AS OrigCustID,

    c.CustID, c.PhoneID, c.ActualCust

    FROM #CustomerCalls c

    LEFT OUTER JOIN -- inner table that maps 2nd to Nth CustID to OrigCustID

    (SELECT t2.CustID AS OrigCustID, t1.CustID

    FROM #CustomerCalls t1

    INNER JOIN (SELECT MIN(CustID) AS CustID, PhoneID FROM #CustomerCalls GROUP BY PhoneID) t2 -- the lowest number CustID

    ON t1.phoneID = t2.PhoneID

    WHERE t1.CustID > t2.CustID -- only gives rows 2nd to Nth CustID

    GROUP BY t2.CustID, t1.CustID -- only 1 row for CustID to Orig CustID mapping

    ) orig

    ON c.custID = orig.custID

    ORDER BY c.ID

    -- drop table #CustomerCalls

    If you don’t need to see the multiple CustID values you might use this to load a second table that just has and is grouped by OrigCustID and PhoneID. The rowcount would be reduced.

  • Here is another way.  I'm not sure how efficient it is but it does cover your example and a few others I thought of.

    Basically I'm looping through the table setting a parentid for each row.  Then I'm using that parentid if I can and if not I use the id to create a new parentid.  All you have to do then is count the parent ids.

    CREATE

    TABLE #CustomerCalls (

    ID

    int,

    CustID

    INT,

    PhoneID

    int,

    ActualCust

    char(1),

    ParentId

    Int)

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (1,1,10,'A')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (2,2,10,'A')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (3,3,11,'B')

    --insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (4,1,12,'A')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (5,2,12,'A')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (6,4,13,'C')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (7,5,11,'B')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (8,5,15,'B')

    insert

    into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (9,5,18,'B')

    DECLARE

    @Id Int, @CustId Int, @PhoneId Int, @ParentId Int

    SELECT

    @Id = MIN(Id), @ParentId = MIN(Id) FROM #CustomerCalls

    WHILE

    @Id IS NOT NULL

    BEGIN

    SELECT @CustId = CustId, @PhoneId = PhoneId, @ParentId = ISNULL(ParentId, Id)

    FROM #CustomerCalls WHERE Id = @id

    UPDATE #CustomerCalls SET ParentId = @ParentId

    FROM #CustomerCalls

    WHERE (CustId = @CustId OR PhoneId = @PhoneId)

    AND (@ParentId > ParentId OR ParentId IS NULL)

    SELECT @id = MIN(Id) FROM #CustomerCalls WHERE Id > @Id

    END

    SELECT

    * FROM #CustomerCalls

     

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • DECLARE @T TABLE (ID int, CustID int, PhoneID int)

    INSERT INTO @T SELECT 1, 1, 10

    UNION SELECT 2, 2, 10

    UNION SELECT 3, 3, 11

    UNION SELECT 4, 1, 12

    UNION SELECT 5, 2, 12

    UNION SELECT 6, 4, 13

    UNION SELECT 7, 5, 11

    UNION SELECT 8, 5, 15

    UNION SELECT 9, 5, 18

    DECLARE @r TABLE (NewID int, ID int, CustID int, PhoneID int)

    DECLARE @I int, @C int, @P int

    DECLARE C CURSOR FOR

    SELECT ID, CustID, PhoneID FROM @T ORDER BY ID

    OPEN C

    FETCH C INTO @I, @C, @P

    WHILE @@FETCH_STATUS = 0

    BEGIN

    INSERT INTO @r

    SELECT IsNull((SELECT MAX(NewID) FROM @r WHERE CustID = @C OR PhoneID = @P),

    (SELECT IsNull(MAX(NewID), 0) + 1 FROM @r)),

    @I, @C, @P

    FETCH C INTO @I, @C, @P

    END

    CLOSE C

    DEALLOCATE C

    SELECT * FROM @r

  • Hi everyone.  Thanks for your help.  Ideally I wanted to avoid using a cursor or a loop, as it would require multiple passes anyway.  Below is the script I ended up using,  Unfortunately it only works through three levels of CustID/PhoneID/CustID - creating further would have increased the script size exponentially, and it should get most matches this way anyway.

    If anyone has any further comments or suggestions I'd still be very grateful to hear from you, thanks.

    select a.ID, a.CustID, a.PhoneID,

    case when c.ID <= p.ID then c.ID else p.ID end as ParentID

    from dbo.tbCustomer a

        inner join

            (select a.CustID, min(case when b.ID <= c.ID then b.ID else c.ID end) as ID

            from dbo.tbCustomer a

                inner join (select CustID, min(ID) ID from dbo.tbCustomer group by CustID) b

                    on a.CustID = b.CustID

                inner join (select PhoneID, min(ID) ID from dbo.tbCustomer group by PhoneID) c

                    on a.PhoneID = c.PhoneID

            group by a.CustID) as c

        on a.CustID = c.CustID

        inner join

            (select a.PhoneID, min(case when b.ID <= c.ID then b.ID else c.ID end) as ID

            from dbo.tbCustomer a

                inner join (select CustID, min(ID) ID from dbo.tbCustomer group by CustID) b

                    on a.CustID = b.CustID

                inner join (select PhoneID, min(ID) ID from dbo.tbCustomer group by PhoneID) c

                    on a.PhoneID = c.PhoneID

            group by a.PhoneID) as p

        on a.PhoneID = p.PhoneID

     

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply