Complicated Deduping

Question

Post reply

Complicated Deduping

Matt Rigbye-353351

SSC Veteran

Points: 267
More actions
August 2, 2007 at 3:04 am

#179930

Hi all - I have a tricky problem that I'm trying to solve, and not really getting anywhere. I'm hoping someone might have some ideas.
Basically we receive data from a call centre that is quite messy and completely unnormalised (as I'm sure is usual) and I'm trying to make some use out of it. Here is a simplified form of the data:
ID CustID PhoneID
1   1        10
2    2        10
3 3        11
4    1        12
5 2        12
6    4        13
7   5        11
8    5        15
9    5        18
Some Customers are given multiple CustIDs, and multiple PhoneIDs, so I'm trying to collapse the data on both these fields to identify individual customers.
In the above example, row IDs (1, 2, 4, 5), (3, 7, 8, 9) and (6) would constitute 3 individual customers after grouping by both CustID and PhoneID
Now my question is does anyone know a way to do this with a sql script? I've tried doing an inner join on CustID OR PhoneID, but that obviously doesn't work. I can't think of any other apporaches other than going through it line by line, which I'd rather avoid given that the table has nearly a million records.
Any help or pointers or even just thoughts would be much appreciated, thanks.
Cheers,
Matt

Viewing 11 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

David Burrows SSC Guru Points: 65144 More actions · Answer 1

There must be a smart way of doing this but I just cannot think of it

I tried this

SELECT A,MIN(B) AS FROM (

SELECT a.[ID] AS [A],b.[ID] AS

FROM a

INNER JOIN b on b.[CustID] = a.[CustID]

UNION

SELECT a.[ID],b.[ID]

FROM a

INNER JOIN b on b.[PhoneID] = a.[PhoneID]

UNION

SELECT a.[ID],c.[ID]

FROM a

INNER JOIN b on b.[CustID] = a.[CustID]

INNER JOIN c on c.[PhoneID] = b.[PhoneID]

UNION

SELECT a.[ID],c.[ID]

FROM a

INNER JOIN b on b.[PhoneID] = a.[PhoneID]

INNER JOIN c on c.[CustID] = b.[CustID]

) x GROUP BY A

ORDER BY B

which gave the grouping you wanted but I suspect performance will be attrocious

Far away is close at hand in the images of elsewhere.
Anon.

steve smith-401573 SSChampion Points: 10889 More actions · Answer 2

What is the end result you seek? A single line with counts for each customer? All the detail captured in the database, ordered by customer (= customer + phone)?

I've had to dabble in call center data, but it's been a couple of years, and I have no clue what you are trying to accomplish with the solution to your problem. Perhaps a little more context or sample output could give us a hint.

(I'm thinking nested GROUP BY in selection, if it's permissible to summarize)

Matt Rigbye-353351 SSC Veteran Points: 267 More actions · Answer 3

Hey guys

Thanks for your input so far. Sorry i thought I made it clearer in my initial post, but I'll explain it a bit further.

Basically in the output I want to know which customers are actually the same people. Obviously every record with CustID 1 is the same person, but we're also counting everyone with the same PhoneID as the same person too. Thus in many cases, the same customer will have multiple CustIDs, and often multiple PhoneIDs also.

So in this case, there should be 3 unique customers as follows:

ID CustID PhoneID ActualCust

1 1 10 A

2 2 10 A

3 3 11 B

4 1 12 A

5 2 12 A

6 4 13 C

7 5 11 B

8 5 15 B

9 5 18 B

I'm not to fussed about the actual output (more IDs or whatever), I just need ot be able to uniquely identify which rows belong to the same Customer.

I hope things makes things clearer. Thanks again for your help on this. Please let me know if you have any further questions.

Cheers,

Matt

SwePeso SSC-Dedicated Points: 39758 More actions · Answer 4

Also asked here http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=87234

N 56°04'39.16"
E 12°55'05.25"

aureolin Hall of Fame Points: 3521 More actions · Answer 5

One suggestion - I'd start by building a reference table that identifies custid, phoneid combinations that refer to a specific customer, then use that in a join to locate calls by customer. Basically, it's faster to look something up than code a complicated rule! 😉

Steve G.

steve smith-401573 SSChampion Points: 10889 More actions · Answer 6

One solution: build a table that will map customer 'A' to all the appropriate phone or cust ids. Then use that relationship coding to summarize in the way you want.

For instance, taking your initial data:

ID CustID PhoneID ActualCust

1 1 10 A

2 2 10 A

3 3 11 B

4 1 12 A

5 2 12 A

6 4 13 C

7 5 11 B

8 5 15 B

9 5 18 B

The ActualCust column has to be derived by the appropriate business rules. As I understand it, you have framed the rules as follows:

1. If phoneid = previous phoneid in same import table, then customer is same.

2. If custid = previous custid in same import table, then customer is same.

3. First customer record becomes 'anchor' for identifying customer. Whatever unique identifier is assigned to first record found is then assigned to all subsequent records for this customer.

So far, everything i've envisioned requires a (horrors) cursor. I'm still working on how to build a set-based solution.

Eric SSCommitted Points: 1803 More actions · Answer 7

You can create a subquery to pair the lowest number CustID to all CustID values from 2 to N where there is more than one row for any PhoneID. I called this OrigCustID. It will give you a single value to identify each person.

For the sample data it would look like this:

OrigCustID CustID

1 2

3 5

CREATE TABLE #CustomerCalls (

ID int,

CustID INT,

PhoneID int,

ActualCust char(1))

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (1,1,10,'A')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (2,2,10,'A')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (3,3,11,'B')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (4,1,12,'A')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (5,2,12,'A')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (6,4,13,'C')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (7,5,11,'B')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (8,5,15,'B')

insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (9,5,18,'B')

SELECT

c.ID, COALESCE(orig.OrigCustID,c.CustID) AS OrigCustID,

c.CustID, c.PhoneID, c.ActualCust

FROM #CustomerCalls c

LEFT OUTER JOIN -- inner table that maps 2nd to Nth CustID to OrigCustID

(SELECT t2.CustID AS OrigCustID, t1.CustID

FROM #CustomerCalls t1

INNER JOIN (SELECT MIN(CustID) AS CustID, PhoneID FROM #CustomerCalls GROUP BY PhoneID) t2 -- the lowest number CustID

ON t1.phoneID = t2.PhoneID

WHERE t1.CustID > t2.CustID -- only gives rows 2nd to Nth CustID

GROUP BY t2.CustID, t1.CustID -- only 1 row for CustID to Orig CustID mapping

) orig

ON c.custID = orig.custID

ORDER BY c.ID

-- drop table #CustomerCalls

If you don’t need to see the multiple CustID values you might use this to load a second table that just has and is grouped by OrigCustID and PhoneID. The rowcount would be reduced.

Kenneth.Fisher SSC-Insane Points: 20406 More actions · Answer 8

Here is another way. I'm not sure how efficient it is but it does cover your example and a few others I thought of.

Basically I'm looping through the table setting a parentid for each row. Then I'm using that parentid if I can and if not I use the id to create a new parentid. All you have to do then is count the parent ids.

CREATE

TABLE #CustomerCalls (

ID

int,

CustID

INT,

PhoneID

int,

ActualCust

char(1),

ParentId

Int)

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (1,1,10,'A')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (2,2,10,'A')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (3,3,11,'B')

--insert into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (4,1,12,'A')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (5,2,12,'A')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (6,4,13,'C')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (7,5,11,'B')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (8,5,15,'B')

insert

into #CustomerCalls (ID, CustID, PhoneID, ActualCust) values (9,5,18,'B')

DECLARE

@Id Int, @CustId Int, @PhoneId Int, @ParentId Int

SELECT

@Id = MIN(Id), @ParentId = MIN(Id) FROM #CustomerCalls

WHILE

@Id IS NOT NULL

BEGIN

SELECT @CustId = CustId, @PhoneId = PhoneId, @ParentId = ISNULL(ParentId, Id)

FROM #CustomerCalls WHERE Id = @id

UPDATE #CustomerCalls SET ParentId = @ParentId

FROM #CustomerCalls

WHERE (CustId = @CustId OR PhoneId = @PhoneId)

AND (@ParentId > ParentId OR ParentId IS NULL)

SELECT @id = MIN(Id) FROM #CustomerCalls WHERE Id > @Id

END

SELECT

* FROM #CustomerCalls

Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

Koji Matsumura SSCrazy Points: 2241 More actions · Answer 9

DECLARE @T TABLE (ID int, CustID int, PhoneID int)

INSERT INTO @T SELECT 1, 1, 10

UNION SELECT 2, 2, 10

UNION SELECT 3, 3, 11

UNION SELECT 4, 1, 12

UNION SELECT 5, 2, 12

UNION SELECT 6, 4, 13

UNION SELECT 7, 5, 11

UNION SELECT 8, 5, 15

UNION SELECT 9, 5, 18

DECLARE @r TABLE (NewID int, ID int, CustID int, PhoneID int)

DECLARE @I int, @C int, @P int

DECLARE C CURSOR FOR

SELECT ID, CustID, PhoneID FROM @T ORDER BY ID

OPEN C

FETCH C INTO @I, @C, @P

WHILE @@FETCH_STATUS = 0

BEGIN

INSERT INTO @r

SELECT IsNull((SELECT MAX(NewID) FROM @r WHERE CustID = @C OR PhoneID = @P),

(SELECT IsNull(MAX(NewID), 0) + 1 FROM @r)),

@I, @C, @P

FETCH C INTO @I, @C, @P

END

CLOSE C

DEALLOCATE C

SELECT * FROM @r

Matt Rigbye-353351 SSC Veteran Points: 267 More actions · Answer 10

Hi everyone. Thanks for your help. Ideally I wanted to avoid using a cursor or a loop, as it would require multiple passes anyway. Below is the script I ended up using, Unfortunately it only works through three levels of CustID/PhoneID/CustID - creating further would have increased the script size exponentially, and it should get most matches this way anyway.

If anyone has any further comments or suggestions I'd still be very grateful to hear from you, thanks.

select a.ID, a.CustID, a.PhoneID,

case when c.ID <= p.ID then c.ID else p.ID end as ParentID

from dbo.tbCustomer a

inner join

(select a.CustID, min(case when b.ID <= c.ID then b.ID else c.ID end) as ID

from dbo.tbCustomer a

inner join (select CustID, min(ID) ID from dbo.tbCustomer group by CustID) b

on a.CustID = b.CustID

inner join (select PhoneID, min(ID) ID from dbo.tbCustomer group by PhoneID) c

on a.PhoneID = c.PhoneID

group by a.CustID) as c

on a.CustID = c.CustID

inner join

(select a.PhoneID, min(case when b.ID <= c.ID then b.ID else c.ID end) as ID

from dbo.tbCustomer a

inner join (select CustID, min(ID) ID from dbo.tbCustomer group by CustID) b

on a.CustID = b.CustID

inner join (select PhoneID, min(ID) ID from dbo.tbCustomer group by PhoneID) c

on a.PhoneID = c.PhoneID

group by a.PhoneID) as p

on a.PhoneID = p.PhoneID