July 11, 2011 at 12:37 am
Here is the scenario, we have a Customer DB that has duplicate Customers with different CUSTID values. We plan on keeping the oldest CUSTOMER Record and merge any newer records into the Oldest one and cancel the newer records:
tblDUPS:
CUSTNAME CUSTID CUSTID2 MatchField
Richard Smith 101 102 NAME
Dick Smith 102 105 NAME
Robert Bug 103 104 CITY
Dickie Smith 105 106 CITY
The objective is relate all three rows to the smallest CUSTID and PIVOT the MatchField values:
RESULT SET
CUSTID1 CUSTID2 NAME DLN CITY
101 102 Y N N
101 105 N Y N
101 106 N N Y
103 104 N N Y
In the example above, I have already identified the duplicate rows in the the Customer Table and populated table tblDUPS. I'm just having mental block on rendering the last result set. There is no number to the amount of duplicates that can be in the table. In the example above CUSTID's 101,102,105, and 106 all belong to the same customer. Eventually, we will roll all purchase history into the smallest (oldest) CUSTID's (101,103) and Cancel the high CUSTID's (102,104,105,106).
Any idea's?
July 11, 2011 at 1:26 am
Hi hawaiian,
Can you try following code
select
isnull(tt.custid, t.custid) custid,
t.custid2,
NAME = CASE when t.MatchField = 'NAME' then 'Y' else 'N' end,
DLN = CASE when t.MatchField = 'CITY' then 'Y' else 'N' end,
CITY = CASE when t.MatchField = 'CITY' then 'Y' else 'N' end
from tblDUPS t
left join tblDUPS tt on t.custid = tt.custid2
Sorry for the DLN column, I have no idea how to populate it
July 11, 2011 at 4:20 pm
Provide DDL, DML for sample data and expected results relative to the sample data please.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
July 14, 2011 at 1:02 am
Below is the DDL. I should explain that the matched field column means the to customers matched on either their Drivers License Number (DLN); Last 4 of their Social Security Number (SSN); or some combination of their name.
You will notice after the table build out that Customer 102 matches to customer 105 based on a Name Combination Match. Customer 105 also matches to customer 106 on a Name Combination Match. The Name Combination Match query compares the First letter of the first name and some combination of the Last Name.
Here is a real life example:
Customer ID First Name Last Name
102 Mary Elizabeth Del La Rosa
105 Mary Rosa
106 M Elizabeth Rosa
The customer table is compared to itsself to find duplicates, but to keep the customer ID from showig up on both sides of the results there is a WHERE clause that states WHERE CustID1 < CustID2. So the problem is 102 will relate to 105 on a name match and 105 will relate to 106 on a name match. The solution should:
Origianl Data
101102DLN
101103SSN
102105NAME
105106NAME
Pass1
Relate 101 to 102
Relate 101 to 103
Relate 101 to 105
Relate 102 to 106
Pass2
Relate 101 to 102
Relate 101 to 103
Relate 101 to 105
Relate 101 to 106
The code below will get me to Pass 1. I tried it with an Update Statement and a Loop but that had its own problem. Hence, a recursion problem.
USE [SQLDEVL]
GO
/****** Object: Table [dbo].[tblCustomerIDs] Script Date: 07/13/2011 22:32:37 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[tblDUPS]') AND type in (N'U'))
DROP TABLE [dbo].[tblDUPS]
GO
USE [SQLDEVL]
GO
/****** Object: Table [dbo].[tblCustomerIDs] Script Date: 07/13/2011 22:32:37 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[tblDUPS](
[custid] [int] NULL,
[custid2] [int] NULL,
[MatchField] [CHAR](4)
) ON [PRIMARY]
GO
INSERT INTO tblDUPS (custid,custid2,MatchField)
VALUES(101,102,'DLN')
GO
INSERT INTO tblDUPS (custid,custid2,MatchField)
VALUES(101,103,'SSN')
GO
INSERT INTO tblDUPS (custid,custid2,MatchField)
VALUES(102,105,'NAME')
GO
INSERT INTO tblDUPS (custid,custid2,MatchField)
VALUES(105,106,'NAME')
GO
SELECT * FROM tblDUPS
GO
SELECT
ISNULL(tt.custid, t.custid) AS custid,
t.custid2,
NAME = CASE
WHEN t.MatchField = 'NAME' THEN 'Y' ELSE 'N'
END,
DLN = CASE
WHEN t.MatchField = 'DLN' THEN 'Y' ELSE 'N'
END,
CITY = CASE
WHEN t.MatchField = 'CITY' THEN 'Y' ELSE 'N'
END
FROM
tblDUPS AS t LEFT JOIN tblDUPS AS tt ON t.custid = tt.custid2;
July 15, 2011 at 1:53 am
BTT
July 15, 2011 at 12:37 pm
See if this approach will work for you. If you have questions about my choices let me know.
There may be more efficient ways to do this too...others may drop in and contribute new or improve on this.
USE tempdb
GO
IF EXISTS ( SELECT *
FROM sys.objects
WHERE object_id = OBJECT_ID(N'[dbo].[tblDUPS]')
AND type IN (N'U') )
DROP TABLE [dbo].[tblDUPS]
GO
CREATE TABLE [dbo].[tblDUPS]
(
[custid] [int] NULL,
[custid2] [int] NULL,
[MatchField] [CHAR](4)
)
GO
INSERT INTO tblDUPS
(custid, custid2, MatchField)
VALUES (101, 102, 'DLN'),
(102, 105, 'NAME'),
(105, 106, 'NAME'),
(101, 103, 'SSN'),
(109, 110, 'NAME'),
(110, 112, 'NAME'),
(112, 113, 'NAME') ;
GO
IF EXISTS ( SELECT *
FROM sys.objects
WHERE object_id = OBJECT_ID(N'dbo.get_leaves')
AND type IN (N'FN', N'IF', N'TF', N'FS', N'FT') )
DROP FUNCTION dbo.get_leaves ;
GO
CREATE FUNCTION dbo.get_leaves (@custid INT)
RETURNS TABLE
AS
RETURN
(WITH cte(custid, custid2, [level])
AS (
SELECT DISTINCT
t.custid AS custid,
t.custid2 AS custid2,
0 AS [level]
FROM dbo.tblDUPS t
WHERE custid = @custid
UNION ALL
SELECT t.custid,
t.custid2,
cte.[level] + 1 AS [level]
FROM cte
JOIN dbo.tblDUPS t ON cte.custid2 = t.custid
),
cte2
AS (
SELECT custid,
custid2,
level,
ROW_NUMBER() OVER (PARTITION BY custid2 ORDER BY level DESC) AS row_num
FROM cte
)
SELECT DISTINCT
@custid AS custid,
custid2
FROM cte2
WHERE row_num = 1 ) ;
GO
SELECT DISTINCT
t.custid,
gl.custid2
FROM dbo.tblDUPS t
CROSS APPLY dbo.get_leaves(t.custid) gl
WHERE NOT EXISTS ( SELECT *
FROM dbo.tblDUPS
WHERE custid2 = t.custid )
GO
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
July 16, 2011 at 5:26 am
opc.three you got it; hence, you got the problem. Two things I want to stress:
I did come up with a solution but it was very ... ugly and worked for small groups of data. When I process all 6k ROWS, it failed. So it wasn’t really much of a solution.
Secondly, I made up this scenario. There is no Customer table that begins with tbl. A record is composed of one or more rows from one or more tables. A tuple is a row but you never hear it being used today.
Anyway, there was no way I could post my customers data DDL. What is true is the fact that the database has multiple "customers" because people can register 200 times in a day with 200 different names or veriations of their name. I was tasked with finding duplicate registrations, reporting on them, and prevent any new duplicates to enter the system.
A human being has to actually verify if the two customers are in fact the same person. They can then use a merge program to merge the newer customer registration and their history into the oldest customer ID. It would be very helpful if the entire related customer ID's were group as opc has shown me.
Out of 2.1 million customers there are 6 thousand duplicate pairs. Identifying them was easy, reporting on them was a challenge.
Thank you again opc!!!
Have a great day
Hawaiian
July 16, 2011 at 7:15 am
You're welcome Hawaiian! I am happy it will work for you...enjoy the weekend 🙂
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
July 18, 2011 at 4:02 pm
Heheeeee. I got mad props from my customer 🙂
I couldn't in good conscience take all the credit
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply