July 13, 2013 at 11:35 pm
Dear,
I have a customer table containing many duplicate names and custid is the unique key. But the customer names are not 100% similar. For example,
CustIDCustName
--------------------
100ABC CO
101ABC CO.
102ABC CO&
103ABC CO,NY
Here all the custname refers to similar customer. Now I wanna delete the duplicate names. I need to search custnames which are 80% similar and delete those customers.
Please help me to do this.
Regards,
Akbar
July 14, 2013 at 9:38 am
shohelr2003 (7/13/2013)
Dear,I have a customer table containing many duplicate names and custid is the unique key. But the customer names are not 100% similar. For example,
CustIDCustName
--------------------
100ABC CO
101ABC CO.
102ABC CO&
103ABC CO,NY
Here all the custname refers to similar customer. Now I wanna delete the duplicate names. I need to search custnames which are 80% similar and delete those customers.
Please help me to do this.
Regards,
Akbar
I have a suspicion that you have over simplified your data,,,,,,does the company name always come first as you have described?
do you have any other customer related columns ( telephone/zipcode/address etc) that could help with mapping duplicates?
what is the scale of the problem,,,,ie no of rows?
how do you define 80% match?
regards
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
July 14, 2013 at 2:44 pm
This is a non-trivial problem, and a T-SQL only solution is impossible. If you are on SQL 2012, you should have a look at Data Quality Services. If you are on SQL 2008, the Data Profiling task in SSIS can help. (But then again, that is when you load the data, not when already have in the table).
[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]
July 14, 2013 at 7:37 pm
See article by Dwain Camps 2012/11/29
http://www.sqlservercentral.com/articles/String+Manipulation/94365/[/url]
His article on pattern matching may provide a solution. Based on your limited example, I came up with this
script that works on that limited set (the pattern split function is posted below). I can't guarantee that
it's a universal solution to the problem.
Sample data
IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL
DROP TABLE #TempTable
CREATE TABLE #TempTable
(
ID INT IDENTITY(1,1) NOT NULL
,CustID INT NOT NULL
,CustName VARCHAR(1000) NULL
,PRIMARY KEY CLUSTERED (CustID)
)
;WITH cteSampleData (CustID,CustName)
AS
(
SELECT 100,'ABC CO' UNION ALL
SELECT 101,'ABC CO.' UNION ALL
SELECT 102,'ABC CO&' UNION ALL
SELECT 103,'ABC CO,NY' UNION ALL
SELECT 104,'XYZ Corp' UNION ALL
SELECT 105,'XYZ Corporation' UNION ALL
SELECT 106,'XYZ Corp CA'
)
INSERT INTO #TempTable
(CustID,CustName)
SELECT
CustID,
CustName
FROM
cteSampleData
Script (requires function PatternSplitCM...see below)
SELECT
CustID
,BaseName
,Dupe
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY ps.Item ORDER BY t.ID) AS rowNum
,MAX(LEN(t.CustName)) OVER (PARTITION BY t.CustName) AS maxLen
,t.CustID
,ps.Item AS BaseName
,t.CustName as Dupe
FROM
#TempTable AS t
CROSS APPLY
dbo.PatternSplitCM(t.CustName,'[A-Za-z0-9]') AS ps
WHERE
ps.ItemNumber = 1
) r
WHERE
rowNum > 1
AND LEN(Dupe) >= maxLen
Output:
CustIDBaseNameDupe
102ABCABC CO&
103ABCABC CO,NY
101ABCABC CO.
106XYZXYZ Corp CA
105XYZXYZ Corporation
/* See article by Dwain Camps 2012/11/29 */
/* http://www.sqlservercentral.com/articles/String+Manipulation/94365/ */
CREATE FUNCTION [dbo].[PatternSplitCM] (@List VARCHAR(8000) = NULL, @Pattern VARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH numbers AS (
SELECT TOP (ISNULL(DATALENGTH(@List), 0)) n = ROW_NUMBER() OVER (
ORDER BY (
SELECT NULL
)
)
FROM (
VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)
) d(n), (
VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)
) e(n), (
VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)
) f(n), (
VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)
) g(n)
)
SELECT ItemNumber = ROW_NUMBER() OVER (
ORDER BY MIN(n)
), Item = SUBSTRING(@List, MIN(n), 1 + MAX(n) - MIN(n)), [Matched]
FROM (
SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER (
ORDER BY y.[Matched], n
)
FROM numbers
CROSS APPLY (
SELECT [Matched] = CASE
WHEN SUBSTRING(@List, n, 1) LIKE @Pattern
THEN 1
ELSE 0
END
) y
) d
GROUP BY [Matched], Grouper
GO
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply