Find The Similar Matching String

Question

Find The Similar Matching String

shohelr2003

Hall of Fame

Points: 3535
More actions
July 13, 2013 at 11:35 pm

#277416

Dear,
I have a customer table containing many duplicate names and custid is the unique key. But the customer names are not 100% similar. For example,
CustIDCustName
--------------------
100ABC CO
101ABC CO.
102ABC CO&
103ABC CO,NY
Here all the custname refers to similar customer. Now I wanna delete the duplicate names. I need to search custnames which are 80% similar and delete those customers.
Please help me to do this.
Regards,
Akbar

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 1

shohelr2003 (7/13/2013)
Dear,
I have a customer table containing many duplicate names and custid is the unique key. But the customer names are not 100% similar. For example,
CustIDCustName
--------------------
100ABC CO
101ABC CO.
102ABC CO&
103ABC CO,NY
Here all the custname refers to similar customer. Now I wanna delete the duplicate names. I need to search custnames which are 80% similar and delete those customers.
Please help me to do this.
Regards,
Akbar

I have a suspicion that you have over simplified your data,,,,,,does the company name always come first as you have described?

do you have any other customer related columns ( telephone/zipcode/address etc) that could help with mapping duplicates?

what is the scale of the problem,,,,ie no of rows?

how do you define 80% match?

regards

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 2

This is a non-trivial problem, and a T-SQL only solution is impossible. If you are on SQL 2012, you should have a look at Data Quality Services. If you are on SQL 2008, the Data Profiling task in SSIS can help. (But then again, that is when you load the data, not when already have in the table).

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

Steven Willis SSCrazy Eights Points: 9893 More actions · Answer 3

See article by Dwain Camps 2012/11/29

http://www.sqlservercentral.com/articles/String+Manipulation/94365/[/url]

His article on pattern matching may provide a solution. Based on your limited example, I came up with this

script that works on that limited set (the pattern split function is posted below). I can't guarantee that

it's a universal solution to the problem.

Sample data

IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL

DROP TABLE #TempTable

CREATE TABLE #TempTable

(

ID INT IDENTITY(1,1) NOT NULL

,CustID INT NOT NULL

,CustName VARCHAR(1000) NULL

,PRIMARY KEY CLUSTERED (CustID)

)

;WITH cteSampleData (CustID,CustName)

AS

(

SELECT 100,'ABC CO' UNION ALL

SELECT 101,'ABC CO.' UNION ALL

SELECT 102,'ABC CO&' UNION ALL

SELECT 103,'ABC CO,NY' UNION ALL

SELECT 104,'XYZ Corp' UNION ALL

SELECT 105,'XYZ Corporation' UNION ALL

SELECT 106,'XYZ Corp CA'

)

INSERT INTO #TempTable

(CustID,CustName)

SELECT

CustID,

CustName

FROM

cteSampleData

Script (requires function PatternSplitCM...see below)

SELECT

CustID

,BaseName

,Dupe

FROM

(

SELECT

ROW_NUMBER() OVER (PARTITION BY ps.Item ORDER BY t.ID) AS rowNum

,MAX(LEN(t.CustName)) OVER (PARTITION BY t.CustName) AS maxLen

,t.CustID

,ps.Item AS BaseName

,t.CustName as Dupe

FROM

#TempTable AS t

CROSS APPLY

dbo.PatternSplitCM(t.CustName,'[A-Za-z0-9]') AS ps

WHERE

ps.ItemNumber = 1

) r

WHERE

rowNum > 1

AND LEN(Dupe) >= maxLen

Output:

CustIDBaseNameDupe

102ABCABC CO&

103ABCABC CO,NY

101ABCABC CO.

106XYZXYZ Corp CA

105XYZXYZ Corporation

/* See article by Dwain Camps 2012/11/29 */

/* http://www.sqlservercentral.com/articles/String+Manipulation/94365/ */

CREATE FUNCTION [dbo].[PatternSplitCM] (@List VARCHAR(8000) = NULL, @Pattern VARCHAR(50))

RETURNS TABLE

WITH SCHEMABINDING

AS

RETURN

WITH numbers AS (

SELECT TOP (ISNULL(DATALENGTH(@List), 0)) n = ROW_NUMBER() OVER (

ORDER BY (

SELECT NULL

)

FROM (

VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

) d(n), (

VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

) e(n), (

VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

) f(n), (

VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0)

) g(n)

)

SELECT ItemNumber = ROW_NUMBER() OVER (

ORDER BY MIN(n)

), Item = SUBSTRING(@List, MIN(n), 1 + MAX(n) - MIN(n)), [Matched]

FROM (

SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER (

ORDER BY y.[Matched], n

)

FROM numbers

CROSS APPLY (

SELECT [Matched] = CASE

WHEN SUBSTRING(@List, n, 1) LIKE @Pattern

THEN 1

ELSE 0

END

) y

) d

GROUP BY [Matched], Grouper

GO