Comparing and Deleting repeating values in semicolon delimited string

Question

Comparing and Deleting repeating values in semicolon delimited string

apat

SSCertifiable

Points: 5929
More actions
January 12, 2010 at 12:49 pm

#208942

Hi All,
I have a table with 10 columns. One of the columns has records which include semicolon delimited string of characters. e.x. T-000007309-000B; T-000007309-000C; T-000007309-000B
there could be only 1 or more strings starting with T. In example above there are 3 of them, they could be 1 or they could be even 15. I would like to compare this whole record and if any of such string (starting with T) is repeating again in the same list then I would like to delete that duplicate in this particular colum for this particular record. So for example above I would like to remove
T-000007309-000B since it is repeating and update this record so it has only T-000007309-000B; T-000007309-000C.
My table has 100K + records. Can anyone please help?
Thanks a lot in advance...

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

John Rowan SSC Guru Points: 56440 More actions · Answer 1

This is an easy task for a tally table with a split function. Do you by chance have a numbers table or split function already in your DB?

If not, here's the link to Jeff Moden's article that contains instructions on how to create the Tally table (dbo.Tally):

http://www.sqlservercentral.com/articles/T-SQL/62867/

Here's my split function:

IF OBJECT_ID('dbo.udf_Split') IS NOT NULL

DROP FUNCTION dbo.udf_Split

GO

CREATE FUNCTION dbo.udf_Split(

@ItemList varchar(max),

@Delimiter varchar(10))

RETURNS @IDTable TABLE (ItemNumber int, Item varchar(max))

/*============================================================

Author: JJR

Purpose: Splits delimited list into a table variable.

Modification History

04/03/2009 JJR - Created

============================================================*/

AS

BEGIN

INSERT INTO @IDTable

SELECT ROW_NUMBER() OVER(ORDER BY N),

SUBSTRING(@ItemList+@Delimiter, N,CHARINDEX(@Delimiter, @ItemList+@Delimiter, N) - N)

FROM dbo.Tally

WHERE N <= LEN(@ItemList)

AND SUBSTRING(@Delimiter + @ItemList,N, LEN(@Delimiter)) = @Delimiter

ORDER BY N

RETURN

END

And finally, here's some sample code on how this ties together with your problem.

DECLARE @List varchar(max),

@NewList varchar(max)

SET @List = 'T-000007309-000B; T-000007309-000C; T-000007309-000B'

SELECT @List

SET @NewList = ''

SELECT @NewList = @NewList + LTRIM(Item) + ';'

FROM dbo.udf_Split(@List, ';')

GROUP BY LTRIM(Item)

SELECT @NewList

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

John Rowan SSC Guru Points: 56440 More actions · Answer 2

By the way, this is the quick and dirty example. I realize that your data is in a column in a table, which makes my variable based exmaple not as relevent. I'll have a bit of time after a while to re-do the code to work with a column in a table, but that should help get you going in the right direction for now.

Sorry for the confusion.

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

apat SSCertifiable Points: 5929 More actions · Answer 3

apat

SSCertifiable

Points: 5929

January 12, 2010 at 1:13 pm

#1102708

Thanks Much John...

John Rowan SSC Guru Points: 56440 More actions · Answer 4

OK, that was quicker than I thought. I created a scalar function to be applied to your column. There may be a better way to do this performance wise, but for a table of that size, this may not be too bad. So here's a function that used the split function that I posted above.

IF OBJECT_ID('dbo.udf_SplitColumn') IS NOT NULL

DROP FUNCTION dbo.udf_SplitColumn

GO

CREATE FUNCTION dbo.udf_SplitColumn(

@ItemList varchar(max),

@Delimiter varchar(10))

RETURNS varchar(max)

AS

BEGIN

DECLARE @NewList varchar(max)

SET @NewList = ''

SELECT @NewList = @NewList + LTRIM(Item) + ';'

FROM dbo.udf_Split(@ItemList, @Delimiter)

GROUP BY LTRIM(Item)

RETURN (@NewList)

END

And here's some code that uses that to get your results.

DECLARE @Table TABLE (Col1 varchar(max))

INSERT INTO @Table

SELECT 'T-000007309-000B; T-000007309-000C; T-000007309-000B'

SELECT dbo.udf_SplitColumn(Col1,';')

FROM @Table

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

apat SSCertifiable Points: 5929 More actions · Answer 5

apat

SSCertifiable

Points: 5929

January 12, 2010 at 1:18 pm

#1102714

Yup this works... great thanks very much John...

John Rowan SSC Guru Points: 56440 More actions · Answer 6

No problem. I would not be suprised to find that you get some other solutions from some of the other regulars here on SSC. I cringe at the idea of nesting a table valued function within a scalar function and applying that to a column, but given the time I have to prepare something, I at least know this will work.

You may want to copy your table into a test environment and test this stuff out there for performance. Much of how this performs will be based on how long the string lists are within the column in question.

John Rowan

======================================================
======================================================
Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 7

Does the order of the data in the record matter?

Edit -- Never mind. I got sidetracked. My development servers finally came back up and I had work to complete.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

Here are two queries which will do the trick.

Both use a tally table to transform the segments within the column into rows, then use a GROUP BY

to dedupe. To transform the rows back into a single column, one query uses an "old-style" CASE construct,

the other uses PIVOT.

I've tested this against a sample table containing over 100,000 rows of which about 10% contain 15 separate values

in the column, and the SELECT time in each case is between 1 and 2 seconds.

Very little modification is required to convert either SELECT query into an UPDATE...FROM.

Sample data:

-- Dedupe the content of a column containing delimited values

-- make over 100,000 rows of sample data with up to 15 values in the target column

DROP TABLE #Sample

CREATE TABLE #Sample (DodgyStringData VARCHAR(300))

INSERT INTO #Sample (DodgyStringData)

SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000C; T-000007309-000D; T-000007309-000E; T-000007309-000F; T-000007309-000G; T-000007309-000H; T-000007309-000I; T-000007309-000J; T-000007309-000K; T-000007309-000L; T-000007309-000M; T-000007309-000N; T-000007309-000O' UNION ALL

SELECT 'T-000007309-000B; T-000007309-000C; T-000007309-000B; T-000007309-000B; T-000007309-000B; T-000007309-000B; T-000007309-000B' UNION ALL

SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000C' UNION ALL

SELECT 'T-000007309-000A; T-000007309-000C; T-000007309-000B' UNION ALL

SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000B' UNION ALL

SELECT 'T-000007309-000A; T-000007309-000A; T-000007309-000A' UNION ALL

SELECT 'T-000007309-002A; T-000007309-001A; T-000007309-000A' UNION ALL

SELECT 'T-7309-000B; T-07309-000C; T-007309-000B' UNION ALL

SELECT 'T-000007309-000B; T-000007309-000B; T-000007309-000B' UNION ALL

SELECT 'T-000007309-000C; T-000007309-000C; T-000007309-000C'

INSERT INTO #Sample SELECT a.* FROM #Sample a, #Sample b, #Sample c, #Sample d, #Sample e

Pivot query:

-- run the PIVOT SELECT query (2 seconds for 100,010 rows)

SELECT s.DodgyStringData,

LessDodgyStringData = LEFT(x.LessDodgyStringData, LEN(x.LessDodgyStringData)-1)

FROM #Sample s

CROSS APPLY ( -- x: reassemble (pivot) segments

SELECT LessDodgyStringData =

ISNULL([1] + '; ', '') + ISNULL([2] + '; ', '') + ISNULL([3] + '; ', '') +

ISNULL([4] + '; ', '') + ISNULL([5] + '; ', '') + ISNULL([6] + '; ', '') +

ISNULL([7] + '; ', '') + ISNULL([8] + '; ', '') + ISNULL([9] + '; ', '') +

ISNULL([10] + '; ', '') + ISNULL([11] + '; ', '') + ISNULL([12] + '; ', '') +

ISNULL([13] + '; ', '') + ISNULL([14] + '; ', '') + ISNULL([15] + '; ', '')

FROM ( -- SourceTable: resolve segments as rows, and dedupe

SELECT ColNo = ROW_NUMBER() OVER(ORDER BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)),

SUBSTRING(s.DodgyStringData, d.b, d.e-d.b) AS OneDedupedVal

FROM ( -- d: gather delimiter positions

SELECT n.n AS b, -- delimiter at beginning of string

ISNULL(NULLIF(CHARINDEX('; ', s.DodgyStringData, n.n+1), 0), LEN(s.DodgyStringData)+1) AS e -- delimiter at end of string

FROM (SELECT TOP 300 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n

WHERE n.n <= LEN(s.DodgyStringData) AND SUBSTRING('; '+s.DodgyStringData+'; ', n.n, 1) = '; '

) d GROUP BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)

) AS SourceTable

PIVOT ( MAX(OneDedupedVal)

FOR ColNo IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15])

) AS PivotTable

) x

-- / run the PIVOT SELECT query (2 seconds for 100,010 rows)

Traditional query:

-- run the traditional SELECT query (2 seconds for 100,010 rows)

SELECT s.DodgyStringData,

LessDodgyStringData = LEFT(x.LessDodgyStringData, LEN(x.LessDodgyStringData)-1)

FROM #Sample s

CROSS APPLY ( -- x: reassemble (unpivot) segments

SELECT LessDodgyStringData =

ISNULL(MAX(CASE f.ColNo WHEN 1 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 2 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 3 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 4 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 5 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 6 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 7 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 8 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 9 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 10 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 11 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 12 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 13 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 14 THEN OneDedupedVal + '; ' END), '') +

ISNULL(MAX(CASE f.ColNo WHEN 15 THEN OneDedupedVal + '; ' END), '')

FROM ( -- f: resolve segments and dedupe

SELECT ColNo = ROW_NUMBER() OVER(ORDER BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)),

SUBSTRING(s.DodgyStringData, d.b, d.e-d.b) AS OneDedupedVal

FROM ( -- d: gather delimiter positions

SELECT n.n AS b, -- delimiter at beginning of string

ISNULL(NULLIF(CHARINDEX('; ', s.DodgyStringData, n.n+1), 0), LEN(s.DodgyStringData)+1) AS e -- delimiter at end of string

FROM (SELECT TOP 300 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n

WHERE n.n <= LEN(s.DodgyStringData) AND SUBSTRING('; '+s.DodgyStringData+'; ', n.n, 1) = '; '

) d GROUP BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)

) f

) x

-- / run the traditional SELECT query (2 seconds for 100,010 rows)

Cheers

ChrisM

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden