Comparing and Deleting repeating values in semicolon delimited string

  • Hi All,

    I have a table with 10 columns. One of the columns has records which include semicolon delimited string of characters. e.x. T-000007309-000B; T-000007309-000C; T-000007309-000B

    there could be only 1 or more strings starting with T. In example above there are 3 of them, they could be 1 or they could be even 15. I would like to compare this whole record and if any of such string (starting with T) is repeating again in the same list then I would like to delete that duplicate in this particular colum for this particular record. So for example above I would like to remove

    T-000007309-000B since it is repeating and update this record so it has only T-000007309-000B; T-000007309-000C.

    My table has 100K + records. Can anyone please help?

    Thanks a lot in advance...

  • This is an easy task for a tally table with a split function. Do you by chance have a numbers table or split function already in your DB?

    If not, here's the link to Jeff Moden's article that contains instructions on how to create the Tally table (dbo.Tally):

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    Here's my split function:

    IF OBJECT_ID('dbo.udf_Split') IS NOT NULL

    DROP FUNCTION dbo.udf_Split

    GO

    CREATE FUNCTION dbo.udf_Split(

    @ItemList varchar(max),

    @Delimiter varchar(10))

    RETURNS @IDTable TABLE (ItemNumber int, Item varchar(max))

    /*============================================================

    Author: JJR

    Purpose: Splits delimited list into a table variable.

    Modification History

    04/03/2009 JJR - Created

    ============================================================*/

    AS

    BEGIN

    INSERT INTO @IDTable

    SELECT ROW_NUMBER() OVER(ORDER BY N),

    SUBSTRING(@ItemList+@Delimiter, N,CHARINDEX(@Delimiter, @ItemList+@Delimiter, N) - N)

    FROM dbo.Tally

    WHERE N <= LEN(@ItemList)

    AND SUBSTRING(@Delimiter + @ItemList,N, LEN(@Delimiter)) = @Delimiter

    ORDER BY N

    RETURN

    END

    And finally, here's some sample code on how this ties together with your problem.

    DECLARE @List varchar(max),

    @NewList varchar(max)

    SET @List = 'T-000007309-000B; T-000007309-000C; T-000007309-000B'

    SELECT @List

    SET @NewList = ''

    SELECT @NewList = @NewList + LTRIM(Item) + ';'

    FROM dbo.udf_Split(@List, ';')

    GROUP BY LTRIM(Item)

    SELECT @NewList

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • By the way, this is the quick and dirty example. I realize that your data is in a column in a table, which makes my variable based exmaple not as relevent. I'll have a bit of time after a while to re-do the code to work with a column in a table, but that should help get you going in the right direction for now.

    Sorry for the confusion.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Thanks Much John...

  • OK, that was quicker than I thought. I created a scalar function to be applied to your column. There may be a better way to do this performance wise, but for a table of that size, this may not be too bad. So here's a function that used the split function that I posted above.

    IF OBJECT_ID('dbo.udf_SplitColumn') IS NOT NULL

    DROP FUNCTION dbo.udf_SplitColumn

    GO

    CREATE FUNCTION dbo.udf_SplitColumn(

    @ItemList varchar(max),

    @Delimiter varchar(10))

    RETURNS varchar(max)

    AS

    BEGIN

    DECLARE @NewList varchar(max)

    SET @NewList = ''

    SELECT @NewList = @NewList + LTRIM(Item) + ';'

    FROM dbo.udf_Split(@ItemList, @Delimiter)

    GROUP BY LTRIM(Item)

    RETURN (@NewList)

    END

    And here's some code that uses that to get your results.

    DECLARE @Table TABLE (Col1 varchar(max))

    INSERT INTO @Table

    SELECT 'T-000007309-000B; T-000007309-000C; T-000007309-000B'

    SELECT dbo.udf_SplitColumn(Col1,';')

    FROM @Table

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Yup this works... great thanks very much John...

  • No problem. I would not be suprised to find that you get some other solutions from some of the other regulars here on SSC. I cringe at the idea of nesting a table valued function within a scalar function and applying that to a column, but given the time I have to prepare something, I at least know this will work.

    You may want to copy your table into a test environment and test this stuff out there for performance. Much of how this performs will be based on how long the string lists are within the column in question.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • Does the order of the data in the record matter?

    Edit -- Never mind. I got sidetracked. My development servers finally came back up and I had work to complete.

  • Here are two queries which will do the trick.

    Both use a tally table to transform the segments within the column into rows, then use a GROUP BY

    to dedupe. To transform the rows back into a single column, one query uses an "old-style" CASE construct,

    the other uses PIVOT.

    I've tested this against a sample table containing over 100,000 rows of which about 10% contain 15 separate values

    in the column, and the SELECT time in each case is between 1 and 2 seconds.

    Very little modification is required to convert either SELECT query into an UPDATE...FROM.

    Sample data:

    -- Dedupe the content of a column containing delimited values

    -- make over 100,000 rows of sample data with up to 15 values in the target column

    DROP TABLE #Sample

    CREATE TABLE #Sample (DodgyStringData VARCHAR(300))

    INSERT INTO #Sample (DodgyStringData)

    SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000C; T-000007309-000D; T-000007309-000E; T-000007309-000F; T-000007309-000G; T-000007309-000H; T-000007309-000I; T-000007309-000J; T-000007309-000K; T-000007309-000L; T-000007309-000M; T-000007309-000N; T-000007309-000O' UNION ALL

    SELECT 'T-000007309-000B; T-000007309-000C; T-000007309-000B; T-000007309-000B; T-000007309-000B; T-000007309-000B; T-000007309-000B' UNION ALL

    SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000C' UNION ALL

    SELECT 'T-000007309-000A; T-000007309-000C; T-000007309-000B' UNION ALL

    SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000B' UNION ALL

    SELECT 'T-000007309-000A; T-000007309-000A; T-000007309-000A' UNION ALL

    SELECT 'T-000007309-002A; T-000007309-001A; T-000007309-000A' UNION ALL

    SELECT 'T-7309-000B; T-07309-000C; T-007309-000B' UNION ALL

    SELECT 'T-000007309-000B; T-000007309-000B; T-000007309-000B' UNION ALL

    SELECT 'T-000007309-000C; T-000007309-000C; T-000007309-000C'

    INSERT INTO #Sample SELECT a.* FROM #Sample a, #Sample b, #Sample c, #Sample d, #Sample e

    Pivot query:

    -- run the PIVOT SELECT query (2 seconds for 100,010 rows)

    SELECT s.DodgyStringData,

    LessDodgyStringData = LEFT(x.LessDodgyStringData, LEN(x.LessDodgyStringData)-1)

    FROM #Sample s

    CROSS APPLY ( -- x: reassemble (pivot) segments

    SELECT LessDodgyStringData =

    ISNULL([1] + '; ', '') + ISNULL([2] + '; ', '') + ISNULL([3] + '; ', '') +

    ISNULL([4] + '; ', '') + ISNULL([5] + '; ', '') + ISNULL([6] + '; ', '') +

    ISNULL([7] + '; ', '') + ISNULL([8] + '; ', '') + ISNULL([9] + '; ', '') +

    ISNULL([10] + '; ', '') + ISNULL([11] + '; ', '') + ISNULL([12] + '; ', '') +

    ISNULL([13] + '; ', '') + ISNULL([14] + '; ', '') + ISNULL([15] + '; ', '')

    FROM ( -- SourceTable: resolve segments as rows, and dedupe

    SELECT ColNo = ROW_NUMBER() OVER(ORDER BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)),

    SUBSTRING(s.DodgyStringData, d.b, d.e-d.b) AS OneDedupedVal

    FROM ( -- d: gather delimiter positions

    SELECT n.n AS b, -- delimiter at beginning of string

    ISNULL(NULLIF(CHARINDEX('; ', s.DodgyStringData, n.n+1), 0), LEN(s.DodgyStringData)+1) AS e -- delimiter at end of string

    FROM (SELECT TOP 300 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n

    WHERE n.n <= LEN(s.DodgyStringData) AND SUBSTRING('; '+s.DodgyStringData+'; ', n.n, 1) = '; '

    ) d GROUP BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)

    ) AS SourceTable

    PIVOT ( MAX(OneDedupedVal)

    FOR ColNo IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15])

    ) AS PivotTable

    ) x

    -- / run the PIVOT SELECT query (2 seconds for 100,010 rows)

    Traditional query:

    -- run the traditional SELECT query (2 seconds for 100,010 rows)

    SELECT s.DodgyStringData,

    LessDodgyStringData = LEFT(x.LessDodgyStringData, LEN(x.LessDodgyStringData)-1)

    FROM #Sample s

    CROSS APPLY ( -- x: reassemble (unpivot) segments

    SELECT LessDodgyStringData =

    ISNULL(MAX(CASE f.ColNo WHEN 1 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 2 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 3 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 4 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 5 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 6 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 7 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 8 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 9 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 10 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 11 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 12 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 13 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 14 THEN OneDedupedVal + '; ' END), '') +

    ISNULL(MAX(CASE f.ColNo WHEN 15 THEN OneDedupedVal + '; ' END), '')

    FROM ( -- f: resolve segments and dedupe

    SELECT ColNo = ROW_NUMBER() OVER(ORDER BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)),

    SUBSTRING(s.DodgyStringData, d.b, d.e-d.b) AS OneDedupedVal

    FROM ( -- d: gather delimiter positions

    SELECT n.n AS b, -- delimiter at beginning of string

    ISNULL(NULLIF(CHARINDEX('; ', s.DodgyStringData, n.n+1), 0), LEN(s.DodgyStringData)+1) AS e -- delimiter at end of string

    FROM (SELECT TOP 300 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n

    WHERE n.n <= LEN(s.DodgyStringData) AND SUBSTRING('; '+s.DodgyStringData+'; ', n.n, 1) = '; '

    ) d GROUP BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)

    ) f

    ) x

    -- / run the traditional SELECT query (2 seconds for 100,010 rows)

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply