January 12, 2010 at 12:49 pm
Hi All,
I have a table with 10 columns. One of the columns has records which include semicolon delimited string of characters. e.x. T-000007309-000B; T-000007309-000C; T-000007309-000B
there could be only 1 or more strings starting with T. In example above there are 3 of them, they could be 1 or they could be even 15. I would like to compare this whole record and if any of such string (starting with T) is repeating again in the same list then I would like to delete that duplicate in this particular colum for this particular record. So for example above I would like to remove
T-000007309-000B since it is repeating and update this record so it has only T-000007309-000B; T-000007309-000C.
My table has 100K + records. Can anyone please help?
Thanks a lot in advance...
January 12, 2010 at 1:01 pm
This is an easy task for a tally table with a split function. Do you by chance have a numbers table or split function already in your DB?
If not, here's the link to Jeff Moden's article that contains instructions on how to create the Tally table (dbo.Tally):
http://www.sqlservercentral.com/articles/T-SQL/62867/
Here's my split function:
IF OBJECT_ID('dbo.udf_Split') IS NOT NULL
DROP FUNCTION dbo.udf_Split
GO
CREATE FUNCTION dbo.udf_Split(
@ItemList varchar(max),
@Delimiter varchar(10))
RETURNS @IDTable TABLE (ItemNumber int, Item varchar(max))
/*============================================================
Author: JJR
Purpose: Splits delimited list into a table variable.
Modification History
04/03/2009 JJR - Created
============================================================*/
AS
BEGIN
INSERT INTO @IDTable
SELECT ROW_NUMBER() OVER(ORDER BY N),
SUBSTRING(@ItemList+@Delimiter, N,CHARINDEX(@Delimiter, @ItemList+@Delimiter, N) - N)
FROM dbo.Tally
WHERE N <= LEN(@ItemList)
AND SUBSTRING(@Delimiter + @ItemList,N, LEN(@Delimiter)) = @Delimiter
ORDER BY N
RETURN
END
And finally, here's some sample code on how this ties together with your problem.
DECLARE @List varchar(max),
@NewList varchar(max)
SET @List = 'T-000007309-000B; T-000007309-000C; T-000007309-000B'
SELECT @List
SET @NewList = ''
SELECT @NewList = @NewList + LTRIM(Item) + ';'
FROM dbo.udf_Split(@List, ';')
GROUP BY LTRIM(Item)
SELECT @NewList
January 12, 2010 at 1:06 pm
By the way, this is the quick and dirty example. I realize that your data is in a column in a table, which makes my variable based exmaple not as relevent. I'll have a bit of time after a while to re-do the code to work with a column in a table, but that should help get you going in the right direction for now.
Sorry for the confusion.
January 12, 2010 at 1:13 pm
Thanks Much John...
January 12, 2010 at 1:14 pm
OK, that was quicker than I thought. I created a scalar function to be applied to your column. There may be a better way to do this performance wise, but for a table of that size, this may not be too bad. So here's a function that used the split function that I posted above.
IF OBJECT_ID('dbo.udf_SplitColumn') IS NOT NULL
DROP FUNCTION dbo.udf_SplitColumn
GO
CREATE FUNCTION dbo.udf_SplitColumn(
@ItemList varchar(max),
@Delimiter varchar(10))
RETURNS varchar(max)
AS
BEGIN
DECLARE @NewList varchar(max)
SET @NewList = ''
SELECT @NewList = @NewList + LTRIM(Item) + ';'
FROM dbo.udf_Split(@ItemList, @Delimiter)
GROUP BY LTRIM(Item)
RETURN (@NewList)
END
And here's some code that uses that to get your results.
DECLARE @Table TABLE (Col1 varchar(max))
INSERT INTO @Table
SELECT 'T-000007309-000B; T-000007309-000C; T-000007309-000B'
SELECT dbo.udf_SplitColumn(Col1,';')
FROM @Table
January 12, 2010 at 1:18 pm
Yup this works... great thanks very much John...
January 12, 2010 at 1:21 pm
No problem. I would not be suprised to find that you get some other solutions from some of the other regulars here on SSC. I cringe at the idea of nesting a table valued function within a scalar function and applying that to a column, but given the time I have to prepare something, I at least know this will work.
You may want to copy your table into a test environment and test this stuff out there for performance. Much of how this performs will be based on how long the string lists are within the column in question.
January 12, 2010 at 1:42 pm
Does the order of the data in the record matter?
Edit -- Never mind. I got sidetracked. My development servers finally came back up and I had work to complete.
January 13, 2010 at 6:18 am
Here are two queries which will do the trick.
Both use a tally table to transform the segments within the column into rows, then use a GROUP BY
to dedupe. To transform the rows back into a single column, one query uses an "old-style" CASE construct,
the other uses PIVOT.
I've tested this against a sample table containing over 100,000 rows of which about 10% contain 15 separate values
in the column, and the SELECT time in each case is between 1 and 2 seconds.
Very little modification is required to convert either SELECT query into an UPDATE...FROM.
Sample data:
-- Dedupe the content of a column containing delimited values
-- make over 100,000 rows of sample data with up to 15 values in the target column
DROP TABLE #Sample
CREATE TABLE #Sample (DodgyStringData VARCHAR(300))
INSERT INTO #Sample (DodgyStringData)
SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000C; T-000007309-000D; T-000007309-000E; T-000007309-000F; T-000007309-000G; T-000007309-000H; T-000007309-000I; T-000007309-000J; T-000007309-000K; T-000007309-000L; T-000007309-000M; T-000007309-000N; T-000007309-000O' UNION ALL
SELECT 'T-000007309-000B; T-000007309-000C; T-000007309-000B; T-000007309-000B; T-000007309-000B; T-000007309-000B; T-000007309-000B' UNION ALL
SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000C' UNION ALL
SELECT 'T-000007309-000A; T-000007309-000C; T-000007309-000B' UNION ALL
SELECT 'T-000007309-000A; T-000007309-000B; T-000007309-000B' UNION ALL
SELECT 'T-000007309-000A; T-000007309-000A; T-000007309-000A' UNION ALL
SELECT 'T-000007309-002A; T-000007309-001A; T-000007309-000A' UNION ALL
SELECT 'T-7309-000B; T-07309-000C; T-007309-000B' UNION ALL
SELECT 'T-000007309-000B; T-000007309-000B; T-000007309-000B' UNION ALL
SELECT 'T-000007309-000C; T-000007309-000C; T-000007309-000C'
INSERT INTO #Sample SELECT a.* FROM #Sample a, #Sample b, #Sample c, #Sample d, #Sample e
Pivot query:
-- run the PIVOT SELECT query (2 seconds for 100,010 rows)
SELECT s.DodgyStringData,
LessDodgyStringData = LEFT(x.LessDodgyStringData, LEN(x.LessDodgyStringData)-1)
FROM #Sample s
CROSS APPLY ( -- x: reassemble (pivot) segments
SELECT LessDodgyStringData =
ISNULL([1] + '; ', '') + ISNULL([2] + '; ', '') + ISNULL([3] + '; ', '') +
ISNULL([4] + '; ', '') + ISNULL([5] + '; ', '') + ISNULL([6] + '; ', '') +
ISNULL([7] + '; ', '') + ISNULL([8] + '; ', '') + ISNULL([9] + '; ', '') +
ISNULL([10] + '; ', '') + ISNULL([11] + '; ', '') + ISNULL([12] + '; ', '') +
ISNULL([13] + '; ', '') + ISNULL([14] + '; ', '') + ISNULL([15] + '; ', '')
FROM ( -- SourceTable: resolve segments as rows, and dedupe
SELECT ColNo = ROW_NUMBER() OVER(ORDER BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)),
SUBSTRING(s.DodgyStringData, d.b, d.e-d.b) AS OneDedupedVal
FROM ( -- d: gather delimiter positions
SELECT n.n AS b, -- delimiter at beginning of string
ISNULL(NULLIF(CHARINDEX('; ', s.DodgyStringData, n.n+1), 0), LEN(s.DodgyStringData)+1) AS e -- delimiter at end of string
FROM (SELECT TOP 300 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n
WHERE n.n <= LEN(s.DodgyStringData) AND SUBSTRING('; '+s.DodgyStringData+'; ', n.n, 1) = '; '
) d GROUP BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)
) AS SourceTable
PIVOT ( MAX(OneDedupedVal)
FOR ColNo IN ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15])
) AS PivotTable
) x
-- / run the PIVOT SELECT query (2 seconds for 100,010 rows)
Traditional query:
-- run the traditional SELECT query (2 seconds for 100,010 rows)
SELECT s.DodgyStringData,
LessDodgyStringData = LEFT(x.LessDodgyStringData, LEN(x.LessDodgyStringData)-1)
FROM #Sample s
CROSS APPLY ( -- x: reassemble (unpivot) segments
SELECT LessDodgyStringData =
ISNULL(MAX(CASE f.ColNo WHEN 1 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 2 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 3 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 4 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 5 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 6 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 7 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 8 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 9 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 10 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 11 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 12 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 13 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 14 THEN OneDedupedVal + '; ' END), '') +
ISNULL(MAX(CASE f.ColNo WHEN 15 THEN OneDedupedVal + '; ' END), '')
FROM ( -- f: resolve segments and dedupe
SELECT ColNo = ROW_NUMBER() OVER(ORDER BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)),
SUBSTRING(s.DodgyStringData, d.b, d.e-d.b) AS OneDedupedVal
FROM ( -- d: gather delimiter positions
SELECT n.n AS b, -- delimiter at beginning of string
ISNULL(NULLIF(CHARINDEX('; ', s.DodgyStringData, n.n+1), 0), LEN(s.DodgyStringData)+1) AS e -- delimiter at end of string
FROM (SELECT TOP 300 [n] = ROW_NUMBER() OVER(ORDER BY [name]) FROM master.dbo.syscolumns) n
WHERE n.n <= LEN(s.DodgyStringData) AND SUBSTRING('; '+s.DodgyStringData+'; ', n.n, 1) = '; '
) d GROUP BY SUBSTRING(s.DodgyStringData, d.b, d.e-d.b)
) f
) x
-- / run the traditional SELECT query (2 seconds for 100,010 rows)
Cheers
ChrisM
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply