Remove duplicate substrings from string

Question

Remove duplicate substrings from string

mickey w

Old Hand

Points: 369
More actions
April 24, 2018 at 1:47 am

#344495

Hello experts, I have a problem I'd appreciate advice on please.
I'm asked to determine distinct combinations of sequences of values from a field which contains delimited values. My issue is that I need to remove instances of duplicated strings, while preserving sort order.
Examples of the input and output are given in the table:
create table #stringsplit (str_raw varchar(30), str_output varchar(30))
insert into #stringsplit (str_raw, str_output)
values ('A > A > A', 'A') , ('A > A > A > A > C', 'A > C') , ('A > A > A > A > C > A', 'A > C > A') , ('A > A > A > A > C > B', 'A > C > B') , ('D > B > B > D > C', 'D > B > D > C')
I've seen examples of user defined functions where users parse and reconstruct strings, but not where the order of occurrence of the substring is preserved.
Thanks
Mickey

Viewing 11 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Thom A SSC Guru Points: 99143 More actions · Answer 1

One idea:

WITH Grps AS ( SELECT SS.str_raw, DS.Item, DS.ItemNumber, ROW_NUMBER() OVER (PARTITION BY SS.str_raw ORDER BY DS.ItemNumber) - ROW_NUMBER() OVER (PARTITION BY SS.str_raw, DS.Item ORDER BY DS.ItemNumber) AS Grp FROM #stringsplit SS CROSS APPLY DelimitedSplit8K (REPLACE(SS.str_raw,' ',''), '>') DS), Distinctions AS ( SELECT DISTINCT str_raw, Item, Grp FROM Grps) SELECT STUFF(REPLACE((SELECT ' > ' + Item FROM Distinctions sq WHERE sq.str_raw = D.str_raw ORDER BY Grp FOR XML PATH ('')),'>','>'),1,3,'') FROM Distinctions D GROUP BY D.str_raw;

Note this uses jeff Mode's Delimited Split 8K; as STRING_SPLIT is only available on SQL Server 2016+.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

mickey w Old Hand Points: 369 More actions · Answer 2

Thanks Thom, I used the DelimitedSplit8K described by Jeff Moden - http://www.sqlservercentral.com/articles/Tally+Table/72993/

I get the results below:

A
A > C
A > A > C
A > C > B
D > B > D > C

Note the third row, which I'd want to be:

A > C > A

All other rows look good, but this one is out of sequence, can anyone describe why?
Thanks
Mickey

Thom A SSC Guru Points: 99143 More actions · Answer 3

mickey w - Tuesday, April 24, 2018 2:58 AM
Thanks Thom, I used the DelimitedSplit8K described by Jeff Moden - http://www.sqlservercentral.com/articles/Tally+Table/72993/
I get the results below:
A
A > C
A > A > C
A > C > B
D > B > D > C
Note the third row, which I'd want to be:
A > C > A
All other rows look good, but this one is out of sequence, can anyone describe why?
Thanks
Mickey

Give this a go:

WITH Grps AS ( SELECT SS.str_raw, DS.Item, DS.ItemNumber, ROW_NUMBER() OVER (PARTITION BY SS.str_raw ORDER BY DS.ItemNumber) - ROW_NUMBER() OVER (PARTITION BY SS.str_raw, DS.Item ORDER BY DS.ItemNumber) AS Grp FROM #stringsplit SS CROSS APPLY DelimitedSplit8K (REPLACE(SS.str_raw,' ',''), '>') DS), Distinctions AS ( SELECT str_raw, Item, GRP, MIN(ItemNumber) AS ItemNumber FROM Grps GROUP BY str_raw, Item, Grp) SELECT D.str_raw, STUFF(REPLACE((SELECT ' > ' + Item FROM Distinctions sq WHERE sq.str_raw = D.str_raw ORDER BY ItemNumber FOR XML PATH ('')),'>','>'),1,3,'') FROM Distinctions D GROUP BY D.str_raw;

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

mickey w Old Hand Points: 369 More actions · Answer 4

fantastic, thanks Thom this looks to have solved it.

much appreciated.

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 5

For fun, here is a more efficient way (I think 😉 not tested)
😎

USE TEEST;
GO
SET NOCOUNT ON;
create table #stringsplit (str_raw varchar(30), str_output varchar(30))
insert into #stringsplit (str_raw, str_output)
values ('A > A > A', 'A')
, ('A > A > A > A > C', 'A > C')
, ('A > A > A > A > C > A', 'A > C > A')
, ('A > A > A > A > C > B', 'A > C > B')
, ('D > B > B > D > C', 'D > B > D > C')
;WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
,SAMPLE_DATA AS 
(
  SELECT 
   ROW_NUMBER() OVER (ORDER BY @@VERSION) AS STR_RID
   ,S.str_raw
  FROM #stringsplit S
)
,GROUPED_SET AS
(
SELECT 
  SS.STR_RID
 ,SS.str_raw + ' >' AS INP_STR
 ,NUMS.N
 ,SUBSTRING(SS.str_raw + ' > ',(NUMS.N * 4) -3, 4) AS GRP_STR
FROM SAMPLE_DATA SS
CROSS APPLY 
(
  SELECT TOP((LEN(SS.str_raw) + 3) / 4 )
   ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N 
  FROM T T1,T T2,T T3,T T4
) 
NUMS(N)
)
,COMP_SET AS 
(
SELECT
 GS.STR_RID
 ,GS.GRP_STR
 ,LEAD(GS.GRP_STR,1,NULL) OVER 
   (
    PARTITION BY GS.STR_RID
    ORDER BY  GS.N 
   ) AS CMP_STR
 ,GS.N
FROM GROUPED_SET  GS
)
,OUTPUT_SET AS
(
  SELECT 
  CS.STR_RID
  ,(
   SELECT 
    CONCAT
      (
       ''
       ,CASE 
        WHEN CSS.GRP_STR = CSS.CMP_STR THEN ''
        ELSE CSS.GRP_STR
       END 
      )
   FROM  COMP_SET   CSS
   WHERE CSS.STR_RID = CS.STR_RID
   FOR XML PATH (''), TYPE 
  ).value('(./text())[1]','VARCHAR(MAX)') AS OUT_STR_RAW
  FROM  COMP_SET  CS
  GROUP BY CS.STR_RID
)
SELECT
  OS.STR_RID
 ,LTRIM(STUFF(OS.OUT_STR_RAW,LEN(OS.OUT_STR_RAW) - 1,2,'')) AS FINAL_STR
FROM OUTPUT_SET OS;
DROP TABLE #stringsplit;

Output
1 A 2 A > C 3 A > C > A 4 A > C > B 5 D > B > D > C

Thom A SSC Guru Points: 99143 More actions · Answer 6

Interesting idea Eirikur. If I read your code correctly, however, this does assume that the value between the the Greater Than symbols is only 1 character.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 7

Thom A - Tuesday, April 24, 2018 9:03 AM
Interesting idea Eirikur. If I read your code correctly, however, this does assume that the value between the the Greater Than symbols is only 1 character.

You are correct Tom, the sample data only indicates one space character between the GT's and the values.
😎

If the sample data does not accurately reflect the characteristics of the real data, then this can easily be adjusted, I'm not in the business of assuming or guessing😉

In this case, the two major factors, when it comes to performance, are the iterations in the delimiter detection part and the iterations in the reconstruction of the output string. Since the sample data indicates that this is a fixed width entry, we can reduce the number of iterations in the former by the length of each entry, here that is 1/4.

Another factors are the plan simplification, the reduction of blocking (sort) operators from 7 to 1, on a large sets, this makes a lot of difference, the number of joins from 11 to 9 and the reduction of the total number of operators from approx. 67 to 48.

Although the batch execution in the execution plan is not always too accurate, the difference here is 9:1 on my old laptop.

Alan Burstein SSC Guru Points: 61152 More actions · Answer 8

Another really fast way to tackle this (using NGrams8k - link is in my signature):

SELECT * FROM #stringsplit s CROSS APPLY ( SELECT STUFF(REPLACE(( SELECT ' > ' + token FROM (VALUES (s.str_raw)) s(str_raw) CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string) CROSS APPLY dbo.ngrams8k(clean.string,1) WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1 OR position = len(clean.string) ORDER BY position FOR XML PATH('')),'>','>'),1,3,'') ) new(string);

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 9

Alan.B - Tuesday, April 24, 2018 1:03 PM
Another really fast way to tackle this (using NGrams8k - link is in my signature):
SELECT * FROM #stringsplit s CROSS APPLY ( SELECT STUFF(REPLACE(( SELECT ' > ' + token FROM (VALUES (s.str_raw)) s(str_raw) CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string) CROSS APPLY dbo.ngrams8k(clean.string,1) WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1 OR position = len(clean.string) ORDER BY position FOR XML PATH('')),'>','>'),1,3,'') ) new(string);

Very nice Alan, had totally forgotten about your nasty fast NGrams functions, this is far faster than my code.
😎

ScottPletcher SSC Guru Points: 101238 More actions · Answer 10

Eirikur Eiriksson - Tuesday, April 24, 2018 11:54 PM
Alan.B - Tuesday, April 24, 2018 1:03 PM
Another really fast way to tackle this (using NGrams8k - link is in my signature):
SELECT * FROM #stringsplit s CROSS APPLY ( SELECT STUFF(REPLACE(( SELECT ' > ' + token FROM (VALUES (s.str_raw)) s(str_raw) CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string) CROSS APPLY dbo.ngrams8k(clean.string,1) WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1 OR position = len(clean.string) ORDER BY position FOR XML PATH('')),'>','>'),1,3,'') ) new(string);
Very nice Alan, had totally forgotten about your nasty fast NGrams functions, this is far faster than my code.
😎

Agree, very nice indeed.
But I think, based on my reading, that the WHERE can be simplified to get rid of the patindex() call:
...
WHERE substring(clean.string, position+1, 1) <> token ...
...

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".