April 24, 2018 at 1:47 am
Hello experts, I have a problem I'd appreciate advice on please.
I'm asked to determine distinct combinations of sequences of values from a field which contains delimited values. My issue is that I need to remove instances of duplicated strings, while preserving sort order.
Examples of the input and output are given in the table:
create table #stringsplit (str_raw varchar(30), str_output varchar(30))
insert into #stringsplit (str_raw, str_output)
values ('A > A > A', 'A')
, ('A > A > A > A > C', 'A > C')
, ('A > A > A > A > C > A', 'A > C > A')
, ('A > A > A > A > C > B', 'A > C > B')
, ('D > B > B > D > C', 'D > B > D > C')
I've seen examples of user defined functions where users parse and reconstruct strings, but not where the order of occurrence of the substring is preserved.
Thanks
Mickey
April 24, 2018 at 2:10 am
One idea:
WITH Grps AS (
SELECT SS.str_raw,
DS.Item,
DS.ItemNumber,
ROW_NUMBER() OVER (PARTITION BY SS.str_raw ORDER BY DS.ItemNumber) -
ROW_NUMBER() OVER (PARTITION BY SS.str_raw, DS.Item ORDER BY DS.ItemNumber) AS Grp
FROM #stringsplit SS
CROSS APPLY DelimitedSplit8K (REPLACE(SS.str_raw,' ',''), '>') DS),
Distinctions AS (
SELECT DISTINCT str_raw, Item, Grp
FROM Grps)
SELECT STUFF(REPLACE((SELECT ' > ' + Item
FROM Distinctions sq
WHERE sq.str_raw = D.str_raw
ORDER BY Grp
FOR XML PATH ('')),'>','>'),1,3,'')
FROM Distinctions D
GROUP BY D.str_raw;
Note this uses jeff Mode's Delimited Split 8K; as STRING_SPLIT is only available on SQL Server 2016+.
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
April 24, 2018 at 2:58 am
Thanks Thom, I used the DelimitedSplit8K described by Jeff Moden - http://www.sqlservercentral.com/articles/Tally+Table/72993/
I get the results below:
A
A > C
A > A > C
A > C > B
D > B > D > C
Note the third row, which I'd want to be:
A > C > A
All other rows look good, but this one is out of sequence, can anyone describe why?
Thanks
Mickey
April 24, 2018 at 3:33 am
mickey w - Tuesday, April 24, 2018 2:58 AMThanks Thom, I used the DelimitedSplit8K described by Jeff Moden - http://www.sqlservercentral.com/articles/Tally+Table/72993/I get the results below:
A
A > C
A > A > C
A > C > B
D > B > D > CNote the third row, which I'd want to be:
A > C > A
All other rows look good, but this one is out of sequence, can anyone describe why?
Thanks
Mickey
Give this a go:
WITH Grps AS (
SELECT SS.str_raw,
DS.Item,
DS.ItemNumber,
ROW_NUMBER() OVER (PARTITION BY SS.str_raw ORDER BY DS.ItemNumber) -
ROW_NUMBER() OVER (PARTITION BY SS.str_raw, DS.Item ORDER BY DS.ItemNumber) AS Grp
FROM #stringsplit SS
CROSS APPLY DelimitedSplit8K (REPLACE(SS.str_raw,' ',''), '>') DS),
Distinctions AS (
SELECT str_raw, Item, GRP,
MIN(ItemNumber) AS ItemNumber
FROM Grps
GROUP BY str_raw, Item, Grp)
SELECT D.str_raw,
STUFF(REPLACE((SELECT ' > ' + Item
FROM Distinctions sq
WHERE sq.str_raw = D.str_raw
ORDER BY ItemNumber
FOR XML PATH ('')),'>','>'),1,3,'')
FROM Distinctions D
GROUP BY D.str_raw;
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
April 24, 2018 at 4:31 am
fantastic, thanks Thom this looks to have solved it.
much appreciated.
April 24, 2018 at 8:55 am
For fun, here is a more efficient way (I think 😉 not tested)
😎
Output
1 A
2 A > C
3 A > C > A
4 A > C > B
5 D > B > D > C
April 24, 2018 at 10:23 am
Thom A - Tuesday, April 24, 2018 9:03 AMInteresting idea Eirikur. If I read your code correctly, however, this does assume that the value between the the Greater Than symbols is only 1 character.
You are correct Tom, the sample data only indicates one space character between the GT's and the values.
😎
If the sample data does not accurately reflect the characteristics of the real data, then this can easily be adjusted, I'm not in the business of assuming or guessing😉
In this case, the two major factors, when it comes to performance, are the iterations in the delimiter detection part and the iterations in the reconstruction of the output string. Since the sample data indicates that this is a fixed width entry, we can reduce the number of iterations in the former by the length of each entry, here that is 1/4.
Another factors are the plan simplification, the reduction of blocking (sort) operators from 7 to 1, on a large sets, this makes a lot of difference, the number of joins from 11 to 9 and the reduction of the total number of operators from approx. 67 to 48.
Although the batch execution in the execution plan is not always too accurate, the difference here is 9:1 on my old laptop.
April 24, 2018 at 1:03 pm
Another really fast way to tackle this (using NGrams8k - link is in my signature):
SELECT *
FROM #stringsplit s
CROSS APPLY
(
SELECT STUFF(REPLACE((
SELECT ' > ' + token
FROM (VALUES (s.str_raw)) s(str_raw)
CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string)
CROSS APPLY dbo.ngrams8k(clean.string,1)
WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1
OR position = len(clean.string)
ORDER BY position
FOR XML PATH('')),'>','>'),1,3,'')
) new(string);
-- Itzik Ben-Gan 2001
April 24, 2018 at 11:54 pm
Alan.B - Tuesday, April 24, 2018 1:03 PMAnother really fast way to tackle this (using NGrams8k - link is in my signature):
SELECT *
FROM #stringsplit s
CROSS APPLY
(
SELECT STUFF(REPLACE((
SELECT ' > ' + token
FROM (VALUES (s.str_raw)) s(str_raw)
CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string)
CROSS APPLY dbo.ngrams8k(clean.string,1)
WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1
OR position = len(clean.string)
ORDER BY position
FOR XML PATH('')),'>','>'),1,3,'')
) new(string);
Very nice Alan, had totally forgotten about your nasty fast NGrams functions, this is far faster than my code.
😎
February 7, 2019 at 3:27 pm
Eirikur Eiriksson - Tuesday, April 24, 2018 11:54 PMAlan.B - Tuesday, April 24, 2018 1:03 PMAnother really fast way to tackle this (using NGrams8k - link is in my signature):
SELECT *
FROM #stringsplit s
CROSS APPLY
(
SELECT STUFF(REPLACE((
SELECT ' > ' + token
FROM (VALUES (s.str_raw)) s(str_raw)
CROSS APPLY (VALUES (REPLACE(s.str_raw, ' > ', ''))) clean(string)
CROSS APPLY dbo.ngrams8k(clean.string,1)
WHERE patindex('%[^'+token+']%', substring(clean.string, position+1, 8000)) = 1
OR position = len(clean.string)
ORDER BY position
FOR XML PATH('')),'>','>'),1,3,'')
) new(string);Very nice Alan, had totally forgotten about your nasty fast NGrams functions, this is far faster than my code.
😎
Agree, very nice indeed.
But I think, based on my reading, that the WHERE can be simplified to get rid of the patindex() call:
...
WHERE substring(clean.string, position+1, 1) <> token ...
...
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply