Split list of string into all possible combinations

Question

Split list of string into all possible combinations

N_Muller

SSCarpal Tunnel

Points: 4037
More actions
November 29, 2016 at 2:36 pm

#319348

I have a table that contains one column and is populated as follows:
declare @table table ( Col varchar(100) )
insert @table ( Col )
values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )
I'm in need of a query that will split every row that has multiple values into all combinations, and return a distinct list of combinations. For the example above, I need to return:
Each individual value (unique):
----------------------------------
'1'
'5'
'21'
'32'
'97'
'115'
'173'
Groupings of two values:
---------------------------
'1,32'
'1,173'
'32,173'
'5,32'
'21,32'
'54,115'
Groupings of three values:
-----------------------------
'1,32,173'
In summary, for every row, all possible combinations; that is, for row '1,32,173', I need: '1', '32', '173', '1,32', '1,173', '32,173' and '1,32,173'. Any help is greatly appreciated.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Alan Burstein SSC Guru Points: 61141 More actions · Answer 1

My solution can be totally optimized and simplified but here's a solution that uses DelimitedSplit8K (see the link in my signature):

-- your sample data

declare @table table ( Col varchar(100) )

insert @table (Col)

values ('173'), ('97'), ('54,115' ), ('1,32,173'), ('21,32'), ('5,32'), ('32,173');

-- solution:

WITH trips AS

(

SELECT col, item=item+0

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

)

-- All combinations when there's 3

SELECT a.col, item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

FROM trips a, trips b

WHERE a.col = b.col AND a.item < b.item

UNION ALL

SELECT col, col

FROM @table

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

UNION ALL

SELECT col, col

FROM @table

WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

UNION ALL

SELECT col, item

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',')

ORDER BY col -- not required, will slow you down. Included for demo purposes only!

You'll notice three UNION ALL's. The first query after the CTE gets all combinations when there's 3 values. The second gets me the un-split three-value values. The third one gets me the un-split two-value values and the final query is the single values split out from all the delimited and un-delimited strings.

edit: added some explanation.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 2

Alan.B (11/29/2016)
My solution can be totally optimized and simplified but here's a solution that uses DelimitedSplit8K (see the link in my signature):
You'll notice three UNION ALL's. The first query after the CTE gets all combinations when there's 3 values. The second gets me the un-split three-value values. The third one gets me the un-split two-value values and the final query is the single values split out from all the delimited and un-delimited strings.
edit: added some explanation.

I actually ended up finding a different solution. Not sure it's faster, but it works nicely. First, I added an identity column to my table variable, but that's not required. The query is as follows:

declare @table table ( IndexID int identity(1,1), Col varchar(100) )

insert @table ( Col )

values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )

-- solution

;with Z as

(

selectRowID = Y.IndexID,

ColID = V.ID,

Col = V.ListValues

from@table Y cross apply

FN_ListToTable_8K ( Y.Col, ',' ) V

), W as

(

selectRowID,

NoCols = max(ColID)

fromZ

group by RowID

), R as

(

selectID = 1,

RowID,

Col,

LastCol = convert(int,Col)

fromZ

union all

selectID = R.ID + 1,

RowID = R.RowID,

Col = R.Col + ',' + Z.Col,

LastCol = convert(int,Z.Col)

fromW inner join

R on

R.RowID = W.RowID and

R.ID < W.NoCols inner join

Z on

Z.RowID = R.RowID and

convert(int,Z.Col) > R.LastCol

)

select distinct Col from R order by Col

-- The first CTE transforms the strings into rows while keeping the reference to the original RowID (IndexID in this case).

-- The second CTE gets the number of elements in each original RowID.

-- The third is a recursive CTE where I keep track of the last element added to the comma delimited list (largest value) and loops for the largest number of elements in each original RowID.

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 3

N_Muller (11/30/2016)
Alan.B (11/29/2016)
My solution can be totally optimized and simplified but here's a solution that uses DelimitedSplit8K (see the link in my signature):
You'll notice three UNION ALL's. The first query after the CTE gets all combinations when there's 3 values. The second gets me the un-split three-value values. The third one gets me the un-split two-value values and the final query is the single values split out from all the delimited and un-delimited strings.
edit: added some explanation.
I actually ended up finding a different solution. Not sure it's faster, but it works nicely. First, I added an identity column to my table variable, but that's not required. The query is as follows:

Now in a way that one can actually read the code :w00t:

declare @table table ( IndexID int identity(1,1), Col varchar(100) )

insert @table ( Col )

values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )

;with Z as

(

selectRowID = Y.IndexID,

ColID = V.ID,

Col = V.ListValues

from@table Y cross apply

FN_ListToTable_8K ( Y.Col, ',' ) V

), W as

(

selectRowID,

NoCols = max(ColID)

fromZ

group by RowID

), R as

(

selectID = 1,

RowID,

Col,

LastCol = convert(int,Col)

fromZ

union all

selectID = R.ID + 1,

RowID = R.RowID,

Col = R.Col + ',' + Z.Col,

LastCol = convert(int,Z.Col)

fromW inner join

R on

R.RowID = W.RowID and

R.ID < W.NoCols inner join

Z on

Z.RowID = R.RowID and

convert(int,Z.Col) > R.LastCol

)

selectdistinct Col from R order by Col

Alan Burstein SSC Guru Points: 61141 More actions · Answer 4

N_Muller (11/30/2016)
Alan.B (11/29/2016)
My solution can be totally optimized and simplified but here's a solution that uses DelimitedSplit8K (see the link in my signature):
You'll notice three UNION ALL's. The first query after the CTE gets all combinations when there's 3 values. The second gets me the un-split three-value values. The third one gets me the un-split two-value values and the final query is the single values split out from all the delimited and un-delimited strings.
edit: added some explanation.
I actually ended up finding a different solution. Not sure it's faster, but it works nicely. First, I added an identity column to my table variable, but that's not required. The query is as follows:
declare @table table ( IndexID int identity(1,1), Col varchar(100) )
insert @table ( Col )
values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )
-- solution
;with Z as
(
selectRowID = Y.IndexID,
ColID = V.ID,
Col = V.ListValues
from@table Y cross apply
FN_ListToTable_8K ( Y.Col, ',' ) V
), W as
(
selectRowID,
NoCols = max(ColID)
fromZ
group by RowID
), R as
(
selectID = 1,
RowID,
Col,
LastCol = convert(int,Col)
fromZ
union all
selectID = R.ID + 1,
RowID = R.RowID,
Col = R.Col + ',' + Z.Col,
LastCol = convert(int,Z.Col)
fromW inner join
R on
R.RowID = W.RowID and
R.ID < W.NoCols inner join
Z on
Z.RowID = R.RowID and
convert(int,Z.Col) > R.LastCol
)
select distinct Col from R order by Col
-- The first CTE transforms the strings into rows while keeping the reference to the original RowID (IndexID in this case).
-- The second CTE gets the number of elements in each original RowID.
-- The third is a recursive CTE where I keep track of the last element added to the comma delimited list (largest value) and loops for the largest number of elements in each original RowID.

First, I noticed an issue with my solution where I was getting duplicates, here's an updated version:

declare @table table (IndexID int identity(1,1), Col varchar(100));

insert @table (Col)

values ('173'), ('97'), ('54,115' ), ('1,32,173'), ('21,32'), ('5,32'), ('32,173');

WITH trips AS

(

SELECT col, item=item+0

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

)

-- All combinations when there's 3

SELECT item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

FROM trips a, trips b

WHERE a.col = b.col AND a.item < b.item

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

UNION

SELECT item

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',')

ORDER BY item; -- not required.

Next, if FN_ListToTable_8K is not an Inline Table Valued function then you need to change that. A multi-line Table Valued splitter function will KILL your performance. The solution you are using includes a recursive CTE - those can be costly too. Here's a performance test to compare techniques:

SET NOCOUNT ON;

-- Sample data:

declare @table table (IndexID int identity(1,1), Col varchar(100));

WITH parts(n,p) AS

(

SELECT TOP (10000)

'1'+REPLICATE('0',(abs(checksum(newid())%3)+1)), abs(checksum(newid())%3)+1

FROM sys.all_columns

)

INSERT @table

SELECT

CASE p

WHEN 1 THEN cast((abs(checksum(newid())%n)+1) AS varchar(4))

WHEN 2 THEN cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

cast((abs(checksum(newid())%n)+1) AS varchar(4))

ELSE cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

cast((abs(checksum(newid())%n)+1) AS varchar(4))

END

FROM parts;

--SELECT * FROM @table

DBCC TRACEON(2453); -- for better cardinality estimates on temp variables

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

PRINT 'MY SOLUTION: '+char(13)+char(10)+REPLICATE('-',50);

WITH trips AS

(

SELECT col, item=item+0

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

)

-- All combinations when there's 3

SELECT item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

FROM trips a, trips b

WHERE a.col = b.col AND a.item < b.item

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

UNION ALL

SELECT col

FROM @table

WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

UNION

SELECT item

FROM @table

CROSS APPLY dbo.DelimitedSplit8K(col,',')

--ORDER BY item

PRINT 'RCTE:'+char(13)+char(10)+REPLICATE('-',50);

;with Z as

(

selectRowID = Y.IndexID,

ColID = V.ItemNumber,

Col = V.Item

from@table Y cross APPLY dbo.DelimitedSplit8K ( Y.Col, ',' ) V

), W as

(

selectRowID,

NoCols = max(ColID)

fromZ

group by RowID

), R as

(

selectID = 1,

RowID,

Col,

LastCol = convert(int,Col)

fromZ

union all

selectID = R.ID + 1,

RowID = R.RowID,

Col = R.Col + ',' + Z.Col,

LastCol = convert(int,Z.Col)

fromW inner join

R on

R.RowID = W.RowID and

R.ID < W.NoCols inner join

Z on

Z.RowID = R.RowID and

convert(int,Z.Col) > R.LastCol

)

SELECT DISTINCT Col

FROM R

--ORDER BY col;

SET STATISTICS TIME OFF;

SET STATISTICS IO OFF;

DBCC TRACEOFF(2453);

Perf Test results:

MY SOLUTION:

--------------------------------------------------

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

SQL Server parse and compile time:

CPU time = 31 ms, elapsed time = 32 ms.

Table '#A73A2F66'. Scan count 25, logical reads 120, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 3458, logical reads 98032, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 547 ms, elapsed time = 427 ms.

RCTE:

--------------------------------------------------

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

SQL Server parse and compile time:

CPU time = 16 ms, elapsed time = 21 ms.

Table 'Worktable'. Scan count 58811, logical reads 392249, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#A73A2F66'. Scan count 3, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 1029 ms, elapsed time = 1193 ms.

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

You can see - even with a good splitter the recursive CTE solution is less than half as fast and produces 4X the number of reads.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Jeff Moden SSC Guru Points: 1004340 More actions · Answer 5

N_Muller (11/30/2016)
Alan.B (11/29/2016)
My solution can be totally optimized and simplified but here's a solution that uses DelimitedSplit8K (see the link in my signature):
You'll notice three UNION ALL's. The first query after the CTE gets all combinations when there's 3 values. The second gets me the un-split three-value values. The third one gets me the un-split two-value values and the final query is the single values split out from all the delimited and un-delimited strings.
edit: added some explanation.
I actually ended up finding a different solution. Not sure it's faster, but it works nicely. First, I added an identity column to my table variable, but that's not required. The query is as follows:
declare @table table ( IndexID int identity(1,1), Col varchar(100) )
insert @table ( Col )
values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )
-- solution
;with Z as
(
selectRowID = Y.IndexID,
ColID = V.ID,
Col = V.ListValues
from@table Y cross apply
FN_ListToTable_8K ( Y.Col, ',' ) V
), W as
(
selectRowID,
NoCols = max(ColID)
fromZ
group by RowID
), R as
(
selectID = 1,
RowID,
Col,
LastCol = convert(int,Col)
fromZ
union all
selectID = R.ID + 1,
RowID = R.RowID,
Col = R.Col + ',' + Z.Col,
LastCol = convert(int,Z.Col)
fromW inner join
R on
R.RowID = W.RowID and
R.ID < W.NoCols inner join
Z on
Z.RowID = R.RowID and
convert(int,Z.Col) > R.LastCol
)
select distinct Col from R order by Col
-- The first CTE transforms the strings into rows while keeping the reference to the original RowID (IndexID in this case).
-- The second CTE gets the number of elements in each original RowID.
-- The third is a recursive CTE where I keep track of the last element added to the comma delimited list (largest value) and loops for the largest number of elements in each original RowID.

Can you post a copy of the FN_ListToTable_8K function so that we can try out your code? Thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 6

Jeff Moden (12/6/2016)Can you post a copy of the FN_ListToTable_8K function so that we can try out your code? Thanks.

CREATE FUNCTION [dbo].[FN_ListToTable_8K]

(

@listVARCHAR(8000),

@delimiterCHAR(1)

)

RETURNS TABLE WITH SCHEMABINDING

AS

RETURN

--

-- Thist function has had minimal if any modification on code

-- by Jeff Moden (SQL Server Central)

--

-- CTE Driven "Tally Table" produces values from 1 up to 10,000...

-- enough to cover VARCHAR(8000)

WITH E1(N) AS

(

SELECT 1 UNION ALL

SELECT 1

), --10E+1 or 10 rows

E2(N) AS

(

SELECT1

FROME1 a,

E1 b

),--10E+2 or 100 rows

E4(N) AS

(

SELECT1

FROME2 a,

E2 b

),--10E+4 or 10,000 rows

cteTally(N) AS

(

--

--This provides the "base" CTE and limits the number of rows right up front

--for both a performance gain and prevention of accidental "overruns"

--

SELECTTOP

(ISNULL(DATALENGTH(@list),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROME4

),

cteStart(N1) AS

(

--

--This returns N+1 (starting position of each "element" just once for each delimiter)

--

SELECT1

UNION ALL

SELECTt.N+1

FROMcteTally t

WHERESUBSTRING(@list,t.N,1) = @delimiter

),

cteLen(N1,L1) AS

(

--

--Return start and length (for use in substring)

--

SELECTs.N1,

ISNULL(NULLIF(CHARINDEX(@delimiter,@list,s.N1),0)-s.N1,8000)

FROMcteStart s

)

--

--Do the actual split. The ISNULL/NULLIF combo handles the length for the final

--element when no delimiter is found.

--

SELECTID= ROW_NUMBER() OVER(ORDER BY l.N1),

ListValues= SUBSTRING(@list, l.N1, l.L1)

FROMcteLen l

Jeff Moden SSC Guru Points: 1004340 More actions · Answer 7

N_Muller (12/6/2016)
Jeff Moden (12/6/2016)Can you post a copy of the FN_ListToTable_8K function so that we can try out your code? Thanks.
CREATE FUNCTION [dbo].[FN_ListToTable_8K]
(
@listVARCHAR(8000),
@delimiterCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
--
-- Thist function has had minimal if any modification on code
-- by Jeff Moden (SQL Server Central)
--
-- CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS
(
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1 UNION ALL
SELECT 1
), --10E+1 or 10 rows
E2(N) AS
(
SELECT1
FROME1 a,
E1 b
),--10E+2 or 100 rows
E4(N) AS
(
SELECT1
FROME2 a,
E2 b
),--10E+4 or 10,000 rows
cteTally(N) AS
(
--
--This provides the "base" CTE and limits the number of rows right up front
--for both a performance gain and prevention of accidental "overruns"
--
SELECTTOP
(ISNULL(DATALENGTH(@list),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROME4
),
cteStart(N1) AS
(
--
--This returns N+1 (starting position of each "element" just once for each delimiter)
--
SELECT1
UNION ALL
SELECTt.N+1
FROMcteTally t
WHERESUBSTRING(@list,t.N,1) = @delimiter
),
cteLen(N1,L1) AS
(
--
--Return start and length (for use in substring)
--
SELECTs.N1,
ISNULL(NULLIF(CHARINDEX(@delimiter,@list,s.N1),0)-s.N1,8000)
FROMcteStart s
)
--
--Do the actual split. The ISNULL/NULLIF combo handles the length for the final
--element when no delimiter is found.
--
SELECTID= ROW_NUMBER() OVER(ORDER BY l.N1),
ListValues= SUBSTRING(@list, l.N1, l.L1)
FROMcteLen l

I'm thinking that code is going to work pretty well for you. 😀 Thanks for the mention in the comments.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004340 More actions · Answer 8

Shifting gears back to the problem at hand, will there ever be CSVs with more than 3 elements?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 9

Jeff Moden (12/6/2016)
Shifting gears back to the problem at hand, will there ever be CSVs with more than 3 elements?

Yes, but the most I've seen in my scenarios is about 20.

N_Muller SSCarpal Tunnel Points: 4037 More actions · Answer 10

Jeff Moden (12/6/2016)
I'm thinking that code is going to work pretty well for you. 😀 Thanks for the mention in the comments.

I give credit, where credit is due 🙂