Split list of strings into all possible combinations

  • I have a table that contains one column and is populated as follows:

    declare @table table ( Col varchar(100) )

    insert @table ( Col )

    values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )

    I'm in need of a query that will split every row that has multiple values into all combinations, and return a distinct list of combinations. For the example above, I need to return:

    Each individual value (unique):

    ----------------------------------

    '1'

    '5'

    '21'

    '32'

    '54'

    '97'

    '115'

    '173'

    Groupings of two values:

    ---------------------------

    '1,32'

    '1,173'

    '32,173'

    '5,32'

    '21,32'

    '54,115'

    Groupings of three values:

    -----------------------------

    '1,32,173'

    In summary, for every row, all possible combinations; that is, for row '1,32,173', I need: '1', '32', '173', '1,32', '1,173', '32,173' and '1,32,173'. Any help is greatly appreciated.
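    The requested output can be stated compactly: for each row, every non-empty subset of its comma-separated values, deduplicated across rows. As a language-neutral illustration (not part of the original thread), the same set logic in Python using itertools.combinations:

```python
from itertools import combinations

rows = ['173', '97', '54,115', '1,32,173', '21,32', '5,32', '32,173']

# Every non-empty subset of each row's values, deduplicated across rows.
result = set()
for row in rows:
    values = row.split(',')
    for size in range(1, len(values) + 1):
        for combo in combinations(values, size):
            result.add(','.join(combo))

# Sort numerically by the values in each combination for display.
for combo in sorted(result, key=lambda c: [int(v) for v in c.split(',')]):
    print(combo)
```

    For the sample data this yields 15 distinct combinations: the 8 single values, the 6 pairs, and the one triple '1,32,173'.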

  • My solution could certainly be optimized and simplified, but here's one that uses DelimitedSplit8K (see the link in my signature):

    -- your sample data

    declare @table table ( Col varchar(100) )

    insert @table (Col)

    values ('173'), ('97'), ('54,115' ), ('1,32,173'), ('21,32'), ('5,32'), ('32,173');

    -- solution:

    WITH trips AS

    (

    SELECT col, item=item+0

    FROM @table

    CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

    WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

    )

    -- All combinations when there's 3

    SELECT a.col, item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

    FROM trips a, trips b

    WHERE a.col = b.col AND a.item < b.item

    UNION ALL

    SELECT col, col

    FROM @table

    WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

    UNION ALL

    SELECT col, col

    FROM @table

    WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

    UNION ALL

    SELECT col, item

    FROM @table

    CROSS APPLY dbo.DelimitedSplit8K(col,',')

    ORDER BY col -- not required, will slow you down. Included for demo purposes only!

    You'll notice three UNION ALLs. The first query after the CTE gets all two-value combinations from the three-value rows. The second returns the un-split three-value strings, the third returns the un-split two-value strings, and the final query returns the single values split out from all the delimited and un-delimited strings.

    edit: added some explanation.
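    The pair-building trick in the first query is worth calling out: self-joining the split values with the strict inequality `a.item < b.item` emits each unordered pair exactly once, smaller value first. A quick Python sketch of the same idea (illustrative, not from the thread):

```python
# Split values of one three-element row, as the trips CTE would produce them.
items = [1, 32, 173]

# Self-"join" with a strict inequality: each unordered pair appears once,
# smaller value first, mirroring WHERE a.item < b.item in the SQL.
pairs = [f'{a},{b}' for a in items for b in items if a < b]
print(pairs)  # ['1,32', '1,173', '32,173']
```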

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (11/29/2016)



    I actually ended up finding a different solution. Not sure it's faster, but it works nicely. First, I added an identity column to my table variable, but that's not required. The query is as follows:

    declare @table table ( IndexID int identity(1,1), Col varchar(100) )

    insert @table ( Col )

    values ( '173' ), ( '97' ), ( '54,115' ), ( '1,32,173' ), ( '21,32' ), ( '5,32' ), ( '32,173' )

    -- solution

    ;with Z as
    (
        select RowID = Y.IndexID,
               ColID = V.ID,
               Col   = V.ListValues
        from @table Y
        cross apply FN_ListToTable_8K ( Y.Col, ',' ) V
    ), W as
    (
        select RowID,
               NoCols = max(ColID)
        from Z
        group by RowID
    ), R as
    (
        select ID = 1,
               RowID,
               Col,
               LastCol = convert(int, Col)
        from Z
        union all
        select ID = R.ID + 1,
               RowID = R.RowID,
               Col = R.Col + ',' + Z.Col,
               LastCol = convert(int, Z.Col)
        from W
        inner join R on R.RowID = W.RowID and R.ID < W.NoCols
        inner join Z on Z.RowID = R.RowID and convert(int, Z.Col) > R.LastCol
    )

    select distinct Col from R order by Col

    -- The first CTE transforms the strings into rows while keeping the reference to the original RowID (IndexID in this case).

    -- The second CTE gets the number of elements in each original RowID.

    -- The third is a recursive CTE where I keep track of the last element added to the comma-delimited list (the largest value) and recurse up to the number of elements in each original RowID.
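    The recursion in R can be pictured as breadth-first growth of combination strings: the anchor starts one combination per single value, and each pass appends one value strictly larger than the last one added, which is what prevents duplicate orderings. A hedged Python sketch of that expansion for one row (names are illustrative, not from the thread):

```python
# One row's split values, as the Z CTE would produce them.
values = [1, 32, 173]

# Anchor: every single value starts a combination; track the last value added.
combos = [([v], v) for v in values]
frontier = combos[:]

# Recursive step: extend each combination with a strictly larger value,
# mirroring convert(int, Z.Col) > R.LastCol in the CTE.
while frontier:
    new = [(combo + [v], v) for combo, last in frontier for v in values if v > last]
    combos.extend(new)
    frontier = new

result = sorted(','.join(map(str, c)) for c, _ in combos)
print(result)  # all 7 non-empty subsets of [1, 32, 173]
```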


  • N_Muller (11/30/2016)



    First, I noticed an issue with my solution where I was getting duplicates, here's an updated version:

    declare @table table (IndexID int identity(1,1), Col varchar(100));

    insert @table (Col)

    values ('173'), ('97'), ('54,115' ), ('1,32,173'), ('21,32'), ('5,32'), ('32,173');

    WITH trips AS

    (

    SELECT col, item=item+0

    FROM @table

    CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

    WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

    )

    -- All combinations when there's 3

    SELECT item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

    FROM trips a, trips b

    WHERE a.col = b.col AND a.item < b.item

    UNION ALL

    SELECT col

    FROM @table

    WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

    UNION ALL

    SELECT col

    FROM @table

    WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

    UNION

    SELECT item

    FROM @table

    CROSS APPLY dbo.DelimitedSplit8K(col,',')

    ORDER BY item; -- not required.

    Next, if FN_ListToTable_8K is not an Inline Table Valued function then you need to change that. A multi-line Table Valued splitter function will KILL your performance. The solution you are using includes a recursive CTE - those can be costly too. Here's a performance test to compare techniques:

    SET NOCOUNT ON;

    -- Sample data:

    declare @table table (IndexID int identity(1,1), Col varchar(100));

    WITH parts(n,p) AS

    (

    SELECT TOP (10000)

    '1'+REPLICATE('0',(abs(checksum(newid())%3)+1)), abs(checksum(newid())%3)+1

    FROM sys.all_columns

    )

    INSERT @table

    SELECT

    CASE p

    WHEN 1 THEN cast((abs(checksum(newid())%n)+1) AS varchar(4))

    WHEN 2 THEN cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

    cast((abs(checksum(newid())%n)+1) AS varchar(4))

    ELSE cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

    cast((abs(checksum(newid())%n)+1) AS varchar(4))+','+

    cast((abs(checksum(newid())%n)+1) AS varchar(4))

    END

    FROM parts;

    --SELECT * FROM @table

    DBCC TRACEON(2453); -- for better cardinality estimates on temp variables

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    PRINT 'MY SOLUTION: '+char(13)+char(10)+REPLICATE('-',50);

    WITH trips AS

    (

    SELECT col, item=item+0

    FROM @table

    CROSS APPLY dbo.DelimitedSplit8K(col,',') s1

    WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

    )

    -- All combinations when there's 3

    SELECT item = CAST(a.item AS varchar(10))+ ',' +CAST(b.item AS varchar(10))

    FROM trips a, trips b

    WHERE a.col = b.col AND a.item < b.item

    UNION ALL

    SELECT col

    FROM @table

    WHERE LEN(col)-2 = LEN(REPLACE(col,',',''))

    UNION ALL

    SELECT col

    FROM @table

    WHERE LEN(col)-1 = LEN(REPLACE(col,',',''))

    UNION

    SELECT item

    FROM @table

    CROSS APPLY dbo.DelimitedSplit8K(col,',')

    --ORDER BY item

    PRINT 'RCTE:'+char(13)+char(10)+REPLICATE('-',50);

    ;with Z as
    (
        select RowID = Y.IndexID,
               ColID = V.ItemNumber,
               Col   = V.Item
        from @table Y
        cross apply dbo.DelimitedSplit8K ( Y.Col, ',' ) V
    ), W as
    (
        select RowID,
               NoCols = max(ColID)
        from Z
        group by RowID
    ), R as
    (
        select ID = 1,
               RowID,
               Col,
               LastCol = convert(int, Col)
        from Z
        union all
        select ID = R.ID + 1,
               RowID = R.RowID,
               Col = R.Col + ',' + Z.Col,
               LastCol = convert(int, Z.Col)
        from W
        inner join R on R.RowID = W.RowID and R.ID < W.NoCols
        inner join Z on Z.RowID = R.RowID and convert(int, Z.Col) > R.LastCol
    )

    SELECT DISTINCT Col
    FROM R

    --ORDER BY col;

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    DBCC TRACEOFF(2453);

    Perf Test results:

    MY SOLUTION:

    --------------------------------------------------

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server parse and compile time:

    CPU time = 31 ms, elapsed time = 32 ms.

    Table '#A73A2F66'. Scan count 25, logical reads 120, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 3458, logical reads 98032, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 547 ms, elapsed time = 427 ms.

    RCTE:

    --------------------------------------------------

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server parse and compile time:

    CPU time = 16 ms, elapsed time = 21 ms.

    Table 'Worktable'. Scan count 58811, logical reads 392249, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#A73A2F66'. Scan count 3, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1029 ms, elapsed time = 1193 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    You can see that, even with a good splitter, the recursive CTE solution is less than half as fast and produces roughly 4X the number of reads.


  • N_Muller (11/30/2016)



    Can you post a copy of the FN_ListToTable_8K function so that we can try out your code? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/6/2016)

    Can you post a copy of the FN_ListToTable_8K function so that we can try out your code? Thanks.

    CREATE FUNCTION [dbo].[FN_ListToTable_8K]
    (
        @list      VARCHAR(8000),
        @delimiter CHAR(1)
    )
    RETURNS TABLE WITH SCHEMABINDING
    AS
    RETURN
    --
    -- This function has had minimal, if any, modification to the code
    -- by Jeff Moden (SQL Server Central)
    --
    -- CTE-driven "Tally Table" produces values from 1 up to 10,000...
    -- enough to cover VARCHAR(8000)
    WITH E1(N) AS
    (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),                                      -- 10E+1 or 10 rows
    E2(N) AS ( SELECT 1 FROM E1 a, E1 b ),  -- 10E+2 or 100 rows
    E4(N) AS ( SELECT 1 FROM E2 a, E2 b ),  -- 10E+4 or 10,000 rows
    cteTally(N) AS
    (
        -- This provides the "base" CTE and limits the number of rows right up front
        -- for both a performance gain and prevention of accidental "overruns"
        SELECT TOP (ISNULL(DATALENGTH(@list),0))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM E4
    ),
    cteStart(N1) AS
    (
        -- This returns N+1 (starting position of each "element" just once for each delimiter)
        SELECT 1
        UNION ALL
        SELECT t.N + 1
        FROM cteTally t
        WHERE SUBSTRING(@list, t.N, 1) = @delimiter
    ),
    cteLen(N1, L1) AS
    (
        -- Return start and length (for use in SUBSTRING)
        SELECT s.N1,
               ISNULL(NULLIF(CHARINDEX(@delimiter, @list, s.N1), 0) - s.N1, 8000)
        FROM cteStart s
    )
    -- Do the actual split. The ISNULL/NULLIF combo handles the length for the final
    -- element when no delimiter is found.
    SELECT ID         = ROW_NUMBER() OVER (ORDER BY l.N1),
           ListValues = SUBSTRING(@list, l.N1, l.L1)
    FROM cteLen l
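    For readers without the function installed, its position arithmetic can be sketched outside SQL: element start positions are position 1 plus one past each delimiter position, and each element's length runs to the next delimiter, with the ISNULL/NULLIF combo falling back to "rest of the string" for the final element. This is an illustrative Python port of that logic (1-based like the SQL), not the function itself:

```python
def list_to_table(text, delimiter=','):
    """Illustrative port of the FN_ListToTable_8K position logic.
    Start positions (1-based) are 1 and every delimiter position + 1;
    lengths run to the next delimiter, or to the end for the last element."""
    # cteStart: 1, plus N+1 for every position N holding the delimiter
    starts = [1] + [i + 2 for i, ch in enumerate(text) if ch == delimiter]
    rows = []
    for item_number, n1 in enumerate(starts, start=1):
        # cteLen: CHARINDEX from n1; "not found" maps to "rest of string"
        next_delim = text.find(delimiter, n1 - 1)
        length = (next_delim + 1 - n1) if next_delim != -1 else len(text) - n1 + 1
        rows.append((item_number, text[n1 - 1 : n1 - 1 + length]))
    return rows

print(list_to_table('1,32,173'))  # [(1, '1'), (2, '32'), (3, '173')]
```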

  • N_Muller (12/6/2016)



    I'm thinking that code is going to work pretty well for you. 😀 Thanks for the mention in the comments.

    --Jeff Moden



  • Shifting gears back to the problem at hand, will there ever be CSVs with more than 3 elements?

    --Jeff Moden



  • Jeff Moden (12/6/2016)


    Shifting gears back to the problem at hand, will there ever be CSVs with more than 3 elements?

    Yes, but the most I've seen in my scenarios is about 20.
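    That answer matters for whichever approach wins: a row with n elements expands to 2^n - 1 non-empty combinations, so the output grows exponentially with element count. A quick check of the growth (illustrative arithmetic only):

```python
# Non-empty subsets of an n-element list: 2**n - 1 combinations per row.
for n in (3, 10, 20):
    print(n, 2**n - 1)
# prints: 3 7, then 10 1023, then 20 1048575
```

    At 20 elements a single row already produces over a million combinations, so the hard-coded-width approach stops being practical well before then.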

  • Jeff Moden (12/6/2016)

    I'm thinking that code is going to work pretty well for you. 😀 Thanks for the mention in the comments.

    I give credit, where credit is due 🙂
