I ran across an interesting problem the other day that required this and I thought I’d post on this topic of n-Tuples as a result.
First what is an n-Tuple? Suppose you have a list of unique character strings like ‘1’ ‘2’ ‘3’ and these are stored as rows in a table. Tuples are simply combinations of these values, so 1-Tuples would be:
1
2
3
2-Tuples would be:
1,2
1,3
2,1
2,3
3,1
3,2
And of course there are these 3-Tuples:
1,2,3
1,3,2
2,1,3
2,3,1
3,1,2
3,2,1
Let’s now write a SQL query that generates Tuples for all n where we have unique strings we want to combine:
DECLARE @t TABLE (strcol CHAR(3)) INSERT INTO @t (strcol) SELECT '001' UNION ALL SELECT '002' UNION ALL SELECT '003' --UNION ALL SELECT '004' UNION ALL SELECT '005' UNION ALL SELECT '006' --UNION ALL SELECT '007' UNION ALL SELECT '008' --UNION ALL SELECT '009' UNION ALL SELECT '010' ;WITH nTuples (n, Tuples) AS ( SELECT DISTINCT 1, CAST(strcol AS VARCHAR(max)) FROM @t UNION ALL SELECT 1 + n.n, t.strcol + ',' + n.Tuples FROM @t t JOIN nTuples n ON t.strcol <> n.Tuples WHERE CHARINDEX(t.strcol, n.Tuples) = 0 ) SELECT * FROM nTuples ORDER BY n, Tuples
This recursive CTE returns the following results set when run for n=3:
n Tuples
1 001
1 002
1 003
2 001,002
2 001,003
2 002,001
2 002,003
2 003,001
2 003,002
3 001,002,003
3 001,003,002
3 002,001,003
3 002,003,001
3 003,001,002
3 003,002,001
Interestingly enough, that wasn’t really the result I needed but it was pretty close. What I really needed was what I call “unique, ordered” n-Tuples. In other words, if I list out the 2-Tuples from above, I can show you only the ones I need:
1,2 [Want]
1,3 [Want]
2,1 [Don’t Want]
2,3 [Want]
3,1 [Don’t Want]
3,2 [Don’t Want]
This means that 1,2 is equivalent to 2,1 but I want only want the first one where the data is ordered across the Tuple. Likewise, there will then be only 1 3-Tuple that I’m interested in: 1,2,3
This actually ends up being a pretty simple change to the above CTE. The not equals (<>) has been changed to less than (<).
;WITH UNIQUEnTuples (n, Tuples) AS ( SELECT DISTINCT 1, CAST(strcol AS VARCHAR(max)) FROM @t UNION ALL SELECT 1 + n.n, t.strcol + ',' + n.Tuples FROM @t t JOIN UNIQUEnTuples n ON t.strcol < n.Tuples WHERE CHARINDEX(t.strcol, n.Tuples) = 0 ) SELECT * FROM UNIQUEnTuples ORDER BY n, Tuples
The results from running this SQL with n=3 are:
n Tuples
1 001
1 002
1 003
2 001,002
2 001,003
2 002,003
3 001,002,003
Thinking about it, I realized that the row sets generated by these bad boys are going to grow quite quickly! After a little bit of playing around and without resorting to looking up the combinatorial statistics, I was able to deduce that the number of records in the respective row sets are generated by these formulas:
n-Tuples: Sum over n + n*(n-1) + n*(n-1)*(n-2) + … + n*(n-1)*…*(n-(n-1))
Unique, ordered, n-Tuples: 2^n-1
For the n-Tuples formula, you can verify in a spreadsheet that the last 2 factors in the series for n>=2 are n! and that, my intrepid reader, becomes a really big number very quickly!
Of course, being constantly performance-minded, I couldn’t help but wonder how well SQL would handle these huge row sets, so I set about to running my SQL across incrementing n’s.
For n-Tuples, I got these results (CPU and elapsed time are in milliseconds):
n #n-Tuples CPU Elapsed
1 1 0 1
2 4 0 1
3 15 0 1
4 64 0 2
5 325 15 83
6 1956 47 202
7 13699 468 831
8 109600 3682 7033
9 986409 41809 116940
10 9864100 485054 1410456
At n=10, I decided to stop because I’d be too impatient to run for n=11!
But you can take the unique, ordered, n-Tuples much further as the progression of increase in number of the rows in the row sets is much slower.
n #n-Tuples CPU Elapsed
1 1 0 0
2 3 0 0
3 7 0 0
4 15 0 0
5 31 0 2
6 63 15 2
7 127 0 4
8 255 15 62
9 511 0 34
10 1023 16 70
11 2047 63 264
12 4095 109 218
13 8191 234 392
14 16383 453 799
15 32767 1030 1717
16 65535 2230 3711
17 131071 4555 8535
18 262143 10452 27458
19 524287 21871 58594
20 1048575 45802 135406
21 2097151 -- --
22 4194303 -- --
23 8388607 -- --
Once again, I’ll be too impatient to wait for n=21 to complete!
There are some caveats to these functions that you should be aware of:
- If there are any duplicates in your source table, these will be eliminated in the results (by design).
- You need to take care if your data points look like A, AA, AAA. Note how the source table has defined this column as CHAR(3). This is also by design. The CHARINDEX statement would remove the n-Tuples (where n>1) of this particular set if you used VARCHAR(3) instead. So if you’re working with integers you want n-Tuples of, your best course of action is to left pad them with 0’s or right pad them with blanks.
We hope you enjoyed this quick excursion into the world of recursive CTEs and n-Tuples. We know we most certainly did. Perhaps even one day it will be as useful for you as it was for me.
Now if I can only figure out how these pesky recursive CTEs really work, then I could post something really interesting!