I’ve been using the set-based Sudoku solution as an example in my T-SQL class for a few years, and I found it very well in helping the students to understand:
· The major difference between the procedural approach and set-based approach;
· How the algorithm complexity could be affected by the choice of data structure;
· Different aspects of the language syntax;
· The performance comparison and considerations using different SQL objects.
The script in this post is a tuned version of T-SQL solution for Sudoku puzzle using the set-based approach. Its logic is straight forward, and does use any "not exists" logic. But it does contain a basic iteration loop. The assumption is the puzzle have a sigle solution. Most of the puzzles can be resolved within 20 iterations. Following is some of explanation of the script:
The Data Structure and the Terms used
A Cell: is the cell in the Sudoku puzzle. A resolved cell means we already know the final number in this cell in the solution. An unresolved cell means we still haven’t figured out which numbers should put in this cell yet. An unresolved cell must have multiple possible values. In the initial puzzle, all empty cells should have possible values from 1 to 9.
In this script, I used the binary number to present the possible value(s) in a cell, specifically, 1 =>21-1, 2=>22-1,…, 9=>28. E.g., if a cell has possible values of {4, 5}, its binary value should be 24-1+25-1=24 (in decimal format). In the initial puzzle, the decimal value of an empty cell should be 20+...+28=511.
I used the term Range to reference a row, or a column or a block. There are totally 27 ranges in the puzzle. Each of the 81 cells should be in exactly 3 range contexts: its row, its column and its block. By the game rule, there should be no duplicate number in any range in the solution.
A Combo means any non-empty combination of number 1, …, 9 (mapped to binary numbers of 20, ..., 28 respectively). Simply put, it is a non-empty subset of the set {1, …, 9}. Obviously, the decimal values of all Combos are exactly from 1 to 511.
A fit-cell for a Combo means: all current possible values of that cell are within the Combo set ( i.e., the current possible value set is a subset of that Combo). The T-SQL verification expression is: V - ( V & Combo) = 0, where V is the decimal number of possible values for the cell. While, The non-fit cells must have V - ( V & Combo) > 0.
An Encloded Range Combo, or interchangeably an Enclosed Combo, is a combo for a specific range, where the fit-cells in that range equals the possible numbers of that combo. For example, if in the first row of a puzzle, we have three cells with possible values of {1, 2, 3}, {2, 3} and {1, 3} respectively (decimal values are 7, 6 and 5 respectively), and all other cells has some possible values other than {1, 2, 3}. Then, these three cells are the only fit-cells for the combo {1, 2, 3} (V=7) in this row. So we say, the combo of 7 in the first row is an enclosed combo. You might already realize that number 1, 2 and 3 have to be arranged in these three cells, which further means, number 1, 2 and 3 should be taken out from all other cells’ possible values in the first row. As an extreme scenario, the number in a resolved cell should be taken out from any other cell’s possible values in the same row, column or a block (because a single value cell is enclosed by itself in any range context!)
The advantage of using binary number to present possible values in a cell is:
· Maximum the usage of the bitwise operators. With data structure, I can easily detect whether a cell has certain possible value(s) at current, and I can take out multiple impossible values from a cell in one shot;
· Saved one level of data structure break down when not necessary;
· Faster in performance.
The algorithm and script notes
The main idea of the algorithm is to:
· Repeat following iteration until all cells are resolved:
· Find all enclosed range combo;
· Take out the numbers in the combo from other related cells (non-fit cells in the same range).
In the script, the main workbench is the temp table #Sudoku which holds each cell as a record corresponding to the cells in the puzzle, from left to right, from top down. Of its columns, ID is the sequence number from 1 to 81, the Val column is the decimal format of the binary hash of all possible values in the cell. Other columns (row number, column number, … , etc.) are added only for facilitating the algorithm.
The #num table holds the possible value to binary hash mapping, 1=>21-1, … , 9=>28.
The #Combo table hold all 511 possible combos of {1, 2, … , 511}, with an extra column showing the number of possible values in that combo (for performance reason).
The #MapRange table is created to map each cell (by ID) into three ranges, so that I can do grouping on all ranges with one SQL rather than run the same SQL against rows, columns and blocks three times (in each iteration). I defined the range number as: row number; column number times 10; and block number times 100. So, e.g., for the second cell in the forth row (with ID=29) you will find 3 records in this mapping:
· ID=29, Range=4 (the forth row)
· ID=29, Range=20 (the second column)
· ID=29, Range =400 (the forth block)
The #temp that is created / dropped in each iteration, it can be included into the next CTE sentence right easily, but you will see the performance drops when you do that. You will also get some slow down if you create this temp table only once and keep truncating and inserting in each iteration. I will leave it for you to figure out why.
Performance Tweaks
For performance interests, you can tweak this script in many ways (e.g., use of indexes, table variables, different joins, … , etc.). But it’s the not the main purpose of this post anyway, I will leave it for you if you are interested. But please do let me know if you come up with better ideas and any suggestions.