Introduction
"You must unlearn what you have learned" - Yoda
What exactly is a numbers table and, more importantly, what can it do for you? To put it simply, a numbers table is a
cheap (read: "free") tool that should be part of every SQL developer's toolkit. In its simplest form, a
numbers table is a table with one INT column. That single column, which also serves as
the primary key and clustered index, contains the natural (or "counting") numbers starting with 1 and going
up to some specified upper limit. We'll begin by describing how to create a numbers table, and then move on to a
discussion of why they are so useful. As always, I'll throw in some sample code (for SQL Server 2000 and
SQL Server 2005) so you can play along at home.
Make Your Own Numbers Table!
Let me start by saying that every database should have its own numbers table. They're very useful (as we'll
see later), and extremely easy to create. First, the table itself. This will work in
SQL Server 2000 or SQL Server 2005:
-- Create the numbers table
CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);
GO
Well that was simple enough. Now comes the fun part - populating it. There are lots of methods to populate the
numbers table, but we'll start with the most obvious solution first. This method is a simple
WHILE loop that will populate our numbers table one row at a time:
DECLARE @i INT;
SELECT @i = 1;
WHILE @i <= 10000
BEGIN
    INSERT INTO dbo.Numbers (Num) VALUES (@i);
    SELECT @i = @i + 1;
END;
This simple procedural solution works on both SQL 2000 and SQL 2005. There are numerous other clever ways to
create a numbers table. Here's one of my personal favorites (thanks to Jeff Moden and all others who posted it
for public consumption). This method uses the Transact-SQL SELECT INTO
statement and IDENTITY function to simultaneously create and populate
the table, so we have to drop the previously created numbers table. This example works on SQL 2000 or SQL 2005:
-- Be sure to drop the numbers table if it exists
DROP TABLE dbo.Numbers;
GO

-- Now re-create it and fill it with sequential numbers starting at 1
SELECT TOP 10000 IDENTITY(INT,1,1) AS Num
INTO dbo.Numbers
FROM master.INFORMATION_SCHEMA.COLUMNS i1
CROSS JOIN master.INFORMATION_SCHEMA.COLUMNS i2;
GO

-- Add a primary key/clustered index to the numbers table
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Num);
GO
Notice that this set-based method offers a significant speed improvement over the WHILE loop version. This simple
query relies on the fact that CROSS JOINing a view with itself squares the size of the result set. If your
master.INFORMATION_SCHEMA.COLUMNS view has over 400 rows in it (not unreasonable), the CROSS JOIN
will generate over 160,000 rows. If you need more, just slap another CROSS JOIN in there. We're using the
TOP keyword on the SELECT statement to limit the results to 10,000 rows for our purposes, though.
Feel free to populate your numbers table with as many numbers as you need up to the upper limit for an INT,
which happens to be 2,147,483,647.
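If you do need more rows than the CROSS JOIN pair can deliver, a third CROSS JOIN works the same way. Here's a quick sketch that builds a one-million-row numbers table (it assumes you've dropped any existing dbo.Numbers first):
-- Rebuild with a three-way CROSS JOIN for more raw rows to draw from
-- (at 400+ rows per view, that's 64,000,000+ candidates; TOP trims the rest)
SELECT TOP 1000000 IDENTITY(INT,1,1) AS Num
INTO dbo.Numbers
FROM master.INFORMATION_SCHEMA.COLUMNS i1
CROSS JOIN master.INFORMATION_SCHEMA.COLUMNS i2
CROSS JOIN master.INFORMATION_SCHEMA.COLUMNS i3;
GO

ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Num);
GO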
Now that we have a numbers table it's time to put it to work solving some problems.
The Numbers Table in Action
The power of the numbers table is that it can be used to convert procedural solutions to set-based solutions.
Consider the ubiquitous comma-delimited string-split function. As before, we'll start with the procedural version,
which basically consists of a WHILE loop. This code works on
SQL 2000 and SQL 2005:
CREATE FUNCTION dbo.fnProceduralSplit(@string VARCHAR(8000))
RETURNS @ret TABLE (
    Num INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    String VARCHAR(255))
AS
BEGIN
    DECLARE @i INT, @j INT;
    SELECT @i = 1;
    SELECT @j = CHARINDEX(',', @string);
    WHILE (@j > 0)
    BEGIN
        INSERT INTO @ret (String)
        VALUES (SUBSTRING(@string, @i, @j - @i));
        SELECT @i = @j + 1;
        SELECT @j = CHARINDEX(',', @string, @i);
    END;
    INSERT INTO @ret (String)
    VALUES (SUBSTRING(@string, @i, LEN(@string) - @i + 1));
    RETURN;
END
GO
All this code does is loop through the string passed in as a parameter looking for commas. It breaks out the
strings between any commas it finds. It's probably about as efficient as you can get for procedural code. So
now let's see what the equivalent function looks like in a set-based version on SQL 2000 and SQL 2005.
In this version we'll let SQL Server do all the heavy lifting. We'll use an inline table-valued function
for performance, and we'll return the number representing the position of each comma in the list to keep the
relative ordering of strings in the list:
CREATE FUNCTION dbo.fnSetSplit(@String VARCHAR(8000))
RETURNS TABLE
AS
RETURN (
    SELECT Num,
        SUBSTRING(@String,
            CASE Num WHEN 1 THEN 1 ELSE Num + 1 END,
            CASE CHARINDEX(',', @String, Num + 1)
                WHEN 0 THEN LEN(@String) - Num + 1
                ELSE CHARINDEX(',', @String, Num + 1) - Num -
                     CASE WHEN Num > 1 THEN 1 ELSE 0 END
            END) AS String
    FROM dbo.Numbers
    WHERE Num <= LEN(@String)
        AND (SUBSTRING(@String, Num, 1) = ',' OR Num = 1)
)
You can test the functionality of your new UDFs with queries like the following:
SELECT * FROM dbo.fnSetSplit('Jack,Jill,Hill,Water');
SELECT * FROM dbo.fnSetSplit('Alexander The Great');
SELECT * FROM dbo.fnSetSplit('Red,Green,Blue');
SELECT * FROM dbo.fnProceduralSplit('Army,Navy,Air Force,Marines');
SELECT * FROM dbo.fnProceduralSplit('Washington,Jefferson,Adams');
SELECT * FROM dbo.fnProceduralSplit('1,2,3,4,5,6,7,8');
For single queries like those above, you might not notice a large difference in performance,
but across thousands of such queries in a short time period, the difference can be significant. I ran several rounds
of comparisons splitting 100,000 randomly generated comma-separated lists of "words"*. Here
are the averages:
|                     | SQL 2000   | SQL 2005    |
|---------------------|------------|-------------|
| Procedural Function | 130,783 ms | 86,033 ms   |
| Set-based Function  | 55,390 ms  | 51,423 ms** |
Notes:
*The "words" were acutally randomly generated strings from 1 to 20 characters in length.
**In SQL 2005, using the CROSS APPLY operator reduced the time of the
set-based test to 27,296 ms.
Of course performance may vary on your system, but the results I got show a definite improvement when using
the set-based version over the procedural version.
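For reference, the CROSS APPLY run was structured something like the following. This is a sketch rather than my exact test harness, and dbo.TestLists is a hypothetical table holding the 100,000 generated strings:
-- SQL 2005 only: split every list in a single set-based pass
-- dbo.TestLists(List VARCHAR(8000)) is a hypothetical test-data table
SELECT t.List, s.Num, s.String
FROM dbo.TestLists t
CROSS APPLY dbo.fnSetSplit(t.List) s;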
Here Comes the Second Example
OK, still not convinced? Then let's try another example. Here we'll eliminate duplicate side-by-side letters from
a string. This is a commonly performed operation in phonetic encoding and approximate search routines, and here
we'll do it in a simple set-based UDF that will work on SQL 2000 or 2005:
CREATE FUNCTION dbo.fnReduceDupes(@string VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @Result VARCHAR(8000);
    SELECT @Result = @string;
    SELECT @Result = STUFF(@Result, Num, 1,
        CASE SUBSTRING(@Result, Num, 1)
            WHEN SUBSTRING(@Result, Num + 1, 1) THEN '!'
            ELSE SUBSTRING(@Result, Num, 1)
        END)
    FROM dbo.Numbers
    WHERE Num <= LEN(@Result);
    SELECT @Result = REPLACE(@Result, '!', '');
    RETURN @Result;
END
This UDF reduces sequences of side-by-side duplicate characters to a single character. It does this by first
replacing all but the last of any sequence of side-by-side duplicate characters with an exclamation point
('!', though any character guaranteed not to appear in the input would do). It then uses the Transact-SQL REPLACE function to remove
all exclamation points from the input string. Again, this function would be a perfect candidate for
a WHILE loop, but the numbers table lets us tell SQL Server to do the hard work for us. Here are some samples
to try:
SELECT dbo.fnReduceDupes('AAAABBBBBCCDDDDEEEEFFGGGGHIIIIJJKK');
SELECT dbo.fnReduceDupes('HHHHHIIIIII TTTTHHHEEEERRRRRRREEEE');
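-- Expected results: 'ABCDEFGHIJK' and 'HI THERE', respectively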
One Last Time
OK, so we can see the usefulness of a numbers table in creating set-based, position-dependent string functions. But
all the world is not a string, and often we'll need to apply business logic that has nothing to do with string
manipulation.
Let's consider the SQL 2000 Northwind database for a moment. This database has a very simple "inventory"
system of sorts -- the kind you might create for a very basic web store. Basically, the dbo.Products
table tracks inventory in a single column named UnitsInStock. This is your inventory. I say it's very
basic because it lacks any lot/bin or tracking information. The items ordered, and the quantities of those items, are
stored in the ProductID and Quantity columns of the
dbo.[Order Details] table.
The problem is this: we need to match up the items in inventory to specific customer orders. The only real issue
we might face is that the total quantity of a particular product on order exceeds the total quantity of that
product in inventory; i.e., we've sold or promised more than we have on hand. In a more detailed inventory
tracking system (such as the SQL 2005 AdventureWorks shelf/bin inventory system), we'd face more potential
issues (e.g., a customer orders more of a product than is contained in one bin, so we have to pull from multiple
bins to fill one order; assigning items from bins to customer orders; etc.)
A row-by-row cursor might seem like the obvious solution for this problem. After all, we need to assign each item
we have in inventory to a customer order detail row. Also we need to keep track of the inventory quantity to make
sure we don't over-commit and try to ship more product than we have on hand. Or do we?
Here I'll present a SQL 2000 Northwind solution to this problem:
SELECT Order_Items.OrderID, Product_Inventory.ProductID,
    COUNT(*) AS ItemsFromInventory
FROM (
    SELECT ProductID, Num
    FROM dbo.Numbers n
    INNER JOIN dbo.[Products] p
        ON n.Num BETWEEN 1 AND p.UnitsInStock
) Product_Inventory
INNER JOIN (
    SELECT OrderID, ProductID,
        (
            SELECT COALESCE(SUM(o1.Quantity), 0)
            FROM dbo.[Order Details] o1
            WHERE o1.ProductID = o.ProductID
                AND o1.OrderID < o.OrderID
        ) + n.Num AS Num
    FROM dbo.Numbers n
    INNER JOIN dbo.[Order Details] o
        ON n.Num BETWEEN 1 AND o.Quantity
) Order_Items
    ON Product_Inventory.ProductID = Order_Items.ProductID
    AND Product_Inventory.Num = Order_Items.Num
GROUP BY Order_Items.OrderID, Product_Inventory.ProductID
ORDER BY Order_Items.OrderID, Product_Inventory.ProductID;
OK, so what just happened here?
To begin with, we joined each product we have in inventory to the numbers table using
ON n.Num BETWEEN 1
AND p.UnitsInStock. What this does is individually number each unit of each product we have
in stock. By way of example, if we have two of Product ID 998 in stock and three of Product ID 999 in stock, the
Product_Inventory subquery will generate the following result:
| ProductID | Num |
|-----------|-----|
| 998       | 1   |
| 998       | 2   |
| 999       | 1   |
| 999       | 2   |
| 999       | 3   |
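If you want to see this numbering for yourself, the Product_Inventory subquery runs fine on its own; this is just the inner query from above with an ORDER BY added for readability:
-- Number each individual unit we have in stock, one row per unit
SELECT p.ProductID, n.Num
FROM dbo.Numbers n
INNER JOIN dbo.[Products] p
    ON n.Num BETWEEN 1 AND p.UnitsInStock
ORDER BY p.ProductID, n.Num;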
Next we assigned a number to each item on each order. This is the Order_Items subquery.
It's more complex than the Product_Inventory query because we have to account for both
ProductID and OrderID. The complexity comes because the
items ordered and their quantities are spread out over several rows on several different orders. But we want the
count to go from 1...n where n is the total quantity of a single ordered item across all orders.
Functionally we're just replicating SQL 2005's ROW_NUMBER functionality, which will
do this exact same thing for us when the optional PARTITION BY clause is specified.
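As a sketch of that equivalence on SQL 2005: the numbers table still expands each order detail row into individual units, but ROW_NUMBER replaces the correlated running-sum subquery.
-- SQL 2005 only: number each ordered unit, restarting at 1 per product
SELECT o.OrderID, o.ProductID,
    ROW_NUMBER() OVER (PARTITION BY o.ProductID
                       ORDER BY o.OrderID, n.Num) AS Num
FROM dbo.Numbers n
INNER JOIN dbo.[Order Details] o
    ON n.Num BETWEEN 1 AND o.Quantity;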
We INNER JOIN these two subqueries by their
ProductID and Num columns, which effectively assigns one
item from inventory to one item on each order. We don't have to worry about over-committing, since the
INNER JOIN simply produces no match for ordered units beyond what we
have in stock. After assigning each inventory item to an order detail item, we group
the results. The COUNT(*) of each group gives us the
total number of each item from inventory that was assigned to a given order.
The SQL 2005 version is similar, although the ROW_NUMBER function and
CTEs will probably make the job easier. Of course, as mentioned, the AdventureWorks sample database uses a
shelf/bin inventory system that's also slightly more complex. I'll leave the SQL 2005 version of this query as
an exercise for the reader.
Conclusions
Numbers tables are an extremely useful tool for writing efficient set-based code to replace cursors and other
procedural constructs.