April 5, 2006 at 12:27 pm
Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/dpoole/generatingarange.asp
April 12, 2006 at 1:28 am
Hello,
Personally, I would use statements along the lines of the ones below (taken from an article by Y. Ben-Gan on http://www.sqlmag.com; search Google for the words CREATE TABLE NUMS). The idea is to populate the table exponentially, doubling the number of records at each iteration.
To achieve the desired results, one would populate the table by setting @max = (@end - @start + 1), where @start and @end denote the desired range, and then issue an update statement like
UPDATE Nums SET n = n - 1 + @start
I wonder how this solution would compare to the others in terms of performance ...
Sincerely,
Emmanuel Nanchen
CREATE TABLE Nums(n INT NOT NULL);
DECLARE @max AS INT, @rc AS INT;
SET @max = 1000; -- revise max number according to your needs
SET @rc = 1;
BEGIN TRAN
INSERT INTO Nums VALUES(1);
WHILE @rc * 2 <= @max
BEGIN
INSERT INTO Nums SELECT n + @rc FROM Nums;
SET @rc = @rc * 2;
END
INSERT INTO Nums
SELECT n + @rc FROM Nums WHERE n + @rc <= @max;
COMMIT TRAN
ALTER TABLE Nums ADD CONSTRAINT PK_Nums PRIMARY KEY(n);
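For example, a minimal sketch of the shift step described above (assuming the script has been run with @max = @end - @start + 1; the range 101 to 600 is illustrative):
DECLARE @start INT, @end INT;
SET @start = 101;
SET @end = 600;
-- Nums currently holds 1..(@end - @start + 1), i.e. 1..500
UPDATE Nums SET n = n - 1 + @start; -- Nums now holds @start..@end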
Emmanuel Nanchen
April 12, 2006 at 2:05 am
I have been dealing with number ranges for a long time, and I have tried approaches similar to the ones described in this article. Sadly, the only really fast solution here is the dumbest one.
Create a table NUMS and fill it with as many numbers as needed. Add an index or primary key. Then use this pre-filled table in any SQL statement that requires a sequential range of numbers. This is really fast. It takes less than a second in my application.
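A minimal sketch of that pre-filled approach (the table name NUMS comes from the post; the one-million upper bound is an assumption):
-- Create and fill the table once, then reuse it everywhere
CREATE TABLE NUMS (n INT NOT NULL)
DECLARE @i INT
SET @i = 1
INSERT INTO NUMS VALUES (1)
WHILE @i < 1000000 -- illustrative upper bound
BEGIN
INSERT INTO NUMS SELECT n + @i FROM NUMS WHERE n + @i <= 1000000
SET @i = @i * 2
END
ALTER TABLE NUMS ADD CONSTRAINT PK_NUMS PRIMARY KEY CLUSTERED (n)
-- Any statement needing a sequential range just filters the table
SELECT n FROM NUMS WHERE n BETWEEN 500 AND 1500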
April 12, 2006 at 4:26 am
Hi everybody,
I ran some tests of the exponential approach a while back and posted the results on SSC; it was faster than any other approach (we are talking about large result sets, not those with 1000 records). I agree with aka's opinion that having a static table is the fastest way, but it has limitations, too. You don't always know the maximum number of records in advance, and the table still has to be populated somehow at the beginning. So, if you ask me, the really useful solution is a table-valued UDF that returns a recordset it populates internally, using the exponential approach. When I tested the UDFs I posted on SSC a few years ago, the final cross join code completed in 4 minutes, while the UDF completed in 13 seconds. In both cases, I executed
select count(*) from the_UDF_name (5, 567720)
Regards,
Goce.
April 12, 2006 at 6:32 am
Hi,
Echoing the earlier post: the algorithm suggested above is by Itzik Ben Gan. It is the one I regularly use, and after a brief performance test against the best performing query in the article, I am getting average results of
fnCrossJoinRange2(5,567720): 14.5 s
and
fn_nums(5,567720): 11.2 s
I have taken Itzik's algorithm and added the min and max logic.
CREATE FUNCTION dbo.fn_nums(@min int, @max int)
RETURNS @Nums TABLE(n int NOT NULL)
AS
BEGIN
IF @max >= @min
BEGIN
DECLARE @factor AS int
SET @factor = 1
INSERT INTO @Nums VALUES(@min)
-- double the row count until it covers the size of the requested range
WHILE @factor * 2 <= @max - @min + 1
BEGIN
INSERT INTO @Nums SELECT n + @factor FROM @Nums
SET @factor = @factor * 2
END
-- top up with the remaining rows, capped at @max
INSERT INTO @Nums
SELECT n + @factor FROM @Nums WHERE n + @factor <= @max
END
RETURN
END
Thanks
Alex Weatherall
April 12, 2006 at 7:00 am
I can't find the article by Itzik Ben Gan - can someone please post a link.
Any idea when he 'came up with' the algorithm? I ask because it is suspiciously like the one I posted on here in September 2002!
http://www.sqlservercentral.com/scripts/viewscript.asp?scriptid=321
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
April 12, 2006 at 7:22 am
he 'came up with' the algorithm
Hi, those quotes are yours. I'm not trying to imply that it is *his* algorithm (even though that was what I said), but that he has promoted it as an efficient way of generating numbers on the fly.
Yes, your script does perform the generation in a similar manner. But it is conceivable that you've independently come up with the same idea 🙂
Cheers
Alex
April 12, 2006 at 7:42 am
If a very specific range is required, a mixed approach can be used: the predefined NUMS table serves as the source, with one or two outer SELECTs that multiply, add to, or otherwise transform the source range.
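A minimal sketch of that mixed approach (assuming a pre-filled 1..N NUMS table as described earlier; the range values are illustrative):
-- Shift the pre-filled table into the range 1001000..1001500
DECLARE @start INT, @end INT
SET @start = 1001000
SET @end = 1001500
SELECT @start + n - 1 AS n
FROM NUMS
WHERE n <= @end - @start + 1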
April 12, 2006 at 8:44 am
I disagree with David’s conclusion that
"However, I found that for small ranges, say under 500 records, there was little difference in performance between the procedural and set based versions. Once beyond the 500 threshold the set based version won hands down."
I am working with a result set of 500 rows, but my initial value is a large number, and the set based approach on my test machine takes 6 seconds to run while the procedural approach takes less than a second:
SELECT * FROM [dbo].[fnCrossJoinRange2](1001000, 1001500) Run time 6 sec
SELECT * FROM [dbo].[fnRange](1001000, 1001500) Run time 0 sec
I like David’s approach, but it does not perform as well for large initial values.
April 12, 2006 at 11:20 am
Here is a solution using CLR integration. It comes in at around 5 seconds on my machine for the range 5 to 567720.
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName="NumberRangeFillRow",
        TableDefinition="number int",
        IsPrecise=true,
        IsDeterministic=true)]
    public static System.Collections.IEnumerable NumberRange(int first, int last)
    {
        // Stream the numbers one at a time; SQL Server turns each yielded
        // value into a row via the fill-row method below
        int counter = first;
        while(counter <= last)
        {
            yield return counter;
            counter++;
        }
    }

    public static void NumberRangeFillRow(object obj, out int number)
    {
        number = (int)obj;
    }
}
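For completeness, a hedged sketch of how such a CLR function might be registered and called from T-SQL (the assembly name RangeFunctions and the DLL path are assumptions):
-- Enable CLR and register the assembly (names and path are illustrative)
EXEC sp_configure 'clr enabled', 1
RECONFIGURE
CREATE ASSEMBLY RangeFunctions FROM 'C:\assemblies\RangeFunctions.dll'
CREATE FUNCTION dbo.NumberRange(@first INT, @last INT)
RETURNS TABLE (number INT)
AS EXTERNAL NAME RangeFunctions.UserDefinedFunctions.NumberRange
-- Then the timing test from this thread becomes
SELECT COUNT(*) FROM dbo.NumberRange(5, 567720)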
April 12, 2006 at 12:24 pm
I must admit I didn't look at the possibility of a high initial value but with a small range.
Following it through logically, all the selects would come into play, so in this scenario it would perform much slower.
If you produce the range using
SELECT 1000000+value FROM [dbo].[fnCrossJoinRange2](0, 500)
what sort of results do you get?
The one thing that does show up here is that there are many, many solutions and it is always worth spending a little time researching alternatives.
April 13, 2006 at 10:49 pm
Well done, David. It's nice to see it when someone takes multiple methods and does a really good performance comparison.
It seems that generating a range of numbers in a temporary storage area is something that a lot of folks want to do and have always had to resort to some form of loop or a complex setup where they create mini-tables to grow bigger ones using cross-joins. I believe the reason for this is because the IDENTITY function does not allow variables for the seed nor the increment and the TOP clause is not programmable in versions less than 2005.
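As an aside, a hedged sketch of what the programmable TOP clause allows on 2005 (not part of the 2000-compatible solution below; names are illustrative):
-- SQL Server 2005 only: TOP accepts a variable expression, so a shifted
-- range can be generated in a single set-based statement
DECLARE @LoVal INT, @HiVal INT
SET @LoVal = 500000
SET @HiVal = 1000000
SELECT TOP (@HiVal - @LoVal + 1)
       @LoVal - 1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
INTO #Tally2005
FROM Master.dbo.SysColumns sc1,
     Master.dbo.SysColumns sc2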
Speaking of alternatives, I've solved that problem for temp tables in the code below... and the meat of the code uses a single insert in the spirit of the closing remarks in your article's conclusion. It won't work for table variables so it cannot be used in a function.
As a reminder to everyone, table variables do not use statistics, nor can they be made to do so. As a result, they are much slower than their more effective temp table "siblings". There are a lot of things that are wrong with table variables and a lot of incorrect information about temp tables. But don't take my word for it... check out the following URL. You may never use a table variable again except when a multi-row structure is required as a return from a function...
http://support.microsoft.com/default.aspx?scid=kb;en-us;305977&Product=sql2k
OK, here's the programmable numeric range generation code I promised. The very first time you run it, it takes about 7 seconds on 500,000 rows, including the addition of the clustered primary key. Once compiled, subsequent runs take less than 4 seconds to generate a range of 500,000 records no matter what the starting number is, and that includes the addition of the clustered primary key. Obviously, the smaller the range, the faster this one will run; it generates 500 records in about 16 milliseconds regardless of the starting number...
--===== Declare local performance measuring variable
-- This would not be part of the normal code
DECLARE @StartTime DATETIME
--===== Start the timer
SET @StartTime = GETDATE()
--===== Declare local variables
DECLARE @LoVal INT
DECLARE @HiVal INT
DECLARE @Vals INT
DECLARE @SQL NVARCHAR(300)
--===== Set the local variables
SET @LoVal = 500000 --Could be param for sp
SET @HiVal = 1000000 --Could be param for sp
SET @Vals = @HiVal-@LoVal+1
--===== If the temp table already exists, drop it
IF OBJECT_ID('TempDB..#Tally') IS NOT NULL
DROP TABLE #Tally
--===== Create the new temp table
-- The "One" column is a dummy target
CREATE TABLE #Tally (N INT IDENTITY(1,1) ,One TINYINT)
--===== Change the starting value of IDENTITY
DBCC CHECKIDENT (#Tally, RESEED, @LoVal) --Doesn't work on table variables
--===== Ready to rock... populate the table
-- using a nasty ol' cross join
SET ROWCOUNT @Vals
INSERT INTO #Tally (One)
SELECT 1
FROM Master.dbo.SysColumns sc1, --Contains >4800 records
Master.dbo.SysColumns sc2 --so good to about 23 million records
SET ROWCOUNT 0
--===== Add a clustered Primary key
ALTER TABLE #Tally
ADD PRIMARY KEY CLUSTERED (N)
--===== Display the duration of the run
PRINT STR(DATEDIFF(ms,@StartTime,GETDATE())/1000.0,10,3) + ' Seconds total duration'
The key to the programmability of the above code is the use of the DBCC CHECKIDENT command, which does allow a variable for the RESEED value. It's also the reason why the table must be created first, instead of using something like the following code (hats off to my DBA, Andy Dowswell, for first showing me this little trick a while back)...
SET ROWCOUNT @HiVal
SELECT IDENTITY(INT,1,1) AS N
INTO #Tally
FROM Master.dbo.SysColumns sc1, --Contains >4800 records
Master.dbo.SysColumns sc2 --so good to about 23 million records
SET ROWCOUNT 0
Admittedly, the programmable code ends up with an extra column of numbers, but TINYINT (or BIT, for that matter) only takes up 1 byte per row (VARCHAR(1) would take 6 bytes even if NULL). Of course, the extra column could be a datetime column, where you update the column with a starting date plus the N column +/- some offset to make things work.
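A hedged sketch of that datetime variant (assuming the #Tally table built above; the column name, starting date, and the hard-coded @LoVal of 500000 are illustrative):
ALTER TABLE #Tally ADD TheDate DATETIME
GO
-- Populate the new column: starting date plus the row's offset within the range
UPDATE #Tally
   SET TheDate = DATEADD(dd, N - 500000, '20060101') -- 500000 = @LoVal from the run above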
In one spot in the code, I say that a couple of variables could be used in a stored procedure. Allow me to clarify: you would have to first create the #Tally temp table in the procedure that calls the range-generation proc in order for this to work as an SP, because the scope of non-global temp tables is "downward", not "upward", in calls.
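A sketch of that calling pattern (the procedure name usp_BuildTally is hypothetical; it would wrap the reseed-and-populate steps above, minus the table creation):
-- Caller creates #Tally first so the called proc can see and fill it
CREATE TABLE #Tally (N INT IDENTITY(1,1), One TINYINT)
EXEC dbo.usp_BuildTally @LoVal = 500000, @HiVal = 1000000
SELECT COUNT(*) AS RowCnt FROM #Tally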
Anyway, thanks for posting a great article and for all the other contributions and posts you've made on this and other SQL forums... you're definitely one of the good guys.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 14, 2006 at 12:36 am
Wow, the message above looks even better than the original article.
April 14, 2006 at 8:23 am
Thanks to everyone who has posted.
One feature I would really like to see on SQL Server Central (though it falls into big-bucks content management territory) is the ability for an author to revise their article on the basis of feedback.
Clearly the algorithm for the progressive doubling of the number range should be included, as should the IDENTITY variation.
I did consider the idea of storing a range in a specific table with a clustered index on it, but rejected it on the grounds that storing 2.1 billion records to allow for all variations of a 32-bit int wouldn't be popular.
If I were to use the static table approach, I would probably stick it in its own database so that it wouldn't artificially inflate the backups.