March 12, 2006 at 9:57 pm
I am working on a simulation that takes 120-300 records, randomly assigns them to 10 arbitrary groups, and then performs a statistical analysis. The statistical analysis is not problematic, but the random group assignment requires some additional work. Note: the groups must be evenly distributed; that is, each group should have the same number of records in it. For example, if there were 100 records and ten groups, each group would have ten records. I would also like to use the best random number generator possible. Finally, this portion of the simulation exercise will be repeated for 50K to 1M iterations (TBD), so the most performant solution would be best.
My current solution is to assign a random number to each row, sort the recordset on the random number, and assign the groups based on rank. I can only imagine that there is a better solution.
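For clarity, the random-key-and-rank approach described above can be sketched outside of SQL. Here is a minimal Python illustration (the function name and record ids are invented for the example) of assigning a random order and then grouping by rank, which keeps group sizes within one of each other even when the record count is not evenly divisible:

```python
import random

def assign_groups(record_ids, n_groups=10):
    """Assign a random key to each record, sort on it, then group by rank.

    Group sizes differ by at most one, even when the record count is not
    evenly divisible by n_groups.
    """
    ids = list(record_ids)
    random.shuffle(ids)  # equivalent to sorting the records on a random key
    n = len(ids)
    # rank * n_groups // n maps ranks 0..n-1 onto groups 1..n_groups
    # in near-equal contiguous blocks
    return {rec: rank * n_groups // n + 1 for rank, rec in enumerate(ids)}
```

For 123 records and 10 groups, this yields seven groups of 12 and three of 13 (or a similar split), which matches the "even distribution with stragglers" requirement.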
I have found this article insightful:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqlpro04/html/sp04c1.asp
The above article also highlights a key problem with the Rand() function: Rand() is only evaluated once per query, AND Rand() cannot be used in a UDF. Even in SQL Server 2005, use of Rand() in a UDF returns the following: "Invalid use of side-effecting or time-dependent operator in 'rand' within a function." I have not experimented with a .NET assembly yet.
I also wondered whether it would be worthwhile to use the SSIS "Row Sampling Transformation" in an iterative fashion to perform the sampling. The only catch is that all records have to be assigned to a group, so picking only one sample does not help.
Any ideas?
Note: I am using SQL Server 2005 and can also make use of .Net CLR.
March 13, 2006 at 2:13 am
Here's a self-standing sample that you may run with impunity...
This turned out to be a bit long-winded because an ORDER BY in an UPDATE in the presence of IS NULL just doesn't sort correctly. So, I had to revert to a temporary table to hold the randomized sample...
In order to make the "perfect" distribution you wanted, with the possibility that the number of records may not be evenly divisible by the number of groups, I had to make two distributions: one for the "perfect" distribution, and one for the stragglers.
Sorry it's not 100% loopless, although each loop (1 for each group) is set-based. It's still pretty fast. I haven't played with some of the new ranking features of 2005, but this may give you an idea...
I sure do hope this is what you meant... lemme know, eh?
--========================================================
-- Setup for the experiment
--========================================================
--===== Suppress the autodisplay of rowcounts for appearance
SET NOCOUNT ON
--===== Make sure the temp table doesn't exist so we can run
-- more than once
IF OBJECT_ID('TempDB..#Records') IS NOT NULL
DROP TABLE #Records
--===== This simulates some group of records from 120-300
-- I use temp tables for this type of experiment.
-- You would use permanent tables.
SELECT TOP 123 --<< Change this number to simulate diff group sizes
IDENTITY(INT,1,1) AS RecNo,
CAST(NULL AS INT) AS GroupID
INTO #Records
FROM Master.dbo.SysComments sc1,
Master.dbo.SysComments sc2
--========================================================
-- All set... demo the solution
-- This makes perfect distributions. If the number
-- of records is not an even product of 10, the
-- remaining records are NOT assigned...
--========================================================
--===== Create a "sample" table to hold random RecNo's
-- that will be updated with the same GroupID.
-- This is necessary because including an ORDER BY
-- in a sub-query of an update is basically useless
-- when you look for IS NULL... it always returns
-- things in order by RecNo for some reason.
IF OBJECT_ID('TempDB..#Sample') IS NOT NULL
DROP TABLE #Sample
CREATE TABLE #Sample (RecNo INT)
--===== Create a variable to hold the number of groups
DECLARE @Groups INT --<< parameterize this if making a proc
SET @Groups = 10
--===== Determine the "perfect" number of records for each group
DECLARE @GroupSize INT
SELECT @GroupSize = COUNT(*)/@Groups
FROM #Records
--===== This is our "distribution/group" counter
DECLARE @CurrentGroup INT
--===== This forces the update to only update a certain
-- number of records based on group size
SET ROWCOUNT @GroupSize
--===== Perform the "perfect" distribution randomly
SET @CurrentGroup = 1
WHILE @CurrentGroup <= @Groups
BEGIN
TRUNCATE TABLE #Sample
INSERT INTO #Sample (RecNO)
SELECT RecNO
FROM #Records
WHERE GroupID IS NULL
ORDER BY NEWID()
UPDATE #Records
SET GroupID = @CurrentGroup
FROM #Records r,
#Sample s
WHERE r.RecNo=s.RecNo
SET @CurrentGroup = @CurrentGroup+1
END
--========================================================
-- This picks up the rest of the records... there
-- will always be fewer than the group size here so
-- each record must be given a different number. The
-- lower groups are favored here.
--========================================================
--===== This forces the update to only affect 1 row per
-- loop
SET ROWCOUNT 1
--===== Perform the remaining distribution, 1 rec per group,
-- randomly by RecNo but favors the lower numbered groups.
SET @CurrentGroup = 1
WHILE @CurrentGroup <= @Groups
BEGIN
--===== Create the random sample
TRUNCATE TABLE #Sample
INSERT INTO #Sample (RecNO)
SELECT RecNo
FROM #Records
WHERE GroupID IS NULL
ORDER BY NEWID()
--===== Distribute the random sample
UPDATE #Records
SET GroupID = @CurrentGroup
FROM #Records r,
#Sample s
WHERE r.RecNo=s.RecNo
SET @CurrentGroup = @CurrentGroup+1
END
--===== All done... put things back to normal
SET ROWCOUNT 0
--===== Display the membership count by GroupID
SELECT 'This is the distribution of GroupID''s.'
SELECT GroupID,COUNT(*) AS MembershipCount
FROM #Records
GROUP BY GroupID
--===== Display the records and the assigned groups
SELECT 'Here are the records and the randomized distributed groups.'
SELECT * FROM #Records
p.s. The "Best" random number generator is the NEWID() function.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 13, 2006 at 8:52 am
I do agree with Jeff that for performance's sake you should use NEWID(), but if you need randomness quality, rand() is the way to go.
You and that article said:
"The above article also highlights a key problem with the Rand() function in that Rand() is only evaluated once per query AND that Rand() cannot be used in a UDF"
That is not entirely accurate.
You can use this to trick SQL into using rand() the way you want!
1. Setup a view like:
create view vwRand as
select rand() r
2. Create a UDF from that View
create function dbo.myrand()
returns float
as
begin
return(select r from vwRand)
end
3. Just try it:
select name, dbo.myrand() Rnd
from master.dbo.sysobjects
order by name
Mind you that this is not very fast. If using SQL2005 then retrieving the random number from a CLR udf will improve performance by orders of magnitude.
Good Luck,
* Noel
March 13, 2006 at 5:11 pm
That's cool...
You can also use NEWID() as the seed for Rand... in the algo below, "RANGE" is how many numbers you want in the range of random numbers and "OFFSET" is what the starting number should be...
SELECT RAND(CAST(NewID() AS VARBINARY))*range+offset
FROM sometable
For random numbers from 200 to 299, for example, you would use...
SELECT RAND(CAST(NewID() AS VARBINARY))*100+200
FROM sometable
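The same range+offset mapping can be sketched outside of T-SQL. Here is a hedged Python analogue (the function name and per-call seed are invented for illustration; the seed stands in for the per-row NEWID() used above): a uniform [0, 1) draw scaled onto the half-open interval [offset, offset + range).

```python
import random

def rand_in_range(rng, offset, seed=None):
    """Mimic RAND(<seed>) * range + offset: a uniform [0, 1) draw
    scaled onto the half-open interval [offset, offset + rng).
    A distinct per-call seed stands in for seeding RAND() with NEWID().
    """
    r = random.Random(seed).random() if seed is not None else random.random()
    return r * rng + offset
```

So `int(rand_in_range(100, 200, seed=row_id))` yields an integer from 200 to 299, one per row, just as the SELECT above does.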
--Jeff Moden
Change is inevitable... Change for the better is not.
September 9, 2022 at 1:53 pm
Is this sufficient?
DECLARE @NumberOfGroups INT = 10
SELECT a.[name],
a.Rn%@NumberOfGroups + 1 AS Grp
FROM (
SELECT [name], ROW_NUMBER() OVER (ORDER BY NEWID()) AS Rn
FROM sys.objects
) AS a
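The `Rn % @NumberOfGroups + 1` trick above guarantees that group sizes differ by at most one, since the row numbers over a random ordering are dealt out round-robin. A minimal Python sketch of the same idea (names and function invented for illustration):

```python
import random

def modulo_groups(names, n_groups=10):
    """ROW_NUMBER() OVER (ORDER BY NEWID()) then Rn % N + 1, in Python:
    shuffle the rows, number them from 1, and take the row number
    modulo the group count, so group sizes differ by at most one.
    """
    shuffled = random.sample(list(names), k=len(names))
    return [(name, rn % n_groups + 1)
            for rn, name in enumerate(shuffled, start=1)]
```

This is simpler than the looped UPDATE approach and handles the stragglers automatically, at the cost of one sort over the random NEWID() values per iteration.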