Generate n random int subsets

Question

Generate n random int subsets

Xedni

SSCertifiable

Points: 6951
More actions
June 25, 2016 at 8:46 pm

#314667

Howdy SSC,
I'm trying to generate some realistic test data to do some load testing. The app I'm testing takes a parameter of a comma delimited list of integers, representing different stocks. For simplicity's sake, lets say represented by the numbers 1 through 10,000. The app caches calls, so I can't just pass the same set of symbols each time, or it won't hit all the code I want. Order of the subset doesn't matter to me
What I'd like to do is generate about a million random subsets of that list with each subset containing, say, between 1 and 100 identifiers.
So a dumbed down version might be, given [1,2,3,4,5] and I want to generate n=5 subsets of between 1 and 3 symbols
1) [1,3]
2) [2,3,5]
3) [4]
4) [1,3]
5) [2,3,4,5]
While i recognize this is HORRIBLY RBAR, this is what I'm trying to get at. Any suggestions for scaling this up and getting rid of the loop would be most welcome
declare
@numSubsets int = 50,
@subsetMaxSize int = 5,
@i int = 1,
@topCt int
if object_id('tempdb.dbo.#symbols') is not null drop table #symbols
create table #symbols
(
SymbolId int primary key clustered
)
if object_id('tempdb.dbo.#Subsets') is not null drop table #Subsets
create table #Subsets
(
SubsetId int,
SymbolId int
)
insert into #symbols (SymbolId)
select top 5 row_number() over (order by (select null))
from sys.all_objects
while @i <= @numSubsets
begin
select @topCt = (abs(checksum(newid())) % @subsetMaxSize)
insert into #Subsets
(
SubsetId,
SymbolId
)
select top (@topCt)
SubsetId = @i,
SymbolId = SymbolId
from #Symbols
order by newid()
select @i += 1
end
select top 1000 *
from #Subsets
Executive Junior Cowboy Developer, Esq.[/url]

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Eirikur Eiriksson SSC Guru Points: 182911 More actions · Answer 1

Here two fairly simple set based solutions based on the same logic. The first is pretty fast generating smaller sets and doesn't use a temporary table, the second one uses a temporary table and is better suited for larger sets.

😎

USE tempdb;

GO

SET NOCOUNT ON;

-- Test set configurations:

-- @ENTRY_NUMBER, size of the population

DECLARE @ENTRY_NUMBER INT = 10000;

-- @MAX_SET_ENTRIES, maximum number of entries from the population

-- in each subset.

DECLARE @MAX_SET_ENTRIES INT = 100;

-- @SAMPLE_SIZE, number of subsets

DECLARE @SAMPLE_SIZE INT = 1000;

-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

-- BASE_SET returns @SAMPLE_SIZE number of rows with a pseudo random number from

-- 1 to @MAX_SET_ENTRIES which is controls the subset size for the entry.

,BASE_SET AS

(

SELECT

NM.N AS SAMPLE_ID

,((ABS(CHECKSUM(NEWID())) % @MAX_SET_ENTRIES) + 1) AS SAMPLE_SIZE

FROM NUMS NM

)

-- BASE_WITH_SUBSETS adds one row to the set for each member of the subset

-- and the numerical identifier for that member

,BASE_WITH_SUBSETS AS

(

SELECT DISTINCT

BS.SAMPLE_ID

,BS.SAMPLE_SIZE

,CONVERT(VARCHAR(12),((ABS(CHECKSUM(NEWID())) % @ENTRY_NUMBER) + 1),0) AS SUB_VAL

FROM BASE_SET BS

CROSS APPLY

(

SELECT TOP(BS.SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

) AS SNUM(N)

)

-- The final select concatenates the subsets returning one row for each

-- set and it's subset as a comma delimited value string.

SELECT

BWS.SAMPLE_ID

,STUFF((

SELECT

CHAR(44) + SBWS.SUB_VAL

FROM BASE_WITH_SUBSETS SBWS

WHERE BWS.SAMPLE_ID = SBWS.SAMPLE_ID

FOR XML PATH('')

),1,1,'') AS SUB_SET

FROM BASE_WITH_SUBSETS BWS

GROUP BY BWS.SAMPLE_ID;

USE tempdb;

GO

SET NOCOUNT ON;

-- Drop the temporary table if it exists

IF OBJECT_ID('tempdb..#TMP_BASE_WITH_SUBSETS') IS NOT NULL DROP TABLE #TMP_BASE_WITH_SUBSETS;

-- Test set configurations:

-- @ENTRY_NUMBER, size of the population

DECLARE @ENTRY_NUMBER INT = 10000;

-- @MAX_SET_ENTRIES, maximum number of entries from the population

-- in each subset.

DECLARE @MAX_SET_ENTRIES INT = 100;

-- @SAMPLE_SIZE, number of subsets

DECLARE @SAMPLE_SIZE INT = 1000000;

-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

-- BASE_SET returns @SAMPLE_SIZE number of rows with a pseudo random number from

-- 1 to @MAX_SET_ENTRIES which is controls the subset size for the entry.

,BASE_SET AS

(

SELECT

NM.N AS SAMPLE_ID

,((ABS(CHECKSUM(NEWID())) % @MAX_SET_ENTRIES) + 1) AS SAMPLE_SIZE

FROM NUMS NM

)

-- BASE_WITH_SUBSETS adds one row to the set for each member of the subset

-- and the numerical identifier for that member

,BASE_WITH_SUBSETS AS

(

SELECT DISTINCT

BS.SAMPLE_ID

,BS.SAMPLE_SIZE

,CONVERT(VARCHAR(12),((ABS(CHECKSUM(NEWID())) % @ENTRY_NUMBER) + 1),0) AS SUB_VAL

FROM BASE_SET BS

CROSS APPLY

(

SELECT TOP(BS.SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

) AS SNUM(N)

)

SELECT

BWS.SAMPLE_ID

,BWS.SUB_VAL

INTO #TMP_BASE_WITH_SUBSETS

FROM BASE_WITH_SUBSETS BWS;

CREATE NONCLUSTERED INDEX NCLIDX#TMP_BASE_WITH_SUBSETS_SAMPLE_ID_INCL_SUB_VAL ON #TMP_BASE_WITH_SUBSETS (SAMPLE_ID ASC) INCLUDE (SUB_VAL);

-- The final select concatenates the subsets returning one row for each

-- set and it's subset as a comma delimited value string.

SELECT

BWS.SAMPLE_ID

,STUFF((

SELECT

CHAR(44) + SBWS.SUB_VAL

FROM #TMP_BASE_WITH_SUBSETS SBWS

WHERE BWS.SAMPLE_ID = SBWS.SAMPLE_ID

FOR XML PATH('')

),1,1,'') AS SUB_SET

FROM #TMP_BASE_WITH_SUBSETS BWS

GROUP BY BWS.SAMPLE_ID;

Sample output

SAMPLE_ID SUB_SET

---------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 1182,1185,1287,1303,1347,136,1702,2064,2088,2178,2328,238,2388,2410,2519,2553,2586,2628,2694,2757,2772,2888,3113,3277,3345,3406,3414,3542,3873,3908,3909,3995,4081,4225,4247,4348,4442,4464,4544,4955,4960,5121,5542,5544,5673,5882,5959,6120,6188,6838,6851,698

2 1585,16,1875,2771,5113,5377,5531,6486,694,786,8477,8570

3 1025,1100,1181,1212,1227,1296,1418,1419,1866,1874,188,2009,2049,2261,2325,244,2442,2473,2547,3061,3096,3235,3247,3343,3356,3563,3798,3944,4092,4110,4311,438,4426,4539,4601,4773,4809,4836,4929,4955,4972,4976,4996,5178,5196,5250,53,5400,5582,5605,5791,5807,6

4 1198,1545,1738,1773,1817,1819,2109,2174,2313,2316,2456,2589,279,3419,3782,4076,4139,422,4347,4436,4457,4536,4565,4816,4971,5057,5288,5404,5450,5686,5710,6258,6623,6685,6691,67,6744,6874,696,71,7289,7440,7505,7552,7902,8199,8250,8390,850,8526,8688,893,9151,

5 1151,119,1298,1380,1394,1478,1551,1552,1603,1702,1989,1996,2158,2211,2278,233,254,2563,2712,2748,2763,2882,2889,3028,3075,3276,3385,3410,3470,4006,4111,4151,4298,4579,4650,5084,5279,5296,5347,5488,5792,5883,5898,6018,6223,636,6374,6381,6399,6531,6674,6676,

6 128,1367,1662,2189,2717,3537,3660,4307,4422,6662,7956,8131,8975

7 1322,140,1534,1551,1681,2292,241,2431,2628,2649,2679,2815,284,2949,30,3159,322,3264,3852,413,4440,4523,4670,4936,4937,495,5005,5048,5056,5099,5964,6079,6331,6446,6693,694,7414,8126,8178,870,9089,9238,9241,9262,9369,9474,9610,9650,9815

8 1119,1446,1596,1635,1700,1788,1809,2003,2186,2259,2483,2646,270,2721,2791,2851,2880,2892,2966,3024,3153,3160,3200,3212,3213,3299,3309,3467,3496,3554,3757,3767,4178,4188,4400,447,4485,450,5035,504,53,5466,5531,5612,5675,5765,5899,5902,6037,6043,6045,6100,63

9 1093,12,1212,1326,1417,1474,1604,1652,1762,1829,1871,190,1925,2090,2155,2692,2790,2797,2925,3062,3077,315,3543,3845,42,4599,4997,5074,5149,5202,5222,5233,5521,5554,5663,5797,5952,6148,6173,6203,626,6275,635,6534,6591,6600,6711,6968,7013,711,7311,7319,7374,

10 1719,1820,2325,2578,3426,3560,3666,3806,3817,3875,4149,4726,4776,5530,5641,6185,6403,7118,7340,740,7401,7520,7645,8293,8472,8483,8564,9054,907,9691,9747

Edit: added the second solution using a temporary table.

Xedni SSCertifiable Points: 6951 More actions · Answer 2

Thanks for the response. Taking a look now...

Executive Junior Cowboy Developer, Esq.[/url]

Xedni SSCertifiable Points: 6951 More actions · Answer 3

Hi Erikur,

Thanks again for the reply. To preface this, your post was VERY helpful, and I doubt I would have got to the version I'm happy with now without your suggestions. I'll just say where I had to make some changes. Yours didn't do quite what I wanted right out of the box, but it gave me enough idea to get it working for what I needed. For what it's worth, when I ran your temp table version ran for about 50 minutes on my machine before I just killed it. Truthfully I didn't to a lot of profiling on it because there were a few other things I wanted to rewrite anyway. While using a random number mod @ENTRY_NUMBER is a really clever way of getting symbols, it can produce duplicates in a single set. And, unfortunately in my real set, the numbers are not uniformly distributed between 1 and 10000. Not a big deal, and given the information I gave you, perfectly valid assumptions.

After working through your sample code, here's what I came up with. All in all, it took about a 5 minutes to generate 1 million samples - start to finish. I'ts pretty disk intensive on tempdb (part of that may be the setup I have at home), but memory and memory never got too bad.

Note: I already had a persisted tally table (in my script, sysmon.dbo.numbers; I have one on ever server I ever use) so I don't use an in-line tally table

I also tried batching various steps like pulling the symbols, or serializing them, but all took longer, and didn't seem to make a sizable impact on system resources.

use tempdb

go

set nocount on

go

declare

@universeSize int = 10000,

@maxSubsetSize int = 100,

@maxSubsets int = 1000000

if object_id('tempdb.dbo.#Symbols') is not null drop table #Symbols

create table #Symbols

(

SymbolId int primary key clustered

)

if object_id('tempdb.dbo.#Samples') is not null drop table #Samples

create table #Samples

(

SampleId int primary key clustered,

SampleSize int,

Samples nvarchar(max)

)

if object_id('tempdb.dbo.#SampleSymbols') is not null drop table #SampleSymbols

create table #SampleSymbols

(

SampleId int,

SymbolId int

--primary key clustered (SampleId, SymbolId) with (data_compression = page)

)

-- This will normally be derived from a real table, so you can't assume uniform distribution, or even that all the values will be lower than some number.

-- Get a list of

insert into #Symbols (SymbolID)

select top 10000 SymbolID = num

from sysmon.dbo.numbers

order by newid() -- Just to give the list an air of randomness, so you don't try to use tricks involving sequential numbers. Not strictly necessary at all.

-- Insert an integer identifer for each sample you want to generate, plus a random number between 1 and @maxSubsetSize saying how many symbols should be in the sample

-- This serves as the framework for each group. SampleID is the identifier ofa given group

-- SampleSize is how mny rows it should randomly take off the symbol list

insert into #Samples

(

SampleID,

SampleSize

)

select top (@maxSubsets)

SampleID = num,

SampleSize = (abs(checksum(newid())) % @maxSubsetSize) + 1

from sysmon.dbo.numbers

-- DON PUT A CLUSTERED INDEX ON THIS. ITS WAAAAAAAAY FASTER INTO A HEAP

-- For each sample, cross apply a randomized subset of #Symbols with a top count = [SampleSize] (which we set in the previou statement)

insert into #SampleSymbols

(

SampleID,

SymbolId

)

select

s.SampleId,

x.SymbolId

from #Samples s

cross apply (select top (s.SampleSize) SymbolId

from #Symbols

order by newid()) x

--1.5 min on on about 50m rows

-- Cluster one up homie

-- Note I tried this witout clustering the #SampleSymbols table and it did horribly

create unique clustered index #PKC__#SampleSymbols on #SampleSymbols (SampleID, SymbolID)

with (Data_compression = none, maxdop = 4, online = off)

-- Serialize data

update o

set Samples = '[' + stuff((select ',{XID:' + cast(SymbolId as varchar(30)) + '}'

from #SampleSymbols i

where i.SampleId = o.sampleId

for xml path('')), 1, 1, '') + ']'

from #Samples o

Executive Junior Cowboy Developer, Esq.[/url]

Eirikur Eiriksson SSC Guru Points: 182911 More actions · Answer 4

Hi Xedni,

you are very welcome. I do see one problem with your code and that is that all subsets of the same size will be identical (duplicates), is this what you intended?

😎

Xedni SSCertifiable Points: 6951 More actions · Answer 5

Eirikur Eiriksson (6/27/2016)
...all subsets of the same size will be identical (duplicates), is this what you intended?
😎

Ew... you're right. Nice catch.

Executive Junior Cowboy Developer, Esq.[/url]

Sergiy SSC Guru Points: 110208 More actions · Answer 6

This would be much shorter version:

SELECT n.N ID, STUFF(

(SELECT ',' + CONVERT(VARCHAR(20), n3.RandomN)

FROM

--picking a random quantity of random numbers per ID

(SELECT TOP 1 N topn FROM dbo.TallyGenerator(1,3,1) WHERE n.N > 0 ORDER BY NEWID() ) n2

-- now picking the random numbers themselves, [topn] of different numbers per each ID

CROSS APPLY (SELECT TOP (topn) N RandomN FROM dbo.TallyGenerator(1,1000,1) WHERE n.N > 0 ORDER BY NEWID()) n3

FOR XML PATH('')

), 1,1,'')

--generating ID's - may be coming from a table with actual set of ID's, does not have to be generated

FROM dbo.TallyGenerator(1,10000,1) n --(SELECT id FROM #Numbers GROUP BY id) n

ORDER BY id

dbo.TallyGenerator here is an inline table function generating sequential numbers from @Param1 to @Param2 with step @Param3.

May be replaced with the commonly used CTE.

_____________
Code for TallyGenerator

Xedni SSCertifiable Points: 6951 More actions · Answer 7

Could you provide the code for TallyGenerator? Since I'm not actually trying to just generate random numbers, it's difficult to see where I'd slot in my legitimate symbols, or how it would scale.

Executive Junior Cowboy Developer, Esq.[/url]

Sergiy SSC Guru Points: 110208 More actions · Answer 8

Xedni (6/27/2016)
Could you provide the code for TallyGenerator? Since I'm not actually trying to just generate random numbers, it's difficult to see where I'd slot in my legitimate symbols, or how it would scale.

One of the versions is right there, in Erikur's post:

-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

It uses a single parameter (numbers to ..) .

Parameters "numbers from" and "increment" are hardcoded as 1.

Put this code into a function - and here you go.

_____________
Code for TallyGenerator

Sergiy SSC Guru Points: 110208 More actions · Answer 9

That's how you do it:

CREATE FUNCTION dbo.TallyGenerator (

@EndNumber bigint

)

RETURNS TABLE

AS RETURN

(-- Inline Tally table producing a number sequence from 1 to @EndNumber

WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

SELECT TOP(@EndNumber) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

)

GO

SELECT * FROM dbo.TallyGenerator(15) tg

_____________
Code for TallyGenerator

Xedni SSCertifiable Points: 6951 More actions · Answer 10

Gotcha. Unfortunately, that function doesn't have the same signature as the one in your original code sample. I modified the function to accept all three parameters (start, end and increment; num % @increment, while still being an inline tvf), but cancelled it after 5 minutes. It's an interesting (and informative) approach, however it would seem simplicity in this case doesn't equate to performance.

Executive Junior Cowboy Developer, Esq.[/url]

Sergiy SSC Guru Points: 110208 More actions · Answer 11

Xedni (6/27/2016)
Gotcha. Unfortunately, that function doesn't have the same signature as the one in your original code sample. I modified the function to accept all three parameters (start, end and increment; num % @increment, while still being an inline tvf), but cancelled it after 5 minutes. It's an interesting (and informative) approach, however it would seem simplicity in this case doesn't equate to performance.

That's odd.

How many sets are your trying to generate?

It took 7 seconds on my laptop to generate 10k sets of up to 30 numbers from the selection of 1000.

_____________
Code for TallyGenerator

Sergiy SSC Guru Points: 110208 More actions · Answer 12

Here is the code I used:

CREATE FUNCTION dbo.TallyN (

@EndNumber bigint

)

RETURNS TABLE

AS RETURN

(-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE

WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

SELECT TOP(@EndNumber) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9

)

GO

SELECT n.N ID, STUFF(

(SELECT ',' + CONVERT(VARCHAR(20), n3.RandomN)

FROM

--picking a random quantity of random numbers per ID

(SELECT TOP 1 N topn FROM dbo.TallyN(30) WHERE n.N > 0 ORDER BY NEWID() ) n2

-- now picking the random numbers themselves, [topn] of different numbers per each ID

CROSS APPLY (SELECT TOP (topn) N RandomN FROM dbo.TallyN(1000) WHERE n.N > 0 ORDER BY NEWID()) n3

FOR XML PATH('')

), 1,1,'')

--generating ID's - may be coming from a table with actual set of ID's, does not have to be generated

FROM dbo.TallyN(10000) n

ORDER BY id

Execution time - 7 seconds.

_____________
Code for TallyGenerator