June 25, 2016 at 8:46 pm
Howdy SSC,
I'm trying to generate some realistic test data to do some load testing. The app I'm testing takes a parameter of a comma delimited list of integers, representing different stocks. For simplicity's sake, lets say represented by the numbers 1 through 10,000. The app caches calls, so I can't just pass the same set of symbols each time, or it won't hit all the code I want. Order of the subset doesn't matter to me
What I'd like to do is generate about a million random subsets of that list with each subset containing, say, between 1 and 100 identifiers.
So a dumbed down version might be, given [1,2,3,4,5] and I want to generate n=5 subsets of between 1 and 3 symbols
1) [1,3]
2) [2,3,5]
3) [4]
4) [1,3]
5) [2,3,4,5]
While i recognize this is HORRIBLY RBAR, this is what I'm trying to get at. Any suggestions for scaling this up and getting rid of the loop would be most welcome
declare
@numSubsets int = 50,
@subsetMaxSize int = 5,
@i int = 1,
@topCt int
if object_id('tempdb.dbo.#symbols') is not null drop table #symbols
create table #symbols
(
SymbolId int primary key clustered
)
if object_id('tempdb.dbo.#Subsets') is not null drop table #Subsets
create table #Subsets
(
SubsetId int,
SymbolId int
)
insert into #symbols (SymbolId)
select top 5 row_number() over (order by (select null))
from sys.all_objects
while @i <= @numSubsets
begin
select @topCt = (abs(checksum(newid())) % @subsetMaxSize)
insert into #Subsets
(
SubsetId,
SymbolId
)
select top (@topCt)
SubsetId = @i,
SymbolId = SymbolId
from #Symbols
order by newid()
select @i += 1
end
select top 1000 *
from #Subsets
June 26, 2016 at 12:35 am
Here two fairly simple set based solutions based on the same logic. The first is pretty fast generating smaller sets and doesn't use a temporary table, the second one uses a temporary table and is better suited for larger sets.
😎
USE tempdb;
GO
SET NOCOUNT ON;
-- Test set configurations:
-- @ENTRY_NUMBER, size of the population
DECLARE @ENTRY_NUMBER INT = 10000;
-- @MAX_SET_ENTRIES, maximum number of entries from the population
-- in each subset.
DECLARE @MAX_SET_ENTRIES INT = 100;
-- @SAMPLE_SIZE, number of subsets
DECLARE @SAMPLE_SIZE INT = 1000;
-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE
;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
-- BASE_SET returns @SAMPLE_SIZE number of rows with a pseudo random number from
-- 1 to @MAX_SET_ENTRIES which is controls the subset size for the entry.
,BASE_SET AS
(
SELECT
NM.N AS SAMPLE_ID
,((ABS(CHECKSUM(NEWID())) % @MAX_SET_ENTRIES) + 1) AS SAMPLE_SIZE
FROM NUMS NM
)
-- BASE_WITH_SUBSETS adds one row to the set for each member of the subset
-- and the numerical identifier for that member
,BASE_WITH_SUBSETS AS
(
SELECT DISTINCT
BS.SAMPLE_ID
,BS.SAMPLE_SIZE
,CONVERT(VARCHAR(12),((ABS(CHECKSUM(NEWID())) % @ENTRY_NUMBER) + 1),0) AS SUB_VAL
FROM BASE_SET BS
CROSS APPLY
(
SELECT TOP(BS.SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9
) AS SNUM(N)
)
-- The final select concatenates the subsets returning one row for each
-- set and it's subset as a comma delimited value string.
SELECT
BWS.SAMPLE_ID
,STUFF((
SELECT
CHAR(44) + SBWS.SUB_VAL
FROM BASE_WITH_SUBSETS SBWS
WHERE BWS.SAMPLE_ID = SBWS.SAMPLE_ID
FOR XML PATH('')
),1,1,'') AS SUB_SET
FROM BASE_WITH_SUBSETS BWS
GROUP BY BWS.SAMPLE_ID;
USE tempdb;
GO
SET NOCOUNT ON;
-- Drop the temporary table if it exists
IF OBJECT_ID('tempdb..#TMP_BASE_WITH_SUBSETS') IS NOT NULL DROP TABLE #TMP_BASE_WITH_SUBSETS;
-- Test set configurations:
-- @ENTRY_NUMBER, size of the population
DECLARE @ENTRY_NUMBER INT = 10000;
-- @MAX_SET_ENTRIES, maximum number of entries from the population
-- in each subset.
DECLARE @MAX_SET_ENTRIES INT = 100;
-- @SAMPLE_SIZE, number of subsets
DECLARE @SAMPLE_SIZE INT = 1000000;
-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE
;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
-- BASE_SET returns @SAMPLE_SIZE number of rows with a pseudo random number from
-- 1 to @MAX_SET_ENTRIES which is controls the subset size for the entry.
,BASE_SET AS
(
SELECT
NM.N AS SAMPLE_ID
,((ABS(CHECKSUM(NEWID())) % @MAX_SET_ENTRIES) + 1) AS SAMPLE_SIZE
FROM NUMS NM
)
-- BASE_WITH_SUBSETS adds one row to the set for each member of the subset
-- and the numerical identifier for that member
,BASE_WITH_SUBSETS AS
(
SELECT DISTINCT
BS.SAMPLE_ID
,BS.SAMPLE_SIZE
,CONVERT(VARCHAR(12),((ABS(CHECKSUM(NEWID())) % @ENTRY_NUMBER) + 1),0) AS SUB_VAL
FROM BASE_SET BS
CROSS APPLY
(
SELECT TOP(BS.SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9
) AS SNUM(N)
)
SELECT
BWS.SAMPLE_ID
,BWS.SUB_VAL
INTO #TMP_BASE_WITH_SUBSETS
FROM BASE_WITH_SUBSETS BWS;
CREATE NONCLUSTERED INDEX NCLIDX#TMP_BASE_WITH_SUBSETS_SAMPLE_ID_INCL_SUB_VAL ON #TMP_BASE_WITH_SUBSETS (SAMPLE_ID ASC) INCLUDE (SUB_VAL);
-- The final select concatenates the subsets returning one row for each
-- set and it's subset as a comma delimited value string.
SELECT
BWS.SAMPLE_ID
,STUFF((
SELECT
CHAR(44) + SBWS.SUB_VAL
FROM #TMP_BASE_WITH_SUBSETS SBWS
WHERE BWS.SAMPLE_ID = SBWS.SAMPLE_ID
FOR XML PATH('')
),1,1,'') AS SUB_SET
FROM #TMP_BASE_WITH_SUBSETS BWS
GROUP BY BWS.SAMPLE_ID;
Sample output
SAMPLE_ID SUB_SET
---------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 1182,1185,1287,1303,1347,136,1702,2064,2088,2178,2328,238,2388,2410,2519,2553,2586,2628,2694,2757,2772,2888,3113,3277,3345,3406,3414,3542,3873,3908,3909,3995,4081,4225,4247,4348,4442,4464,4544,4955,4960,5121,5542,5544,5673,5882,5959,6120,6188,6838,6851,698
2 1585,16,1875,2771,5113,5377,5531,6486,694,786,8477,8570
3 1025,1100,1181,1212,1227,1296,1418,1419,1866,1874,188,2009,2049,2261,2325,244,2442,2473,2547,3061,3096,3235,3247,3343,3356,3563,3798,3944,4092,4110,4311,438,4426,4539,4601,4773,4809,4836,4929,4955,4972,4976,4996,5178,5196,5250,53,5400,5582,5605,5791,5807,6
4 1198,1545,1738,1773,1817,1819,2109,2174,2313,2316,2456,2589,279,3419,3782,4076,4139,422,4347,4436,4457,4536,4565,4816,4971,5057,5288,5404,5450,5686,5710,6258,6623,6685,6691,67,6744,6874,696,71,7289,7440,7505,7552,7902,8199,8250,8390,850,8526,8688,893,9151,
5 1151,119,1298,1380,1394,1478,1551,1552,1603,1702,1989,1996,2158,2211,2278,233,254,2563,2712,2748,2763,2882,2889,3028,3075,3276,3385,3410,3470,4006,4111,4151,4298,4579,4650,5084,5279,5296,5347,5488,5792,5883,5898,6018,6223,636,6374,6381,6399,6531,6674,6676,
6 128,1367,1662,2189,2717,3537,3660,4307,4422,6662,7956,8131,8975
7 1322,140,1534,1551,1681,2292,241,2431,2628,2649,2679,2815,284,2949,30,3159,322,3264,3852,413,4440,4523,4670,4936,4937,495,5005,5048,5056,5099,5964,6079,6331,6446,6693,694,7414,8126,8178,870,9089,9238,9241,9262,9369,9474,9610,9650,9815
8 1119,1446,1596,1635,1700,1788,1809,2003,2186,2259,2483,2646,270,2721,2791,2851,2880,2892,2966,3024,3153,3160,3200,3212,3213,3299,3309,3467,3496,3554,3757,3767,4178,4188,4400,447,4485,450,5035,504,53,5466,5531,5612,5675,5765,5899,5902,6037,6043,6045,6100,63
9 1093,12,1212,1326,1417,1474,1604,1652,1762,1829,1871,190,1925,2090,2155,2692,2790,2797,2925,3062,3077,315,3543,3845,42,4599,4997,5074,5149,5202,5222,5233,5521,5554,5663,5797,5952,6148,6173,6203,626,6275,635,6534,6591,6600,6711,6968,7013,711,7311,7319,7374,
10 1719,1820,2325,2578,3426,3560,3666,3806,3817,3875,4149,4726,4776,5530,5641,6185,6403,7118,7340,740,7401,7520,7645,8293,8472,8483,8564,9054,907,9691,9747
Edit: added the second solution using a temporary table.
June 26, 2016 at 7:26 am
June 26, 2016 at 9:52 am
Hi Erikur,
Thanks again for the reply. To preface this, your post was VERY helpful, and I doubt I would have got to the version I'm happy with now without your suggestions. I'll just say where I had to make some changes. Yours didn't do quite what I wanted right out of the box, but it gave me enough idea to get it working for what I needed. For what it's worth, when I ran your temp table version ran for about 50 minutes on my machine before I just killed it. Truthfully I didn't to a lot of profiling on it because there were a few other things I wanted to rewrite anyway. While using a random number mod @ENTRY_NUMBER is a really clever way of getting symbols, it can produce duplicates in a single set. And, unfortunately in my real set, the numbers are not uniformly distributed between 1 and 10000. Not a big deal, and given the information I gave you, perfectly valid assumptions.
After working through your sample code, here's what I came up with. All in all, it took about a 5 minutes to generate 1 million samples - start to finish. I'ts pretty disk intensive on tempdb (part of that may be the setup I have at home), but memory and memory never got too bad.
Note: I already had a persisted tally table (in my script, sysmon.dbo.numbers; I have one on ever server I ever use) so I don't use an in-line tally table
I also tried batching various steps like pulling the symbols, or serializing them, but all took longer, and didn't seem to make a sizable impact on system resources.
use tempdb
go
set nocount on
go
declare
@universeSize int = 10000,
@maxSubsetSize int = 100,
@maxSubsets int = 1000000
if object_id('tempdb.dbo.#Symbols') is not null drop table #Symbols
create table #Symbols
(
SymbolId int primary key clustered
)
if object_id('tempdb.dbo.#Samples') is not null drop table #Samples
create table #Samples
(
SampleId int primary key clustered,
SampleSize int,
Samples nvarchar(max)
)
if object_id('tempdb.dbo.#SampleSymbols') is not null drop table #SampleSymbols
create table #SampleSymbols
(
SampleId int,
SymbolId int
--primary key clustered (SampleId, SymbolId) with (data_compression = page)
)
-- This will normally be derived from a real table, so you can't assume uniform distribution, or even that all the values will be lower than some number.
-- Get a list of
insert into #Symbols (SymbolID)
select top 10000 SymbolID = num
from sysmon.dbo.numbers
order by newid() -- Just to give the list an air of randomness, so you don't try to use tricks involving sequential numbers. Not strictly necessary at all.
-- Insert an integer identifer for each sample you want to generate, plus a random number between 1 and @maxSubsetSize saying how many symbols should be in the sample
-- This serves as the framework for each group. SampleID is the identifier ofa given group
-- SampleSize is how mny rows it should randomly take off the symbol list
insert into #Samples
(
SampleID,
SampleSize
)
select top (@maxSubsets)
SampleID = num,
SampleSize = (abs(checksum(newid())) % @maxSubsetSize) + 1
from sysmon.dbo.numbers
-- DON PUT A CLUSTERED INDEX ON THIS. ITS WAAAAAAAAY FASTER INTO A HEAP
-- For each sample, cross apply a randomized subset of #Symbols with a top count = [SampleSize] (which we set in the previou statement)
insert into #SampleSymbols
(
SampleID,
SymbolId
)
select
s.SampleId,
x.SymbolId
from #Samples s
cross apply (select top (s.SampleSize) SymbolId
from #Symbols
order by newid()) x
--1.5 min on on about 50m rows
-- Cluster one up homie
-- Note I tried this witout clustering the #SampleSymbols table and it did horribly
create unique clustered index #PKC__#SampleSymbols on #SampleSymbols (SampleID, SymbolID)
with (Data_compression = none, maxdop = 4, online = off)
-- Serialize data
update o
set Samples = '[' + stuff((select ',{XID:' + cast(SymbolId as varchar(30)) + '}'
from #SampleSymbols i
where i.SampleId = o.sampleId
for xml path('')), 1, 1, '') + ']'
from #Samples o
June 27, 2016 at 7:23 am
Hi Xedni,
you are very welcome. I do see one problem with your code and that is that all subsets of the same size will be identical (duplicates), is this what you intended?
😎
June 27, 2016 at 8:49 am
June 27, 2016 at 7:13 pm
This would be much shorter version:
SELECT n.N ID, STUFF(
(SELECT ',' + CONVERT(VARCHAR(20), n3.RandomN)
FROM
--picking a random quantity of random numbers per ID
(SELECT TOP 1 N topn FROM dbo.TallyGenerator(1,3,1) WHERE n.N > 0 ORDER BY NEWID() ) n2
-- now picking the random numbers themselves, [topn] of different numbers per each ID
CROSS APPLY (SELECT TOP (topn) N RandomN FROM dbo.TallyGenerator(1,1000,1) WHERE n.N > 0 ORDER BY NEWID()) n3
FOR XML PATH('')
), 1,1,'')
--generating ID's - may be coming from a table with actual set of ID's, does not have to be generated
FROM dbo.TallyGenerator(1,10000,1) n --(SELECT id FROM #Numbers GROUP BY id) n
ORDER BY id
dbo.TallyGenerator here is an inline table function generating sequential numbers from @Param1 to @Param2 with step @Param3.
May be replaced with the commonly used CTE.
_____________
Code for TallyGenerator
June 27, 2016 at 7:22 pm
June 27, 2016 at 8:04 pm
Xedni (6/27/2016)
Could you provide the code for TallyGenerator? Since I'm not actually trying to just generate random numbers, it's difficult to see where I'd slot in my legitimate symbols, or how it would scale.
One of the versions is right there, in Erikur's post:
-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE
;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
It uses a single parameter (numbers to ..) .
Parameters "numbers from" and "increment" are hardcoded as 1.
Put this code into a function - and here you go.
_____________
Code for TallyGenerator
June 27, 2016 at 8:16 pm
That's how you do it:
CREATE FUNCTION dbo.TallyGenerator (
@EndNumber bigint
)
RETURNS TABLE
AS RETURN
(-- Inline Tally table producing a number sequence from 1 to @EndNumber
WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
SELECT TOP(@EndNumber) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9
)
GO
SELECT * FROM dbo.TallyGenerator(15) tg
_____________
Code for TallyGenerator
June 27, 2016 at 8:43 pm
Gotcha. Unfortunately, that function doesn't have the same signature as the one in your original code sample. I modified the function to accept all three parameters (start, end and increment; num % @increment, while still being an inline tvf), but cancelled it after 5 minutes. It's an interesting (and informative) approach, however it would seem simplicity in this case doesn't equate to performance.
June 27, 2016 at 11:42 pm
Xedni (6/27/2016)
Gotcha. Unfortunately, that function doesn't have the same signature as the one in your original code sample. I modified the function to accept all three parameters (start, end and increment; num % @increment, while still being an inline tvf), but cancelled it after 5 minutes. It's an interesting (and informative) approach, however it would seem simplicity in this case doesn't equate to performance.
That's odd.
How many sets are your trying to generate?
It took 7 seconds on my laptop to generate 10k sets of up to 30 numbers from the selection of 1000.
_____________
Code for TallyGenerator
June 28, 2016 at 12:00 am
Here is the code I used:
CREATE FUNCTION dbo.TallyN (
@EndNumber bigint
)
RETURNS TABLE
AS RETURN
(-- Inline Tally table producing a number sequence from 1 to @SAMPLE_SIZE
WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
SELECT TOP(@EndNumber) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9
)
GO
SELECT n.N ID, STUFF(
(SELECT ',' + CONVERT(VARCHAR(20), n3.RandomN)
FROM
--picking a random quantity of random numbers per ID
(SELECT TOP 1 N topn FROM dbo.TallyN(30) WHERE n.N > 0 ORDER BY NEWID() ) n2
-- now picking the random numbers themselves, [topn] of different numbers per each ID
CROSS APPLY (SELECT TOP (topn) N RandomN FROM dbo.TallyN(1000) WHERE n.N > 0 ORDER BY NEWID()) n3
FOR XML PATH('')
), 1,1,'')
--generating ID's - may be coming from a table with actual set of ID's, does not have to be generated
FROM dbo.TallyN(10000) n
ORDER BY id
Execution time - 7 seconds.
_____________
Code for TallyGenerator
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply