Creating large semi-random data sets without cursors

  • I have the following structure:

    Protocol--(1,M)--enrollment---(1,M)---[Cycle---(1,M)---Symptom]---(M,1)---ToxicityList

    (Okay, it's not entirely accurate - Cycle is actually a column in Symptom, because Cycle is a degenerate dimension..., but now you understand the basic structure...

    /* just a list of Toxicities (symptoms) ... */

    CREATE TABLE ToxicityList(

    ToxicityID INT IDENTITY,

    ToxicityName VARCHAR(25) NOT NULL,

    Grade INT NOT NULL,

    PRIMARY KEY (ToxicityID)

    UNIQUE (ToxicityName, Grade)

    );

    /* instance of an enrolled Patient reporting a symptom */

    CREATE TABLE Symptom(

    sToxicityID INT,

    sEnrollmentID INT,

    Cycle INT

    PRIMARY KEY (sToxicityID, sEnrollmentID, Cycle));

    Now for the challenge...

    For each Enrollment,

    Create a random number of Cycles (between say 10 and 50)

    For each Cycle

    Create a random number of Symptom records

    Is there a painless way to do this? I read Ken Henderson's chapter from Guru's Guide, and it sort of helped, but I want close to random records. The only catch is that once an enrollment records a Symptom where Grade = 5, he's either dead or out of the protocol.

    Would it make any difference if I did all the creating in TempDB and then appended the unique records to my real table? I'm just hoping to not have to log everything etc. It's just sample data so I can test some things. I wrote something that does this in VBA but it's so slow it's practically worthless. For it to generate in the neighborhood of 100K records would take about a full day!

    Any thoughts?

    Thanks!

    Pieter

  • Here's a start... my standard million row test table. There's a special column in the following that you should look for, as well. You'll recognize it. Should be pretty easy for you to apply similar things in a similar fashion to create the "symptoms" for each cycle.

    --===== Create and populate a 1,000,000 row test table.

    -- Column "RowNum" has a range of 1 to 100,000 unique numbers

    -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    -- Column "SomeCycle"... just for YOU ;-)

    SELECT TOP 1000000

    SomeID = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12),

    SomeCycle = ABS(CHECKSUM(NEWID()))%(50-10+1)+1

    INTO dbo.JBMTest

    FROM Master.sys.All_Columns t1

    CROSS JOIN Master.sys.All_Columns t2; --Lack of join criteria makes this a CROSS-JOIN

    --===== A table is not properly formed unless a Primary Key has been assigned

    -- Takes about 1 second to execute.

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (SomeID);

    If you study the code, especially "your" column, you'll see how it's all very easily constrained to give you only the random data you want. 😉 It's pretty fast, too.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It's also odd that you posted this in "Anything that is NOT about SQL" because this problem is all about SQL. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply