Return random records in a table-valued function?

  • My overarching goal is to generate sets of random Symptom records for each Enrollee in a drug study, so that for each cycle (period of time), the code will insert a random number of random records for each enrollee.

    I'm trying to return a number of random records from a table, but inside a table-valued function... (which could be my problem).

    CREATE FUNCTION dbo.ufn_GetTopSymptoms (

    @enrollID INT

    , @CTCVersion VARCHAR(20)

    , @NumRecords INT

    )

    RETURNS TABLE

    AS

    RETURN

    SELECT TOP(@NumRecords) ID

    , @enrollID AS EnrolledPatientID

    FROM dbo.Data

    WHERE [Version] = @CTCVersion

    ORDER BY NEWID();

    GO

    but that ORDER BY NEWID() clause is illegal apparently, because here's the error it throws:

    Msg 443, Level 16, State 1, Procedure ufn_GetTopSymptoms, Line 13

    Invalid use of a side-effecting operator 'newid' within a function.

    I was hoping I could return a set of enrollmentIDs and then use CROSS APPLY to generate a random set of records for each enrollmentID... is this not possible with APPLY? I was trying to avoid using a cursor...

    The idea is basically to create all the Symptom records for all the patients in treatment cycle at once by using Enrollee OUTER APPLY dbo.ufn_GetTopSymtoms(dbo.Enrollment.EnrolleeID)

    but that's clearly not working. Is there a way to do this without resorting to a cursor? I saw Paul White's (outer and cross apply) articles, but maybe I misapplied what I read.

    Any pointers on how to do this right?

    Thanks!

    Pieter

  • Unfortunately I don't think you're going to be able to do it within the FUNCTION because of the error it is throwing but there is no reason you can't do it with a CROSS APPLY like this:

    CREATE TABLE #Data

    (

    ID INT

    ,N INT

    );

    CREATE INDEX i1 ON #Data (ID, N);

    CREATE TABLE #Patients

    (

    PatientID INT

    )

    -- Uncomment below to set up 1,000,000 rows

    DECLARE @N INT = 10000; -- 10,000 = ~1,000,000 rows

    WITH Tally (n) AS

    (

    SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

    )

    INSERT INTO #Data

    SELECT a.n, ABS(CHECKSUM(NEWID()))%@N

    -- Always create exactly 100 IDs from the Tally table

    FROM (SELECT TOP 100 n FROM Tally) a

    CROSS APPLY

    (

    SELECT TOP (@N) n

    FROM Tally

    )b;

    WITH Tally (n) AS

    (

    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

    )

    INSERT INTO #Patients

    SELECT n

    FROM Tally;

    DECLARE @Nrecs INT = 10;

    SELECT PatientID, ID, N

    FROM #Patients a

    CROSS APPLY

    (

    SELECT TOP (@Nrecs) ID, N

    FROM #Data b

    WHERE a.PatientID = b.ID

    ORDER BY NEWID()

    ) b;

    --GO

    --DROP TABLE #Data;

    --DROP TABLE #Patients;

    Note the commented out DROP TABLE statements. You can compare the results of the first run with a second run where you only execute the statements starting with the DECLARE of @Nrecs. When I did it, it seemed to produce a different set of Ns on each run. You may want to add an ORDER BY on the query to ensure you clearly see what is getting returned (and that it is different each time).


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Dwain,

    Good to have a definitive answer. Thanks! Looks like I need a good reasonably simple tutorial on CROSS APPLY. I guess I could start with Paul White's two articles. The one thing I really wanted to be able to do was to create a random number of symptoms for each (enrolleeID, TxCycle) combination... but the catch is that all symptoms in the Data table (hey, I didn't name the thing!) where the Grade=5 are fatal, so if one of those is encountered, I would stop adding symptoms for that enrollee. Do I have to resort to an RBAR solution for that? Essentially, the pseudocode would look like this:

    FOR a series of TimeSlots (Cycles)

    add new enrollees (grab random number of random patients)

    Add New Symptom Records for all enrolled Patients

    Update all Enrolled Patients (enrollees) that are still enrolled (Enrollment.LeaveDate IS NULL) Where there exists a fatal Symptom (grade 5)

    NEXT TimeSlot

    Is this necessary, or can I do some of this with CROSS APPLY?

    Thanks!

    Pieter

  • I'm not sure I'm getting the full meaning of your business rules, this being a Sat morning and me having a hangover. However...

    Looking at the code that I provided, within the CROSS APPLY it is returning always the same number of rows (just different ones). Is what you're looking for to return a variable number of rows, stopping when a value on the row has a certain value and then returning no more rows?

    I think I have a piece of code laying around somewhere that does something like that. I'll need to find it. Never used it inside of a CROSS APPLY but I see no reason why it wouldn't work.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Well, at least you have a reason for your brain not working!

    Yes... you have it right. Think of a patient receiving cancer treatments. He will receive treatments and record "symptoms" until he either drops out of the program or dies (in data terms, he has a grade 5 symptom).

    Thanks!

  • I can generate purely random records really easily. TOP (@N) ... ORDER BY CHECKSUM(NewID)

    The really hard part (for me!) is fixing my code so that as the grade increases, the frequency decreases maybe exponentially.

    This is as far as I got...

    USE scridb;

    GO

    DECLARE @NumRecs INT;

    DECLARE @Lower INT = 1;

    DECLARE @Upper INT = 9;

    SELECT @NumRecs = ROUND(((@Upper - @Lower - 1) * RAND() * @Lower),1);

    /*

    SELECT @NumRecs = ABS(CHECKSUM(NEWID()))/250000000;

    */

    SELECT TOP (@NumRecs)

    Data.ID

    , Data.Grade

    , Data.Category

    , Data.Subcategory

    , Tally.N

    FROM Data INNER JOIN Tally ON (6-Data.Grade)>=Tally.N

    WHERE Grade IS NOT NULL

    AND [Version] = 'CTCAE v4.03 (MedDRAv12.0)'

    ORDER BY NEWID();

    I think the part I'm doing wrong is not adding enough copies of the Tally table so that I get a distribution roughly like this:

    Grade 5: 1 / 1000

    Grade 4: 50 / 1000

    Grade 3: 200 / 1000

    Grade 2: 300 / 1000

    Grade 1: 449/ 1000

  • You can try this:

    CREATE TABLE #Data

    (

    ID INT

    ,N INT

    ,Flag INT

    );

    CREATE INDEX i1 ON #Data (ID, N);

    CREATE TABLE #Patients

    (

    PatientID INT

    )

    -- Uncomment below to set up 1,000,000 rows

    DECLARE @N INT = 10000; -- 10,000 = ~1,000,000 rows

    WITH Tally (n) AS

    (

    SELECT TOP (@N) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

    )

    INSERT INTO #Data

    SELECT a.n, ABS(CHECKSUM(NEWID()))%@N, CASE WHEN b.n > 4000 THEN 0 ELSE b.n%100 END

    -- Always create exactly 100 IDs from the Tally table

    FROM (SELECT TOP 100 n FROM Tally) a

    CROSS APPLY

    (

    SELECT TOP (@N) n

    FROM Tally

    )b;

    WITH Tally (n) AS

    (

    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b

    )

    INSERT INTO #Patients

    SELECT n

    FROM Tally;

    DECLARE @Nrecs INT = 100;

    SELECT PatientID, ID, N, Flag

    FROM #Patients a

    CROSS APPLY

    (

    SELECT TOP (@Nrecs) ID, N, Flag

    FROM

    (

    SELECT ID, N, Flag

    FROM (

    SELECT ID, N, Flag, rn

    ,rn0=ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N)

    ,mrn=MIN(rn) OVER (PARTITION BY ID)

    FROM (

    SELECT ID, N, Flag

    -- Flag = 0 is the trigger to stop returning rows

    ,rn=CASE WHEN Flag = 0 THEN ROW_NUMBER() OVER (PARTITION BY ID ORDER BY N) END

    FROM #Data b

    WHERE a.PatientID = b.ID

    ) a

    ) a

    WHERE rn0 <= mrn

    ) b

    ORDER BY NEWID()

    ) b;

    --GO

    --DROP TABLE #Data;

    --DROP TABLE #Patients;

    For my case, when Flag=0 the symptom is fatal.

    Which makes me wonder what constitutes a fatal symptom. Presumably it is a bit more specific than "patient drops dead." 😛


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Dwain,

    First off, thanks for taking the time to crank that out... Of course, it'll take me a while to figure it out!!!

    The grades are from the National Institute of Health - they published a database of Cancer treatment symptoms, and they were the ones to come up with the "1 to 5" grading scale. To me, it looks more like a Richter scale, but that's my opinion! By definition, a "5" in their scale is fatal. (And hence the odd rule... I didn't make it up! I'm nowhere near that creative!)

    In your example, the ID column is analogous to what in mine, a SymptomID? (I think.) So N would correspond to a time value (like a cycle number or week number).... right?

    I didn't make up this craziness... I got it from here: http://www.theradex.com/ctms/downloads.aspx (I just incorporated the database into my stuff, since I figured out that it was a MS Access database in disguise.)

    Thanks!

    Pieter

  • pietlinden (12/22/2013)


    Dwain,

    First off, thanks for taking the time to crank that out... Of course, it'll take me a while to figure it out!!!

    ...

    In your example, the ID column is analogous to what in mine, a SymptomID? (I think.) So N would correspond to a time value (like a cycle number or week number).... right?

    ...

    You're welcome. Like I said, I had the code on file just needed to find and adapt it.

    And I believe the answer to your question is no. My ID is your PatientID. N is the symptom ID.

    The "Flag" column should then be equivalent to your symptom rating.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Okay, thanks for the clarification. I'll post again if I can't figure something else out about the solution.

  • Dwain,

    Okay, must be that the candy canes and eggnog are working... This seems to work okay, and pretty quickly (haven't tested it, but it's less than a second to run the stored procedure.

    Here's some code that works reasonably well... not perfect, but it's fake data...

    The view just lets me use the semi-random number (RandomNo) to generate a different number of Symptom records for each (enrollmentID, cycleNo) combination. It's still flawed, I guess, because I never included any code to grab the MAX(CycleNo) per enrollmentID, but the rest works, and that was the hard part!! Without further ado ...

    CREATE VIEW vw_EnrolledPatients

    AS

    SELECTenrollmentID

    , e_PatientID AS PatientID

    , e_ProtocolNo AS ProtocolNo

    , ROW_NUMBER() OVER(PARTITION BY e_PatientID ORDER BY e_PatientID) AS TxCycle

    , ABS(CHECKSUM(NEWID())%10) AS RandomNo

    FROM Enroll

    WHERE LeaveDate IS NULL;

    -- Generate a random number of Symptoms for each PatientID for random treatment cycles.

    CREATE PROC uspAddSymptoms

    AS

    BEGIN

    -- Generate a random number of Symptoms for each PatientID for random treatment cycles.

    INSERT INTO Symptom(enrollmentID, cycle,SymptomID)

    SELECT EnrollmentID

    , ABS(CHECKSUM(NEWID())%10)+1 AS CycleNo

    , a.ID As SymptomID

    FROM vw_EnrolledPatients

    CROSS APPLY (SELECT TOP(ABS(CHECKSUM(NEWID())%10)) Data.ID

    FROM Data

    WHERE [Version]='CTCAE v4.03 (MedDRAv12.0)'

    AND Grade IS NOT NULL

    ORDER BY NEWID()) a

    ORDER BY enrollmentID, CycleNo, ID;

    -- remove deceased patients

    UPDATE Enroll

    SET LeaveDate = GETDATE()

    WHERE LeaveDate IS NULL -- dont update Patients who have already been removed.

    AND EnrollmentID IN (SELECT enrollmentID

    FROM Symptom INNER JOIN Data ON Symptom.SymptomID = Data.ID

    WHERE Data.Grade = 5); -- <-- Has a fatal symptom

    END

    Thanks!

    Pieter

  • Different route, similar results.

    First some sample data:

    SELECT TOP (50000)

    ID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    [Version] = 'CTCAE v4.03 (MedDRAv12.0)',

    Grade = 1+ABS(CHECKSUM(NEWID()))%5

    INTO Data

    FROM SYSCOLUMNS a, SYSCOLUMNS b

    Second, a view which returns a suitable set from that data:

    CREATE VIEW vwRandomSymptoms AS

    SELECT TOP(1+ABS(CHECKSUM(NEWID()))%10)

    *

    FROM (

    SELECT * FROM (SELECT TOP 449 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 1 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 300 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 2 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 200 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 3 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 50 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 4 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 1 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 5 ORDER BY NEWID()) d

    ) e

    ORDER BY NEWID()

    Third, an inline function to run the view. Note the funny stuff in the WHERE clause to force the optimiser to "run" the function for each outer row rather than once for the query:

    CREATE FUNCTION IF_GetRandomSymptoms

    (

    @Param INT

    )

    RETURNS TABLE

    AS

    RETURN

    (

    SELECT

    v.*,

    CycleNo = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))

    FROM vwRandomSymptoms v

    WHERE v.ID = (v.ID * @Param/@Param)

    )

    Test it like this:

    SELECT *

    FROM (SELECT n = 1 UNION ALL SELECT 2) d

    CROSS APPLY dbo.IF_GetRandomSymptoms (d.n)

    ORDER BY n

    - where derived table d represents your Enroll table. Give it a whirl Pieter, see how you get on.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (1/1/2014)


    Different route, similar results.

    First some sample data:

    SELECT TOP (50000)

    ID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    [Version] = 'CTCAE v4.03 (MedDRAv12.0)',

    Grade = 1+ABS(CHECKSUM(NEWID()))%5

    INTO Data

    FROM SYSCOLUMNS a, SYSCOLUMNS b

    Second, a view which returns a suitable set from that data:

    CREATE VIEW vwRandomSymptoms AS

    SELECT TOP(1+ABS(CHECKSUM(NEWID()))%10)

    *

    FROM (

    SELECT * FROM (SELECT TOP 449 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 1 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 300 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 2 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 200 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 3 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 50 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 4 ORDER BY NEWID()) d

    UNION ALL

    SELECT * FROM (SELECT TOP 1 * FROM Data WHERE [Version] = 'CTCAE v4.03 (MedDRAv12.0)' AND Grade = 5 ORDER BY NEWID()) d

    ) e

    ORDER BY NEWID()

    Third, an inline function to run the view. Note the funny stuff in the WHERE clause to force the optimiser to "run" the function for each outer row rather than once for the query:

    CREATE FUNCTION IF_GetRandomSymptoms

    (

    @Param INT

    )

    RETURNS TABLE

    AS

    RETURN

    (

    SELECT

    v.*,

    CycleNo = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))

    FROM vwRandomSymptoms v

    WHERE v.ID = (v.ID * @Param/@Param)

    )

    Test it like this:

    SELECT *

    FROM (SELECT n = 1 UNION ALL SELECT 2) d

    CROSS APPLY dbo.IF_GetRandomSymptoms (d.n)

    ORDER BY n

    - where derived table d represents your Enroll table. Give it a whirl Pieter, see how you get on.

    I see that you are raising the bar for the New Year Chris! Nice one +1.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • You can also just create a view called (for example) ANewID that generates a GUID as a single column, single row result and simply join to it as a CROSS JOIN or CROSS APPLY.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    I think that's how I got the random number of records per enrollment so that the distribution is irregular. I noticed that Chris's code returns the same number of results per Protocol. I guess this is where I have to re-read parts of Guru's Guide to T-SQL - might be an old book, but it's easy to follow and he deals with a lot of the standard problems in T-SQL (like this one).

    Think I'll read that and post back. Otherwise I may have to resort to the dreaded cursor... (might have to try it just to see how bad it is!) Might work okay as I only need it to run once ...

    Pieter

Viewing 15 posts - 1 through 15 (of 34 total)

You must be logged in to reply to this topic. Login to reply