Random Data

  • I am trying to get a list of random names using something similar to the following query:

    DECLARE @FirstNameCount int

    SELECT @FirstNameCount = count(*) FROM SampleData..First_Names

    SELECT TOP 1000

    ( SELECT FirstName

    FROM SampleData..First_Names

    WHERE RecNo = 1 + ABS(CHECKSUM(NEWID())) % @FirstNameCount

    ) AS FirstName

    FROM sysobjects d1, sysobjects d2 -- (any large tables will do)

    This works fine in SQL2005 but in SQL2000 I get the same name for all rows returned. The difference in the Execution plan is that SQL 2000 is doing a Stream Aggregate, Assert and Table spool.

    StmtText ---------------------------------------------------------------

    |--Top(1000)

    |--Compute Scalar(DEFINE:([Expr1007]=[Expr1007]))

    |--Nested Loops(Left Outer Join)

    |--Nested Loops(Inner Join)

    | |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d2]))

    | |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d1]))

    |--Table Spool

    |--Assert(WHERE:(If ([Expr1006]>1) then 0 else NULL))

    |--Stream Aggregate(DEFINE:([Expr1006]=Count(*), [Expr1007]=ANY([FIRST_NAMES].[FirstName])))

    |--Clustered Index Seek(OBJECT:([SampleData].[dbo].[FIRST_NAMES].[PK_FIRST_NAMES]), SEEK:([FIRST_NAMES].[RecNo]=1+abs(checksum(newid()))%[@FirstNameCount]) ORDERED FORWARD)

    Why is this? Can I do anything to force a plan like SQL 2005?

    Thanks in advance for any help.

  • I may be misunderstanding your request, but why not just use NEWID()

    SELECT TOP 1000 FirstName

    FROM SampleData..First_Names

    ORDER BY NEWID()

    ______________________________________________________________________

    Personal Motto: Why push the envelope when you can just open it?

    If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

    Jason L. Selburg
  • Sorry. I tried to simply the post and may have confused it more.

    This is part of a script that inserts data into one of our tables. So I am selecting other data along with the first name. The select for the first name is a subquery and needs to return a single random name from the FIRST_NAMES table but it has to change for each row in the outer select.

  • So do you need this to run in 2000 AND 2005?

    ______________________________________________________________________

    Personal Motto: Why push the envelope when you can just open it?

    If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

    Jason L. Selburg
  • Ideally yes. It currently works on 2005. I do need it to work on SQL 2000 but it does not necessarily need to be the same code. Just the same results.

  • Well if you're select a bunch of data and only the first name needs to be random for EACH record inserted. then change my query to read "TOP 1".

    Otherwise, you'll need to be more specific on your needs.

    ______________________________________________________________________

    Personal Motto: Why push the envelope when you can just open it?

    If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

    Jason L. Selburg
  • Here is some sample code and expected results. When I ran this sample code it worked fine in both SQL 2000 and 2005 but when I use my real tables I get a get a table spool on the first name and the same name returned for each row. I am not sure what is causing the different execution plan.

    DROP TABLE [dbo].[FIRST_NAMES]

    CREATE TABLE [dbo].[FIRST_NAMES] (

    [RecNo] [int] NOT NULL,

    [FirstName] [varchar] (100) NULL

    ) ON [PRIMARY]

    GO

    CREATE UNIQUE CLUSTERED INDEX [PK_FirstNameRecNo] ON [dbo].[FIRST_NAMES]([RecNo]) ON [PRIMARY]

    GO

    INSERT INTO FIRST_NAMES (RecNo, FirstName) VALUES (1,'Jim')

    INSERT INTO FIRST_NAMES (RecNo, FirstName) VALUES (2,'John')

    INSERT INTO FIRST_NAMES (RecNo, FirstName) VALUES (3,'Jeff')

    CREATE TABLE [dbo].[LAST_NAMES] (

    [RecNo] [int] NOT NULL,

    [LastName] [varchar] (100) NULL

    ) ON [PRIMARY]

    GO

    CREATE UNIQUE CLUSTERED INDEX [PK_LastNameRecNo] ON [dbo].[LAST_NAMES]([RecNo]) ON [PRIMARY]

    GO

    INSERT INTO LAST_NAMES (RecNo, LastName) VALUES (1,'Smith')

    INSERT INTO LAST_NAMES (RecNo, LastName) VALUES (2,'Jone')

    INSERT INTO LAST_NAMES (RecNo, LastName) VALUES (3,'Anderson')

    DECLARE @FirstNameCount int

    DECLARE @LastNameCount int

    SELECT @FirstNameCount = count(*) FROM First_Names

    SELECT @LastNameCount = count(*) FROM Last_Names

    SELECT TOP 10

    ( SELECT FirstName

    FROM First_Names

    WHERE RecNo = 1 + ABS(CHECKSUM(NEWID())) % 3 --@FirstNameCount

    ) AS FirstName,

    ( SELECT LastName

    FROM Last_Names

    WHERE RecNo = 1 + ABS(CHECKSUM(NEWID())) % 3 --@LastNameCount

    ) AS LastName

    FROM sysobjects d1, sysobjects d2

    Results:

    FirstName LastName

    ----------- ---------

    Jeff Anderson

    John Anderson

    Jim Jone

    John Jone

    Jeff Smith

    John Anderson

    Jim Jone

    John Smith

    John Anderson

    Jim Anderson

    Incorrect Result:

    FirstName LastName

    ----------- ---------

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

    Jeff Anderson

  • Try this instead...

    --DROP TABLE dbo.FirstName,dbo.LastName

    --===== Create a couple of test tables

    CREATE TABLE dbo.FirstName

    (

    FirstName VARCHAR(100) NOT NULL,

    CONSTRAINT PK_FirstName_FirstName

    PRIMARY KEY CLUSTERED (FirstName)

    )

    CREATE TABLE dbo.LastName

    (

    LastName VARCHAR(100) NOT NULL,

    CONSTRAINT PK_LastName_LastName

    PRIMARY KEY CLUSTERED (LastName)

    )

    --===== Populate the test tables with test data

    INSERT INTO dbo.FirstName

    (FirstName)

    SELECT 'Jim' UNION ALL

    SELECT 'Bob' UNION ALL

    SELECT 'John' UNION ALL

    SELECT 'Joe' UNION ALL

    SELECT 'Josh' UNION ALL

    SELECT 'Matt' UNION ALL

    SELECT 'Serqiy' UNION ALL

    SELECT 'Steve' UNION ALL

    SELECT 'Tony'

    INSERT INTO dbo.LastName

    (LastName)

    SELECT 'Smith' UNION ALL

    SELECT 'Jones' UNION ALL

    SELECT 'Moden' UNION ALL

    SELECT 'Anderson' UNION ALL

    SELECT 'Miller' UNION ALL

    SELECT 'Macy' UNION ALL

    SELECT 'Bobson'

    --===== This creates the randomized list of 10 first and last names

    -- Run this as often as you like... each run will be different

    SELECT TOP 10

    fn.FirstName, ln.LastName

    FROM dbo.FirstName fn

    CROSS JOIN

    dbo.LastName ln

    ORDER BY NEWID()

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the feedback.

    That would work but the performance is terrible (I killed it after 10 minutes). A cross join between my first names and last names produces over 450,000,000 records which then have to be sorted.

    The main question I had with the post is "Why am I getting the different execution plans on SQL 2000?" It is processing the subquery once and returning the same value for each row in the outer query on SQL 2000 but in 2005 it correctly executes the subquery for each row in the outer query.

    SQL 2005 plan:

    StmtText

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    |--Top(TOP EXPRESSION:((10)))

    |--Compute Scalar(DEFINE:([SampleData].[dbo].[LAST_NAMES].[LastName]=[SampleData].[dbo].[LAST_NAMES].[LastName]))

    |--Nested Loops(Left Outer Join)

    |--Nested Loops(Inner Join)

    | |--Concatenation

    | | |--Index Scan(OBJECT:([mssqlsystemresource].[sys].[sysobjrdb].[ncl02f040]))

    | | |--Filter(WHERE:(has_access('CO',[master].[sys].[sysschobjs].[id])=(1)))

    | | |--Clustered Index Scan(OBJECT:([master].[sys].[sysschobjs].[clst]), WHERE:([master].[sys].[sysschobjs].[nsclass]=(0) AND [master].[sys].[sysschobjs].[pclass]=(1)))

    | |--Row Count Spool

    | |--Concatenation

    | |--Filter(WHERE:(has_access('CO',[master].[sys].[sysschobjs].[id])=(1)))

    | | |--Clustered Index Scan(OBJECT:([master].[sys].[sysschobjs].[clst]), WHERE:([master].[sys].[sysschobjs].[nsclass]=(0) AND [master].[sys].[sysschobjs].[pclass]=(1)))

    | |--Index Scan(OBJECT:([mssqlsystemresource].[sys].[sysobjrdb].[ncl02f040]))

    |--Clustered Index Seek(OBJECT:([SampleData].[dbo].[LAST_NAMES].[PK_LastNameRecno]), SEEK:([SampleData].[dbo].[LAST_NAMES].[RecNo]=(1)+abs(checksum(newid()))%[@LastNameCount]) ORDERED FORWARD)

    SQL 2000 Plan:

    |--Top(10)

    |--Compute Scalar(DEFINE:([Expr1007]=[Expr1007]))

    |--Nested Loops(Left Outer Join)

    |--Nested Loops(Inner Join)

    | |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d1]))

    | |--Row Count Spool

    | |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d2]))

    |--Table Spool

    |--Assert(WHERE:(If ([Expr1006]>1) then 0 else NULL))

    |--Stream Aggregate(DEFINE:([Expr1006]=Count(*), [Expr1007]=ANY([LAST_NAMES].[LastName])))

    |--Clustered Index Seek(OBJECT:([SampleData].[dbo].[LAST_NAMES].[PK_LAST_NAMES]), SEEK:([LAST_NAMES].[RecNo]=1+abs(checksum(newid()))%[@LastNameCount]) ORDERED FORWARD)

  • Ther query appears to work fine in SQL 2000 until the Last_Names table get more than about 10,000 rows. Then the optimizer changes the execution plan and the query no longer returns the same results.

  • Try this instead - no cross join. On my machine - 1M records inserted finish in 10 seconds.

    declare @g datetime

    select @g=getdate()

    drop table randomFullname

    drop table firstname

    drop table lastName

    drop table fullRandname

    create table randomFullname (rid int identity(1,1) Primary Key, fnID int,lnid int)

    create table FirstName (fnID int identity(1,1) Primary Key, FName varchar(100))

    create table lastName (lnID int identity(1,1) Primary Key, lName varchar(100))

    insert Firstname (fname)

    select 'Matt' union all

    select 'Jeff' union all

    select 'Jason' union all

    select 'Mark' union all

    select 'Andrew' union all

    select 'Bob' union all

    select 'Joe' union all

    select 'Steve' union all

    select 'Kelly' union all

    select 'Beverly' union all

    select 'Janet' union all

    select 'Angela' union all

    select 'Brandie' union all

    select 'Josephine'

    insert Lastname (lname)

    select 'Miller' union all

    select 'Moden' union all

    select 'Smith' union all

    select 'Jones' union all

    select 'Tarvin' union all

    select 'ODoul' union all

    select 'Dupont' union all

    select 'Kelley' union all

    select 'Taylor' union all

    select 'Barton'

    declare @fncount int

    declare @lncount int

    select @fncount=count(*) from firstname

    select @lncount=count(*) from lastname

    insert randomfullname (fnid,lnid)

    select top 1000000

    cast(rand(checksum(newid()))*(@fncount-1) as int)+1,

    cast(rand(checksum(newid()))*(@lncount-1) as int)+1

    from Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2

    create index ix_rfn1 on randomFullname(fnid,lnid)

    select datediff(ms, @g, getdate())

    select rid,lname,fname

    into fullRandname

    from randomFullname r

    inner join firstname fn on r.fnid=fn.fnid

    inner join lastname ln on r.lnid=ln.lnid

    --just to know how long it took

    select datediff(ms, @g, getdate())

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks Matt.

    That will work.

    I would still like to understand why the optimizer chooses to do a table spool on my query and change how the query functions when the Last Names source table exceeds a certain row count.

  • As I recall, table spools occur when temporary operations get to a point where the operation happening in mem is spooled to disk so that it doesn't continuously need to be re-evaluated. In your case - it's likely that that cartesian product got past a certain size, and the optimizer thought it would be the best way.

    Every time I've seen them - they were a sign that something wasn't joined.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Heh... You weren't supposed to remove the TOP... thought you only wanted ten random names at a time... and I sure didn't know you had a list of over 20,000 names... thought you were setting up some sort of small test.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry Jeff, I do appreciate your feedback. Sometimes trying to simplify the post can confuse the issue. I am taking data from US census on names and street names and randomly piecing it together to get realistic sample data for some performance testing. It all worked great on SQL 2005 but the SQL 2000 optimizer kept getting in my way.

    Thanks again for your help.

    Mark

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply