Random Data

Question

Post reply

Random Data

Mark Johnson-176476

SSCommitted

Points: 1603
More actions
November 29, 2007 at 3:45 pm

#201365

I am trying to get a list of random names using something similar to the following query:
DECLARE @FirstNameCount int
SELECT @FirstNameCount = count(*) FROM SampleData..First_Names
SELECT TOP 1000
( SELECT FirstName
FROM SampleData..First_Names
WHERE RecNo = 1 + ABS(CHECKSUM(NEWID())) % @FirstNameCount
) AS FirstName
FROM sysobjects d1, sysobjects d2 -- (any large tables will do)
This works fine in SQL2005 but in SQL2000 I get the same name for all rows returned. The difference in the Execution plan is that SQL 2000 is doing a Stream Aggregate, Assert and Table spool.
StmtText ---------------------------------------------------------------
|--Top(1000)
|--Compute Scalar(DEFINE:([Expr1007]=[Expr1007]))
|--Nested Loops(Left Outer Join)
|--Nested Loops(Inner Join)
| |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d2]))
| |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d1]))
|--Table Spool
|--Assert(WHERE:(If ([Expr1006]>1) then 0 else NULL))
|--Stream Aggregate(DEFINE:([Expr1006]=Count(*), [Expr1007]=ANY([FIRST_NAMES].[FirstName])))
|--Clustered Index Seek(OBJECT:([SampleData].[dbo].[FIRST_NAMES].[PK_FIRST_NAMES]), SEEK:([FIRST_NAMES].[RecNo]=1+abs(checksum(newid()))%[@FirstNameCount]) ORDERED FORWARD)
Why is this? Can I do anything to force a plan like SQL 2005?
Thanks in advance for any help.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

Jason Selburg SSC-Insane Points: 24641 More actions · Answer 1

I may be misunderstanding your request, but why not just use NEWID()

SELECT TOP 1000 FirstName

FROM SampleData..First_Names

ORDER BY NEWID()

______________________________________________________________________

Personal Motto: Why push the envelope when you can just open it?

If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

Jason L. Selburg

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 2

Sorry. I tried to simply the post and may have confused it more.

This is part of a script that inserts data into one of our tables. So I am selecting other data along with the first name. The select for the first name is a subquery and needs to return a single random name from the FIRST_NAMES table but it has to change for each row in the outer select.

Jason Selburg SSC-Insane Points: 24641 More actions · Answer 3

So do you need this to run in 2000 AND 2005?

______________________________________________________________________

Personal Motto: Why push the envelope when you can just open it?

If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

Jason L. Selburg

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 4

Ideally yes. It currently works on 2005. I do need it to work on SQL 2000 but it does not necessarily need to be the same code. Just the same results.

Jason Selburg SSC-Insane Points: 24641 More actions · Answer 5

Well if you're select a bunch of data and only the first name needs to be random for EACH record inserted. then change my query to read "TOP 1".

Otherwise, you'll need to be more specific on your needs.

______________________________________________________________________

Personal Motto: Why push the envelope when you can just open it?

If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

Jason L. Selburg

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 6

Here is some sample code and expected results. When I ran this sample code it worked fine in both SQL 2000 and 2005 but when I use my real tables I get a get a table spool on the first name and the same name returned for each row. I am not sure what is causing the different execution plan.

DROP TABLE [dbo].[FIRST_NAMES]

CREATE TABLE [dbo].[FIRST_NAMES] (

[RecNo] [int] NOT NULL,

[FirstName] [varchar] (100) NULL

) ON [PRIMARY]

GO

CREATE UNIQUE CLUSTERED INDEX [PK_FirstNameRecNo] ON [dbo].[FIRST_NAMES]([RecNo]) ON [PRIMARY]

GO

INSERT INTO FIRST_NAMES (RecNo, FirstName) VALUES (1,'Jim')

INSERT INTO FIRST_NAMES (RecNo, FirstName) VALUES (2,'John')

INSERT INTO FIRST_NAMES (RecNo, FirstName) VALUES (3,'Jeff')

CREATE TABLE [dbo].[LAST_NAMES] (

[RecNo] [int] NOT NULL,

[LastName] [varchar] (100) NULL

) ON [PRIMARY]

GO

CREATE UNIQUE CLUSTERED INDEX [PK_LastNameRecNo] ON [dbo].[LAST_NAMES]([RecNo]) ON [PRIMARY]

GO

INSERT INTO LAST_NAMES (RecNo, LastName) VALUES (1,'Smith')

INSERT INTO LAST_NAMES (RecNo, LastName) VALUES (2,'Jone')

INSERT INTO LAST_NAMES (RecNo, LastName) VALUES (3,'Anderson')

DECLARE @FirstNameCount int

DECLARE @LastNameCount int

SELECT @FirstNameCount = count(*) FROM First_Names

SELECT @LastNameCount = count(*) FROM Last_Names

SELECT TOP 10

( SELECT FirstName

FROM First_Names

WHERE RecNo = 1 + ABS(CHECKSUM(NEWID())) % 3 --@FirstNameCount

) AS FirstName,

( SELECT LastName

FROM Last_Names

WHERE RecNo = 1 + ABS(CHECKSUM(NEWID())) % 3 --@LastNameCount

) AS LastName

FROM sysobjects d1, sysobjects d2

Results:

FirstName LastName

----------- ---------

Jeff Anderson

John Anderson

Jim Jone

John Jone

Jeff Smith

John Anderson

Jim Jone

John Smith

John Anderson

Jim Anderson

Incorrect Result:

FirstName LastName

----------- ---------

Jeff Anderson

Jeff Moden SSC Guru Points: 1004432 More actions · Answer 7

Try this instead...

--DROP TABLE dbo.FirstName,dbo.LastName

--===== Create a couple of test tables

CREATE TABLE dbo.FirstName

(

FirstName VARCHAR(100) NOT NULL,

CONSTRAINT PK_FirstName_FirstName

PRIMARY KEY CLUSTERED (FirstName)

)

CREATE TABLE dbo.LastName

(

LastName VARCHAR(100) NOT NULL,

CONSTRAINT PK_LastName_LastName

PRIMARY KEY CLUSTERED (LastName)

)

--===== Populate the test tables with test data

INSERT INTO dbo.FirstName

(FirstName)

SELECT 'Jim' UNION ALL

SELECT 'Bob' UNION ALL

SELECT 'John' UNION ALL

SELECT 'Joe' UNION ALL

SELECT 'Josh' UNION ALL

SELECT 'Matt' UNION ALL

SELECT 'Serqiy' UNION ALL

SELECT 'Steve' UNION ALL

SELECT 'Tony'

INSERT INTO dbo.LastName

(LastName)

SELECT 'Smith' UNION ALL

SELECT 'Jones' UNION ALL

SELECT 'Moden' UNION ALL

SELECT 'Anderson' UNION ALL

SELECT 'Miller' UNION ALL

SELECT 'Macy' UNION ALL

SELECT 'Bobson'

--===== This creates the randomized list of 10 first and last names

-- Run this as often as you like... each run will be different

SELECT TOP 10

fn.FirstName, ln.LastName

FROM dbo.FirstName fn

CROSS JOIN

dbo.LastName ln

ORDER BY NEWID()

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 8

Thanks for the feedback.

That would work but the performance is terrible (I killed it after 10 minutes). A cross join between my first names and last names produces over 450,000,000 records which then have to be sorted.

The main question I had with the post is "Why am I getting the different execution plans on SQL 2000?" It is processing the subquery once and returning the same value for each row in the outer query on SQL 2000 but in 2005 it correctly executes the subquery for each row in the outer query.

SQL 2005 plan:

StmtText

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

|--Top(TOP EXPRESSION:((10)))

|--Compute Scalar(DEFINE:([SampleData].[dbo].[LAST_NAMES].[LastName]=[SampleData].[dbo].[LAST_NAMES].[LastName]))

|--Nested Loops(Left Outer Join)

|--Nested Loops(Inner Join)

| |--Concatenation

| | |--Index Scan(OBJECT:([mssqlsystemresource].[sys].[sysobjrdb].[ncl02f040]))

| | |--Filter(WHERE:(has_access('CO',[master].[sys].[sysschobjs].[id])=(1)))

| | |--Clustered Index Scan(OBJECT:([master].[sys].[sysschobjs].[clst]), WHERE:([master].[sys].[sysschobjs].[nsclass]=(0) AND [master].[sys].[sysschobjs].[pclass]=(1)))

| |--Row Count Spool

| |--Concatenation

| |--Filter(WHERE:(has_access('CO',[master].[sys].[sysschobjs].[id])=(1)))

| | |--Clustered Index Scan(OBJECT:([master].[sys].[sysschobjs].[clst]), WHERE:([master].[sys].[sysschobjs].[nsclass]=(0) AND [master].[sys].[sysschobjs].[pclass]=(1)))

| |--Index Scan(OBJECT:([mssqlsystemresource].[sys].[sysobjrdb].[ncl02f040]))

|--Clustered Index Seek(OBJECT:([SampleData].[dbo].[LAST_NAMES].[PK_LastNameRecno]), SEEK:([SampleData].[dbo].[LAST_NAMES].[RecNo]=(1)+abs(checksum(newid()))%[@LastNameCount]) ORDERED FORWARD)

SQL 2000 Plan:

|--Top(10)

|--Compute Scalar(DEFINE:([Expr1007]=[Expr1007]))

|--Nested Loops(Left Outer Join)

|--Nested Loops(Inner Join)

| |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d1]))

| |--Row Count Spool

| |--Index Scan(OBJECT:([Paceart_Database].[dbo].[sysobjects].[ncsysobjects2] AS [d2]))

|--Table Spool

|--Assert(WHERE:(If ([Expr1006]>1) then 0 else NULL))

|--Stream Aggregate(DEFINE:([Expr1006]=Count(*), [Expr1007]=ANY([LAST_NAMES].[LastName])))

|--Clustered Index Seek(OBJECT:([SampleData].[dbo].[LAST_NAMES].[PK_LAST_NAMES]), SEEK:([LAST_NAMES].[RecNo]=1+abs(checksum(newid()))%[@LastNameCount]) ORDERED FORWARD)

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 9

Ther query appears to work fine in SQL 2000 until the Last_Names table get more than about 10,000 rows. Then the optimizer changes the execution plan and the query no longer returns the same results.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 10

Try this instead - no cross join. On my machine - 1M records inserted finish in 10 seconds.

declare @g datetime

select @g=getdate()

drop table randomFullname

drop table firstname

drop table lastName

drop table fullRandname

create table randomFullname (rid int identity(1,1) Primary Key, fnID int,lnid int)

create table FirstName (fnID int identity(1,1) Primary Key, FName varchar(100))

create table lastName (lnID int identity(1,1) Primary Key, lName varchar(100))

insert Firstname (fname)

select 'Matt' union all

select 'Jeff' union all

select 'Jason' union all

select 'Mark' union all

select 'Andrew' union all

select 'Bob' union all

select 'Joe' union all

select 'Steve' union all

select 'Kelly' union all

select 'Beverly' union all

select 'Janet' union all

select 'Angela' union all

select 'Brandie' union all

select 'Josephine'

insert Lastname (lname)

select 'Miller' union all

select 'Moden' union all

select 'Smith' union all

select 'Jones' union all

select 'Tarvin' union all

select 'ODoul' union all

select 'Dupont' union all

select 'Kelley' union all

select 'Taylor' union all

select 'Barton'

declare @fncount int

declare @lncount int

select @fncount=count(*) from firstname

select @lncount=count(*) from lastname

insert randomfullname (fnid,lnid)

select top 1000000

cast(rand(checksum(newid()))*(@fncount-1) as int)+1,

cast(rand(checksum(newid()))*(@lncount-1) as int)+1

from Master.dbo.SysColumns t1,

Master.dbo.SysColumns t2

create index ix_rfn1 on randomFullname(fnid,lnid)

select datediff(ms, @g, getdate())

select rid,lname,fname

into fullRandname

from randomFullname r

inner join firstname fn on r.fnid=fn.fnid

inner join lastname ln on r.lnid=ln.lnid

--just to know how long it took

select datediff(ms, @g, getdate())

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 11

Thanks Matt.

That will work.

I would still like to understand why the optimizer chooses to do a table spool on my query and change how the query functions when the Last Names source table exceeds a certain row count.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 12

As I recall, table spools occur when temporary operations get to a point where the operation happening in mem is spooled to disk so that it doesn't continuously need to be re-evaluated. In your case - it's likely that that cartesian product got past a certain size, and the optimizer thought it would be the best way.

Every time I've seen them - they were a sign that something wasn't joined.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Jeff Moden SSC Guru Points: 1004432 More actions · Answer 13

Heh... You weren't supposed to remove the TOP... thought you only wanted ten random names at a time... and I sure didn't know you had a list of over 20,000 names... thought you were setting up some sort of small test.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mark Johnson-176476 SSCommitted Points: 1603 More actions · Answer 14

Sorry Jeff, I do appreciate your feedback. Sometimes trying to simplify the post can confuse the issue. I am taking data from US census on names and street names and randomly piecing it together to get realistic sample data for some performance testing. It all worked great on SQL 2005 but the SQL 2000 optimizer kept getting in my way.

Thanks again for your help.

Mark