Creating million row test tables?

Question

Creating million row test tables?

aurato

SSCrazy

Points: 2575
More actions
January 12, 2012 at 9:48 am

#253439

I read (one of?) Jeff Moden's Exceptional DBA interviews and he alludes to a script/method for creating a million rows of test data.
I'd like to know how to do that and I'm wondering if anyone here knows what he might be referring to.
http://www.simple-talk.com/opinion/opinion-pieces/jeff-moden-dba-of-the-day/
The reference is just after the question where the interviewer asks:
"What are the most common mistakes and assumptions which crop up again and again?"

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.

One of the easiest, if the tables are relatively simple, is to use NewID to generate random (junk) data, and insert that into columns. You can usually get a million-row Select statement by joining sys.columns to itself, and use NewID or other functions to generate columns on those rows.

select top 100000 NewID(), NewID()

from sys.columns

cross join sys.columns;

(Don't run that on a production server. It can take a few resources up if the server is under load.)

Another way is to get something like RedGate's data generator. Click the ad at the top-right of this page, and you'll find that there. It's great when you have more complex tables and schemas to populate, like multiple sub-tables and lookup tables.

Those are the methods I usually use for that kind of thing.

Numbers tables (Jeff calls them Tally tables) and Calendar tables, can also be really useful for generating lots of rows of pseudo-data.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

aurato SSCrazy Points: 2575 More actions · Answer 2

GSquared (1/12/2012)
There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.
One of the easiest, if the tables are relatively simple, is to use NewID to generate random (junk) data, and insert that into columns. You can usually get a million-row Select statement by joining sys.columns to itself, and use NewID or other functions to generate columns on those rows.
select top 100000 NewID(), NewID()
from sys.columns
cross join sys.columns;
(Don't run that on a production server. It can take a few resources up if the server is under load.)
Another way is to get something like RedGate's data generator. Click the ad at the top-right of this page, and you'll find that there. It's great when you have more complex tables and schemas to populate, like multiple sub-tables and lookup tables.
Those are the methods I usually use for that kind of thing.
Numbers tables (Jeff calls them Tally tables) and Calendar tables, can also be really useful for generating lots of rows of pseudo-data.

I needed one column to be a space-separated string, so for that I used REPLACE(CAST(NewID() as Nvarchar(255)), '-', ' ') and it seems like this will work great for the time being. Thank you very much.

Cadavre SSC-Forever Points: 41690 More actions · Answer 3

Here's the standard test environment I use, normally customised slightly to fit the issue at hand.

--Standard TestEnvironment of 1,000,000 rows of random-ish data

IF object_id('tempdb..#testEnvironment') IS NOT NULL

BEGIN

DROP TABLE #testEnvironment

END

--1,000,000 Random rows of data

SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

RAND(CHECKSUM(NEWID())) * 30000 + CAST('1945' AS DATETIME) AS randomDate,

ABS(CHECKSUM(NEWID())) AS randomBigInt,

(ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,

RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,

RAND(CHECKSUM(NEWID())) AS randomTinyDec,

RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,

CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney

INTO #testEnvironment

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

PRINT '========== BASELINE =========='

SET STATISTICS TIME ON

SELECT COUNT(*) FROM #testEnvironment

SET STATISTICS TIME OFF

PRINT REPLICATE('=',80)

Obviously entirely based on Jeff Moden's articles / scripts / posts 😀

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

SQLRNNR SSC Guru Points: 281334 More actions · Answer 4

GSquared (1/12/2012)
There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.

Many of us are looking forward to that article. I'm sure it will get done someday;-)

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Lowell SSC Guru Points: 323518 More actions · Answer 5

i have a very slight variation of Cadavre's same table; i keep it as a static table, and use it for masking data for testing.

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

aurato SSCrazy Points: 2575 More actions · Answer 6

Wow, helluva response! :w00t: Thanks everyone for your input.

In the last response, what do you mean by "masking" data?

Devendrakumar SSC-Forever Points: 42493 More actions · Answer 7

@aurato: Thank you very much for posting the link for Jeff's Interview. I owe you.

All@SSC: It’s irrelevant to this thread but can you please provide me pointers for such interviews. Should I start a new thread for it?

aurato SSCrazy Points: 2575 More actions · Answer 8

Also if anyone's interested, I did this to create a column that has one of three values:

SELECT TOP 1 INUPD

FROM

(SELECT NEWID() AS Random, 'INSERT' AS INUPD

UNION ALL

SELECT NEWID(), 'UPDATE'

UNION ALL

SELECT NEWID(), 'DELETE') A

ORDER BY Random

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 9

aurato (1/12/2012)
I needed one column to be a space-separated string, so for that I used REPLACE(CAST(NewID() as Nvarchar(255)), '-', ' ') and it seems like this will work great for the time being. Thank you very much.

I recommend not creating such a test column in that fashion. In most cases, it doesn't represent the "real world" because it's a rare thing to have spaces in the same position across a large number of rows. In fact, it can really bias performance testing because SQL Server has a tendency to "find the groove" when such data is present.

Here's a programmable space-delimited data generator. It's the code from the "Tally OH" article which has been modified to be space-delimited instead of comma delimited.

--=====================================================================================================================

-- Conditionally drop and recreate a View that will allow us to use NEWID() within a function so we can make

-- Random numbers in a function and create a function that will create constrained randomized CSV element rows.

--=====================================================================================================================

--===== Conditionally drop the objects in the code below to make reruns easier

IF OBJECT_ID('TempDB.dbo.iFunction' ,'V' ) IS NOT NULL DROP VIEW dbo.iFunction;

IF OBJECT_ID('TempDB.dbo.CreateCsv8K','IF') IS NOT NULL DROP FUNCTION dbo.CreateCsv8K;

GO

CREATE VIEW dbo.iFunction AS

/**********************************************************************************************************************

Purpose:

This view is callable from UDF's which allows us to indirectly get a NEWID() within a function where we can't do such

a thing directly in the function. This view also solves the same problem for GETDATE().

Usage:

SELECT MyNewID FROM dbo.iFunction; --Returns a GUID

SELECT MyDate FROM dbo.iFunction; --Returns a Date

Revision History:

Rev 00 - 06 Jun 2004 - Jeff Moden - Initial creation

Rev 01 - 06 Mar 2011 - Jeff Moden - Formalize code. No logic changes.

**********************************************************************************************************************/

SELECT MyNewID = NEWID(),

MyDate = GETDATE();

GO

CREATE FUNCTION dbo.CreateSpaceDelimited8K

/**********************************************************************************************************************

Purpose:

Create a space delimited result with a programable number of rows, elements per row, minimum # of characters per

element, and maximum characters per element. The element size is random in nature constrained by the min and max

characters per element.

Usage:

SELECT * FROM dbo.CreateSpaceDelimited8K(@pNumberOfRows, @pNumberOfElementsPerRow, @pMinElementwidth, @pMaxElementWidth)

Dependencies:

1. View: dbo.iFunction (Produces a NEWID() usable from within a UDF)

Programmer's Notes:

1. The randomness of the elements prevents the delimiters for showing up in the same position for each row so that

SQL Server won't figure that out and cache the information making some splitting techniques seem faster than they

really are.

2. No validation or constraints have been place on the input parameters so use with caution. This code can generate

a lot of data in a couple of heart beats.

Revision History:

Rev 00 - 11 May 2007 - Jeff Moden - Initial creation - Only returned one row and wasn't programmable.

Rev 01 - 26 Jul 2009 - Jeff Moden - Added programmable variables but would only go to 20 characters wide.

Rev 02 - 06 Mar 2011 - Jeff Moden - Converted to iTVF, added minimum element width, and made it so elements can be

virtually any size.

Rev 03 - 13 Jan 2012 - Jeff Moden - Modified to produce space delimited instead of comma delimited.

**********************************************************************************************************************/

--===== Declare the I/0

(

@pNumberOfRows INT,

@pNumberOfElementsPerRow INT,

@pMinElementwidth INT,

@pMaxElementWidth INT

)

RETURNS TABLE

AS

RETURN

--===== This creates and populates a test table on the fly containing a

-- sequential column and a randomly generated CSV Parameter column.

SELECT TOP (@pNumberOfRows) --Controls the number of rows in the test table

ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,

CSV =

(--==== This creates each CSV

SELECT CAST(

STUFF( --=== STUFF get's rid of the leading comma

( --=== This builds CSV row with a leading comma

SELECT TOP (@pNumberOfElementsPerRow) --Controls the number of CSV elements in each row

' '

+ LEFT(--==== Builds random length variable within element width constraints

LEFT(REPLICATE('1234567890',CEILING(@pMaxElementWidth/10.0)), @pMaxElementWidth),

ABS(CHECKSUM((SELECT MyNewID FROM dbo.iFunction)))

% (@pMaxElementWidth - @pMinElementwidth + 1) + @pMinElementwidth

)

FROM sys.All_Columns ac3 --Classic cross join pseudo-cursor

CROSS JOIN sys.All_Columns ac4 --can produce row sets up 16 million.

WHERE ac3.Object_ID <> ac1.Object_ID --Without this line, all rows would be the same.

FOR XML PATH('')

)

,1,1,'')

AS VARCHAR(8000))

)

FROM sys.All_Columns ac1 --Classic cross join pseudo-cursor

CROSS JOIN sys.All_Columns ac2 --can produce row sets up 16 million rows

;

GO

SELECT *

FROM dbo.CreateSpaceDelimited8K(1000,10,5,10)

;

In case someone is interested, here's my "standard" million row test table example...

/**********************************************************************************************************************

Purpose:

Create a voluminous test table with various types of highly randomized data.

--Jeff Moden

**********************************************************************************************************************/

--===== Conditionally drop the test table to make reruns easier

IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL

DROP TABLE dbo.JBMTest

;

--===== Create and populate a 1,000,000 row test table.

-- "SomeID" has a range of 1 to 1,000,000 unique numbers

-- "SomeInt" has a range of 1 to 50,000 non-unique numbers

-- "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

-- "SomeMoney has a range of 0.00 to 100.00 non-unique numbers

-- "SomeDateTime" has a range of >=01/01/2000 and <01/01/2020 non-unique date/times

-- "SomeDate" has a range of >=01/01/2000 and <01/01/2020 non-unique "whole dates"

-- "SomeName" contains random characters at random lengths from 2 to 20 characters

SELECT TOP 1000000

SomeID = IDENTITY(INT,1,1),

SomeInt = ABS(CHECKSUM(NEWID())) % 50000 + 1,

SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))

+ CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')),

SomeMoney = CAST(RAND(CHECKSUM(NEWID())) * 100 AS DECIMAL(9,2)), --Note rounding

SomeDateTime = RAND(CHECKSUM(NEWID())) * DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),

SomeDate = ABS (CHECKSUM(NEWID())) % DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),

SomeName = RIGHT(NEWID(),ABS(CHECKSUM(NEWID())) % 19 + 2)

INTO dbo.JBMTest

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

ALTER TABLE dbo.JBMTest

ADD CONSTRAINT PK_JBMTest PRIMARY KEY CLUSTERED (SomeID) WITH FILLFACTOR = 90

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 10

Regarding:

GSquared (1/12/2012)
Another way is to get something like RedGate's data generator.

For any serious usage, this is a much better approach than hacking something together based on NEWID. There are probably several such tools available, but this is the one I use personally. You can find details at: http://www.red-gate.com/products/sql-development/sql-data-generator/. It even has a free 14-trial so you can see what you would be missing. I have no connection with the company concerned, by the way.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 11

SQL Kiwi (1/13/2012)
Regarding:
GSquared (1/12/2012)
Another way is to get something like RedGate's data generator.
For any serious usage, this is a much better approach than hacking something together based on NEWID. There are probably several such tools available, but this is the one I use personally. You can find details at: http://www.red-gate.com/products/sql-development/sql-data-generator/. It even has a free 14-trial so you can see what you would be missing. I have no connection with the company concerned, by the way.

Agreed, although it's difficult to include such generated data (million rows) in a post or an article. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 12

Jeff Moden (1/15/2012)
Agreed, although it's difficult to include such generated data (million rows) in a post or an article. 😀

A script is certainly compact, but using NEWID means the data (and statistics) will not be exactly reproducible for anyone else. Another option is to create a database for the post or article, and upload that to one of the free file-sharing sites like Media Fire. This way, people don't have to run scripts and will get an exact copy of the data.

Example: backup of an example database containing a 1 million row data table: http://www.mediafire.com/?0q5rsewn5a0tafv

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 13

I absolutely agree... the warning about NEWID() not producing precisely the same information for any two runs is well known and the reminder is much appreciated. Most demonstrations of code can withstand such a thing, though.

Generating the data with a script that uses NEWID() avoids another problem... backwards compatibility for restores. Here's what happens when someone using SQL Server 2005 tries to restore the Example.bak DB you were kind enough to provide.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150468 More actions · Answer 14

Jeff Moden (1/16/2012)
Generating the data with a script that uses NEWID() avoids another problem... backwards compatibility for restores. Here's what happens when someone using SQL Server 2005 tries to restore the Example.bak DB you were kind enough to provide.

Certainly. One would probably state version requirements, or provide a zipped 2005 database file (mdf only) or backup. Both of these could generally be restored or attached to any version of SQL Server 2005-2012. I happened to use 2008 R2, and should have mentioned that.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi