January 12, 2012 at 9:48 am
I read (one of?) Jeff Moden's Exceptional DBA interviews and he alludes to a script/method for creating a million rows of test data.
I'd like to know how to do that and I'm wondering if anyone here knows what he might be referring to.
http://www.simple-talk.com/opinion/opinion-pieces/jeff-moden-dba-of-the-day/
The reference is just after the question where the interviewer asks:
"What are the most common mistakes and assumptions which crop up again and again?"
January 12, 2012 at 9:53 am
There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.
One of the easiest, if the tables are relatively simple, is to use NewID to generate random (junk) data, and insert that into columns. You can usually get a million-row Select statement by joining sys.columns to itself, and use NewID or other functions to generate columns on those rows.
select top 100000 NewID(), NewID()
from sys.columns
cross join sys.columns;
(Don't run that on a production server. It can take a few resources up if the server is under load.)
Another way is to get something like RedGate's data generator. Click the ad at the top-right of this page, and you'll find that there. It's great when you have more complex tables and schemas to populate, like multiple sub-tables and lookup tables.
Those are the methods I usually use for that kind of thing.
Numbers tables (Jeff calls them Tally tables) and Calendar tables, can also be really useful for generating lots of rows of pseudo-data.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 12, 2012 at 9:59 am
GSquared (1/12/2012)
There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.One of the easiest, if the tables are relatively simple, is to use NewID to generate random (junk) data, and insert that into columns. You can usually get a million-row Select statement by joining sys.columns to itself, and use NewID or other functions to generate columns on those rows.
select top 100000 NewID(), NewID()
from sys.columns
cross join sys.columns;
(Don't run that on a production server. It can take a few resources up if the server is under load.)
Another way is to get something like RedGate's data generator. Click the ad at the top-right of this page, and you'll find that there. It's great when you have more complex tables and schemas to populate, like multiple sub-tables and lookup tables.
Those are the methods I usually use for that kind of thing.
Numbers tables (Jeff calls them Tally tables) and Calendar tables, can also be really useful for generating lots of rows of pseudo-data.
I needed one column to be a space-separated string, so for that I used REPLACE(CAST(NewID() as Nvarchar(255)), '-', ' ') and it seems like this will work great for the time being. Thank you very much.
January 12, 2012 at 10:00 am
Here's the standard test environment I use, normally customised slightly to fit the issue at hand.
--Standard TestEnvironment of 1,000,000 rows of random-ish data
IF object_id('tempdb..#testEnvironment') IS NOT NULL
BEGIN
DROP TABLE #testEnvironment
END
--1,000,000 Random rows of data
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
RAND(CHECKSUM(NEWID())) * 30000 + CAST('1945' AS DATETIME) AS randomDate,
ABS(CHECKSUM(NEWID())) AS randomBigInt,
(ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,
RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,
RAND(CHECKSUM(NEWID())) AS randomTinyDec,
RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,
CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney
INTO #testEnvironment
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3
PRINT '========== BASELINE =========='
SET STATISTICS TIME ON
SELECT COUNT(*) FROM #testEnvironment
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
Obviously entirely based on Jeff Moden's articles / scripts / posts π
January 12, 2012 at 10:02 am
GSquared (1/12/2012)
There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.
Many of us are looking forward to that article. I'm sure it will get done someday;-)
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events
January 12, 2012 at 10:17 am
i have a very slight variation of Cadavre's same table; i keep it as a static table, and use it for masking data for testing.
Lowell
January 12, 2012 at 10:20 am
Wow, helluva response! :w00t: Thanks everyone for your input.
In the last response, what do you mean by "masking" data?
January 12, 2012 at 11:56 am
Also if anyone's interested, I did this to create a column that has one of three values:
SELECT TOP 1 INUPD
FROM
(SELECT NEWID() AS Random, 'INSERT' AS INUPD
UNION ALL
SELECT NEWID(), 'UPDATE'
UNION ALL
SELECT NEWID(), 'DELETE') A
ORDER BY Random
January 13, 2012 at 9:06 pm
aurato (1/12/2012)
I needed one column to be a space-separated string, so for that I used REPLACE(CAST(NewID() as Nvarchar(255)), '-', ' ') and it seems like this will work great for the time being. Thank you very much.
I recommend not creating such a test column in that fashion. In most cases, it doesn't represent the "real world" because it's a rare thing to have spaces in the same position across a large number of rows. In fact, it can really bias performance testing because SQL Server has a tendency to "find the groove" when such data is present.
Here's a programmable space-delimited data generator. It's the code from the "Tally OH" article which has been modified to be space-delimited instead of comma delimited.
--=====================================================================================================================
-- Conditionally drop and recreate a View that will allow us to use NEWID() within a function so we can make
-- Random numbers in a function and create a function that will create constrained randomized CSV element rows.
--=====================================================================================================================
--===== Conditionally drop the objects in the code below to make reruns easier
IF OBJECT_ID('TempDB.dbo.iFunction' ,'V' ) IS NOT NULL DROP VIEW dbo.iFunction;
IF OBJECT_ID('TempDB.dbo.CreateCsv8K','IF') IS NOT NULL DROP FUNCTION dbo.CreateCsv8K;
GO
CREATE VIEW dbo.iFunction AS
/**********************************************************************************************************************
Purpose:
This view is callable from UDF's which allows us to indirectly get a NEWID() within a function where we can't do such
a thing directly in the function. This view also solves the same problem for GETDATE().
Usage:
SELECT MyNewID FROM dbo.iFunction; --Returns a GUID
SELECT MyDate FROM dbo.iFunction; --Returns a Date
Revision History:
Rev 00 - 06 Jun 2004 - Jeff Moden - Initial creation
Rev 01 - 06 Mar 2011 - Jeff Moden - Formalize code. No logic changes.
**********************************************************************************************************************/
SELECT MyNewID = NEWID(),
MyDate = GETDATE();
GO
CREATE FUNCTION dbo.CreateSpaceDelimited8K
/**********************************************************************************************************************
Purpose:
Create a space delimited result with a programable number of rows, elements per row, minimum # of characters per
element, and maximum characters per element. The element size is random in nature constrained by the min and max
characters per element.
Usage:
SELECT * FROM dbo.CreateSpaceDelimited8K(@pNumberOfRows, @pNumberOfElementsPerRow, @pMinElementwidth, @pMaxElementWidth)
Dependencies:
1. View: dbo.iFunction (Produces a NEWID() usable from within a UDF)
Programmer's Notes:
1. The randomness of the elements prevents the delimiters for showing up in the same position for each row so that
SQL Server won't figure that out and cache the information making some splitting techniques seem faster than they
really are.
2. No validation or constraints have been place on the input parameters so use with caution. This code can generate
a lot of data in a couple of heart beats.
Revision History:
Rev 00 - 11 May 2007 - Jeff Moden - Initial creation - Only returned one row and wasn't programmable.
Rev 01 - 26 Jul 2009 - Jeff Moden - Added programmable variables but would only go to 20 characters wide.
Rev 02 - 06 Mar 2011 - Jeff Moden - Converted to iTVF, added minimum element width, and made it so elements can be
virtually any size.
Rev 03 - 13 Jan 2012 - Jeff Moden - Modified to produce space delimited instead of comma delimited.
**********************************************************************************************************************/
--===== Declare the I/0
(
@pNumberOfRows INT,
@pNumberOfElementsPerRow INT,
@pMinElementwidth INT,
@pMaxElementWidth INT
)
RETURNS TABLE
AS
RETURN
--===== This creates and populates a test table on the fly containing a
-- sequential column and a randomly generated CSV Parameter column.
SELECT TOP (@pNumberOfRows) --Controls the number of rows in the test table
ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,
CSV =
(--==== This creates each CSV
SELECT CAST(
STUFF( --=== STUFF get's rid of the leading comma
( --=== This builds CSV row with a leading comma
SELECT TOP (@pNumberOfElementsPerRow) --Controls the number of CSV elements in each row
' '
+ LEFT(--==== Builds random length variable within element width constraints
LEFT(REPLICATE('1234567890',CEILING(@pMaxElementWidth/10.0)), @pMaxElementWidth),
ABS(CHECKSUM((SELECT MyNewID FROM dbo.iFunction)))
% (@pMaxElementWidth - @pMinElementwidth + 1) + @pMinElementwidth
)
FROM sys.All_Columns ac3 --Classic cross join pseudo-cursor
CROSS JOIN sys.All_Columns ac4 --can produce row sets up 16 million.
WHERE ac3.Object_ID <> ac1.Object_ID --Without this line, all rows would be the same.
FOR XML PATH('')
)
,1,1,'')
AS VARCHAR(8000))
)
FROM sys.All_Columns ac1 --Classic cross join pseudo-cursor
CROSS JOIN sys.All_Columns ac2 --can produce row sets up 16 million rows
;
GO
SELECT *
FROM dbo.CreateSpaceDelimited8K(1000,10,5,10)
;
In case someone is interested, here's my "standard" million row test table example...
/**********************************************************************************************************************
Purpose:
Create a voluminous test table with various types of highly randomized data.
--Jeff Moden
**********************************************************************************************************************/
--===== Conditionally drop the test table to make reruns easier
IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL
DROP TABLE dbo.JBMTest
;
--===== Create and populate a 1,000,000 row test table.
-- "SomeID" has a range of 1 to 1,000,000 unique numbers
-- "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- "SomeMoney has a range of 0.00 to 100.00 non-unique numbers
-- "SomeDateTime" has a range of >=01/01/2000 and <01/01/2020 non-unique date/times
-- "SomeDate" has a range of >=01/01/2000 and <01/01/2020 non-unique "whole dates"
-- "SomeName" contains random characters at random lengths from 2 to 20 characters
SELECT TOP 1000000
SomeID = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID())) % 50000 + 1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
+ CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')),
SomeMoney = CAST(RAND(CHECKSUM(NEWID())) * 100 AS DECIMAL(9,2)), --Note rounding
SomeDateTime = RAND(CHECKSUM(NEWID())) * DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),
SomeDate = ABS (CHECKSUM(NEWID())) % DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),
SomeName = RIGHT(NEWID(),ABS(CHECKSUM(NEWID())) % 19 + 2)
INTO dbo.JBMTest
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest PRIMARY KEY CLUSTERED (SomeID) WITH FILLFACTOR = 90
;
--Jeff Moden
Change is inevitable... Change for the better is not.
January 13, 2012 at 10:51 pm
Regarding:
GSquared (1/12/2012)
Another way is to get something like RedGate's data generator.
For any serious usage, this is a much better approach than hacking something together based on NEWID. There are probably several such tools available, but this is the one I use personally. You can find details at: http://www.red-gate.com/products/sql-development/sql-data-generator/. It even has a free 14-trial so you can see what you would be missing. I have no connection with the company concerned, by the way.
January 15, 2012 at 7:00 pm
SQL Kiwi (1/13/2012)
Regarding:GSquared (1/12/2012)
Another way is to get something like RedGate's data generator.For any serious usage, this is a much better approach than hacking something together based on NEWID. There are probably several such tools available, but this is the one I use personally. You can find details at: http://www.red-gate.com/products/sql-development/sql-data-generator/. It even has a free 14-trial so you can see what you would be missing. I have no connection with the company concerned, by the way.
Agreed, although it's difficult to include such generated data (million rows) in a post or an article. π
--Jeff Moden
Change is inevitable... Change for the better is not.
January 15, 2012 at 10:56 pm
Jeff Moden (1/15/2012)
Agreed, although it's difficult to include such generated data (million rows) in a post or an article. π
A script is certainly compact, but using NEWID means the data (and statistics) will not be exactly reproducible for anyone else. Another option is to create a database for the post or article, and upload that to one of the free file-sharing sites like Media Fire. This way, people don't have to run scripts and will get an exact copy of the data.
Example: backup of an example database containing a 1 million row data table: http://www.mediafire.com/?0q5rsewn5a0tafv
January 16, 2012 at 4:24 am
I absolutely agree... the warning about NEWID() not producing precisely the same information for any two runs is well known and the reminder is much appreciated. Most demonstrations of code can withstand such a thing, though.
Generating the data with a script that uses NEWID() avoids another problem... backwards compatibility for restores. Here's what happens when someone using SQL Server 2005 tries to restore the Example.bak DB you were kind enough to provide.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 16, 2012 at 5:01 am
Jeff Moden (1/16/2012)
Generating the data with a script that uses NEWID() avoids another problem... backwards compatibility for restores. Here's what happens when someone using SQL Server 2005 tries to restore the Example.bak DB you were kind enough to provide.
Certainly. One would probably state version requirements, or provide a zipped 2005 database file (mdf only) or backup. Both of these could generally be restored or attached to any version of SQL Server 2005-2012. I happened to use 2008 R2, and should have mentioned that.
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply