Creating million-row test tables?

  • I read (one of?) Jeff Moden's Exceptional DBA interviews and he alludes to a script/method for creating a million rows of test data.

    I'd like to know how to do that and I'm wondering if anyone here knows what he might be referring to.

    http://www.simple-talk.com/opinion/opinion-pieces/jeff-moden-dba-of-the-day/

    The reference is just after the question where the interviewer asks:

    "What are the most common mistakes and assumptions which crop up again and again?"

  • There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.

    One of the easiest, if the tables are relatively simple, is to use NewID to generate random (junk) data, and insert that into columns. You can usually get a million-row Select statement by joining sys.columns to itself, and use NewID or other functions to generate columns on those rows.

select top 1000000 NewID() as Col1, NewID() as Col2

    from sys.columns sc1

    cross join sys.columns sc2;

(Don't run that on a production server. It can take up a fair amount of resources if the server is under load.)

Another way is to get something like Red Gate's data generator. Click the ad at the top-right of this page and you'll find it there. It's great when you have more complex tables and schemas to populate, like multiple sub-tables and lookup tables.

    Those are the methods I usually use for that kind of thing.

Numbers tables (Jeff calls them Tally tables) and Calendar tables can also be really useful for generating lots of rows of pseudo-data.
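For what it's worth, a minimal Tally table can be built with the same sys.columns cross-join trick described above. This is only a sketch (the table name and fill are illustrative, not Jeff's canonical version):

```sql
-- Minimal Tally (numbers) table sketch: 1,000,000 sequential integers.
-- Table name dbo.Tally is illustrative.
IF OBJECT_ID('dbo.Tally','U') IS NOT NULL DROP TABLE dbo.Tally;

SELECT TOP 1000000
       IDENTITY(INT,1,1) AS N
INTO   dbo.Tally
FROM   sys.columns sc1
CROSS JOIN sys.columns sc2;

-- The clustered PK makes range scans on N cheap.
ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N);
```

Once it exists, joins against dbo.Tally replace most row-by-row loops (generating date ranges, splitting strings, and so on).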

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (1/12/2012)


One of the easiest, if the tables are relatively simple, is to use NewID to generate random (junk) data, and insert that into columns. You can usually get a million-row Select statement by joining sys.columns to itself, and use NewID or other functions to generate columns on those rows. [...]

    I needed one column to be a space-separated string, so for that I used REPLACE(CAST(NewID() as Nvarchar(255)), '-', ' ') and it seems like this will work great for the time being. Thank you very much.

  • Here's the standard test environment I use, normally customised slightly to fit the issue at hand.

    --Standard TestEnvironment of 1,000,000 rows of random-ish data

    IF object_id('tempdb..#testEnvironment') IS NOT NULL

    BEGIN

    DROP TABLE #testEnvironment

    END

    --1,000,000 Random rows of data

    SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,

    RAND(CHECKSUM(NEWID())) * 30000 + CAST('1945' AS DATETIME) AS randomDate,

    ABS(CHECKSUM(NEWID())) AS randomBigInt,

    (ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,

    RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,

    RAND(CHECKSUM(NEWID())) AS randomTinyDec,

    RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,

    CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney

    INTO #testEnvironment

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3

    PRINT '========== BASELINE =========='

    SET STATISTICS TIME ON

    SELECT COUNT(*) FROM #testEnvironment

    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)

Obviously entirely based on Jeff Moden's articles / scripts / posts 😀


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • GSquared (1/12/2012)


    There are a number of ways to do it. Far as I know, Jeff is actually working on an article about them.

Many of us are looking forward to that article. I'm sure it will get done someday ;-)

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

I have a very slight variation of Cadavre's table; I keep it as a static table, and use it for masking data for testing.

    Lowell


--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Wow, helluva response! :w00t: Thanks everyone for your input.

    In the last response, what do you mean by "masking" data?

  • @aurato: Thank you very much for posting the link for Jeff's Interview. I owe you.

All@SSC: It's irrelevant to this thread, but can you please give me pointers to such interviews? Should I start a new thread for it?

  • Also if anyone's interested, I did this to create a column that has one of three values:

    SELECT TOP 1 INUPD

    FROM

    (SELECT NEWID() AS Random, 'INSERT' AS INUPD

    UNION ALL

    SELECT NEWID(), 'UPDATE'

    UNION ALL

    SELECT NEWID(), 'DELETE') A

    ORDER BY Random
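As posted, that SELECT TOP 1 returns one random value per execution. To stamp one of the three values onto every row of a large result set in a single pass, a modulo over CHECKSUM(NEWID()) is a common alternative sketch (column and source tables here are illustrative):

```sql
-- Hypothetical per-row variant: each row independently gets one of
-- three values, roughly evenly distributed, in a single set-based pass.
SELECT TOP 1000000
       CASE ABS(CHECKSUM(NEWID())) % 3
            WHEN 0 THEN 'INSERT'
            WHEN 1 THEN 'UPDATE'
            ELSE 'DELETE'
       END AS INUPD
FROM sys.columns sc1
CROSS JOIN sys.columns sc2;
```

NEWID() is evaluated per row here, so every row rolls its own value; a RAND() call would instead repeat one value across the whole set.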

  • aurato (1/12/2012)


    I needed one column to be a space-separated string, so for that I used REPLACE(CAST(NewID() as Nvarchar(255)), '-', ' ') and it seems like this will work great for the time being. Thank you very much.

    I recommend not creating such a test column in that fashion. In most cases, it doesn't represent the "real world" because it's a rare thing to have spaces in the same position across a large number of rows. In fact, it can really bias performance testing because SQL Server has a tendency to "find the groove" when such data is present.

    Here's a programmable space-delimited data generator. It's the code from the "Tally OH" article which has been modified to be space-delimited instead of comma delimited.

    --=====================================================================================================================

    -- Conditionally drop and recreate a View that will allow us to use NEWID() within a function so we can make

    -- Random numbers in a function and create a function that will create constrained randomized CSV element rows.

    --=====================================================================================================================

    --===== Conditionally drop the objects in the code below to make reruns easier

    IF OBJECT_ID('TempDB.dbo.iFunction' ,'V' ) IS NOT NULL DROP VIEW dbo.iFunction;

IF OBJECT_ID('TempDB.dbo.CreateSpaceDelimited8K','IF') IS NOT NULL DROP FUNCTION dbo.CreateSpaceDelimited8K;

    GO

    CREATE VIEW dbo.iFunction AS

    /**********************************************************************************************************************

    Purpose:

    This view is callable from UDF's which allows us to indirectly get a NEWID() within a function where we can't do such

    a thing directly in the function. This view also solves the same problem for GETDATE().

    Usage:

    SELECT MyNewID FROM dbo.iFunction; --Returns a GUID

    SELECT MyDate FROM dbo.iFunction; --Returns a Date

    Revision History:

    Rev 00 - 06 Jun 2004 - Jeff Moden - Initial creation

    Rev 01 - 06 Mar 2011 - Jeff Moden - Formalize code. No logic changes.

    **********************************************************************************************************************/

    SELECT MyNewID = NEWID(),

    MyDate = GETDATE();

    GO

    CREATE FUNCTION dbo.CreateSpaceDelimited8K

    /**********************************************************************************************************************

    Purpose:

    Create a space delimited result with a programable number of rows, elements per row, minimum # of characters per

    element, and maximum characters per element. The element size is random in nature constrained by the min and max

    characters per element.

    Usage:

    SELECT * FROM dbo.CreateSpaceDelimited8K(@pNumberOfRows, @pNumberOfElementsPerRow, @pMinElementwidth, @pMaxElementWidth)

    Dependencies:

    1. View: dbo.iFunction (Produces a NEWID() usable from within a UDF)

    Programmer's Notes:

1. The randomness of the elements prevents the delimiters from showing up in the same position for each row so that

    SQL Server won't figure that out and cache the information making some splitting techniques seem faster than they

    really are.

2. No validation or constraints have been placed on the input parameters so use with caution. This code can generate

    a lot of data in a couple of heart beats.

    Revision History:

    Rev 00 - 11 May 2007 - Jeff Moden - Initial creation - Only returned one row and wasn't programmable.

    Rev 01 - 26 Jul 2009 - Jeff Moden - Added programmable variables but would only go to 20 characters wide.

    Rev 02 - 06 Mar 2011 - Jeff Moden - Converted to iTVF, added minimum element width, and made it so elements can be

    virtually any size.

    Rev 03 - 13 Jan 2012 - Jeff Moden - Modified to produce space delimited instead of comma delimited.

    **********************************************************************************************************************/

--===== Declare the I/O

    (

    @pNumberOfRows INT,

    @pNumberOfElementsPerRow INT,

    @pMinElementwidth INT,

    @pMaxElementWidth INT

    )

    RETURNS TABLE

    AS

    RETURN

    --===== This creates and populates a test table on the fly containing a

    -- sequential column and a randomly generated CSV Parameter column.

    SELECT TOP (@pNumberOfRows) --Controls the number of rows in the test table

    ISNULL(ROW_NUMBER() OVER (ORDER BY(SELECT NULL)),0) AS RowNum,

    CSV =

    (--==== This creates each CSV

    SELECT CAST(

STUFF( --=== STUFF gets rid of the leading space

( --=== This builds the delimited row with a leading space

    SELECT TOP (@pNumberOfElementsPerRow) --Controls the number of CSV elements in each row

    ' '

    + LEFT(--==== Builds random length variable within element width constraints

    LEFT(REPLICATE('1234567890',CEILING(@pMaxElementWidth/10.0)), @pMaxElementWidth),

    ABS(CHECKSUM((SELECT MyNewID FROM dbo.iFunction)))

    % (@pMaxElementWidth - @pMinElementwidth + 1) + @pMinElementwidth

    )

    FROM sys.All_Columns ac3 --Classic cross join pseudo-cursor

CROSS JOIN sys.All_Columns ac4 --can produce row sets up to 16 million rows.

    WHERE ac3.Object_ID <> ac1.Object_ID --Without this line, all rows would be the same.

    FOR XML PATH('')

    )

    ,1,1,'')

    AS VARCHAR(8000))

    )

    FROM sys.All_Columns ac1 --Classic cross join pseudo-cursor

CROSS JOIN sys.All_Columns ac2 --can produce row sets up to 16 million rows

    ;

    GO

    SELECT *

    FROM dbo.CreateSpaceDelimited8K(1000,10,5,10)

    ;

    In case someone is interested, here's my "standard" million row test table example...

    /**********************************************************************************************************************

    Purpose:

    Create a voluminous test table with various types of highly randomized data.

    --Jeff Moden

    **********************************************************************************************************************/

    --===== Conditionally drop the test table to make reruns easier

    IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL

    DROP TABLE dbo.JBMTest

    ;

    --===== Create and populate a 1,000,000 row test table.

    -- "SomeID" has a range of 1 to 1,000,000 unique numbers

    -- "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

-- "SomeMoney" has a range of 0.00 to 100.00 non-unique numbers

    -- "SomeDateTime" has a range of >=01/01/2000 and <01/01/2020 non-unique date/times

    -- "SomeDate" has a range of >=01/01/2000 and <01/01/2020 non-unique "whole dates"

    -- "SomeName" contains random characters at random lengths from 2 to 20 characters

    SELECT TOP 1000000

    SomeID = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID())) % 50000 + 1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))

    + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')),

    SomeMoney = CAST(RAND(CHECKSUM(NEWID())) * 100 AS DECIMAL(9,2)), --Note rounding

    SomeDateTime = RAND(CHECKSUM(NEWID())) * DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),

    SomeDate = ABS (CHECKSUM(NEWID())) % DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),

    SomeName = RIGHT(NEWID(),ABS(CHECKSUM(NEWID())) % 19 + 2)

    INTO dbo.JBMTest

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    ALTER TABLE dbo.JBMTest

    ADD CONSTRAINT PK_JBMTest PRIMARY KEY CLUSTERED (SomeID) WITH FILLFACTOR = 90

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Regarding:

    GSquared (1/12/2012)


    Another way is to get something like RedGate's data generator.

For any serious usage, this is a much better approach than hacking something together based on NEWID. There are probably several such tools available, but this is the one I use personally. You can find details at: http://www.red-gate.com/products/sql-development/sql-data-generator/. It even has a free 14-day trial so you can see what you would be missing. I have no connection with the company concerned, by the way.

  • SQL Kiwi (1/13/2012)


For any serious usage, this is a much better approach than hacking something together based on NEWID. [...]

Agreed, although it's difficult to include such generated data (million rows) in a post or an article. 😀

    --Jeff Moden



  • Jeff Moden (1/15/2012)


    Agreed, although it's difficult to include such generated data (million rows) in a post or an article. πŸ˜€

    A script is certainly compact, but using NEWID means the data (and statistics) will not be exactly reproducible for anyone else. Another option is to create a database for the post or article, and upload that to one of the free file-sharing sites like Media Fire. This way, people don't have to run scripts and will get an exact copy of the data.

    Example: backup of an example database containing a 1 million row data table: http://www.mediafire.com/?0q5rsewn5a0tafv

  • I absolutely agree... the warning about NEWID() not producing precisely the same information for any two runs is well known and the reminder is much appreciated. Most demonstrations of code can withstand such a thing, though.

    Generating the data with a script that uses NEWID() avoids another problem... backwards compatibility for restores. Here's what happens when someone using SQL Server 2005 tries to restore the Example.bak DB you were kind enough to provide.

    --Jeff Moden



  • Jeff Moden (1/16/2012)


    Generating the data with a script that uses NEWID() avoids another problem... backwards compatibility for restores. Here's what happens when someone using SQL Server 2005 tries to restore the Example.bak DB you were kind enough to provide.

    Certainly. One would probably state version requirements, or provide a zipped 2005 database file (mdf only) or backup. Both of these could generally be restored or attached to any version of SQL Server 2005-2012. I happened to use 2008 R2, and should have mentioned that.
