April 26, 2012 at 7:23 am
GPO (4/25/2012)
The spam people clearly target the articles that everyone is going to read...
It's ironic that I have the SPAM people to thank for such a nice compliment. Thanks, GPO.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 26, 2012 at 8:11 am
Jeff Moden (4/26/2012)
Cadavre (4/26/2012)
Nicely explained, Jeff. I may have to "borrow" your method of generating random DATETIME data; my method is more difficult to understand when people glance at it. 😀
Borrow away. These aren't my methods. They're pretty well standard for folks who have been using NEWID() for such things over the years because they fall into the classic random-number mathematical formulas.
If you don't mind, could you post your method? It's always interesting to see how others do things.
Certainly. Generally, when I need "random" datetime data I go with this:
IF object_id('tempdb..#testEnvironment') IS NOT NULL
BEGIN
DROP TABLE #testEnvironment
END
--1,000,000 "Random" rows of data
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
RAND(CHECKSUM(NEWID())) * 366 /*(Number of days)*/ + CAST('2000' AS DATETIME) /*(Start date, e.g. '2000-01-01 00:00:00')*/ AS randomDate
INTO #testEnvironment
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;
If instead I want a "random" date, I normally go with this:
IF object_id('tempdb..#testEnvironment') IS NOT NULL
BEGIN
DROP TABLE #testEnvironment
END
--1,000,000 "Random" rows of data
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
DATEADD(DAY,((ABS(CHECKSUM(NEWID())) % 366 /*(Number of days)*/) + 1),CAST('2000' AS DATE) /*(Start date, e.g. '2000-01-01')*/) AS randomDate
INTO #testEnvironment
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;
On an internal wiki page at work, I added a page some time ago with a brief explanation of how to create pseudo-random data. The script looks like this:
--Standard TestEnvironment of 1,000,000 rows of random-ish data
IF object_id('tempdb..#testEnvironment') IS NOT NULL
BEGIN
DROP TABLE #testEnvironment;
END;
--1,000,000 Random rows of data
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID,
RAND(CHECKSUM(NEWID())) * 30000 + CAST('1945' AS DATETIME) AS randomDateTime,
--SQL SERVER 2008 ONLY!! ONLY FOR USE ON VERSION 10.0 AND ABOVE (requires the DATE data type)
DATEADD(DAY,((ABS(CHECKSUM(NEWID())) % 366) + 1),CAST('2000' AS DATE)) AS randomDate,
ABS(CHECKSUM(NEWID())) AS randomBigInt,
(ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,
RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,
RAND(CHECKSUM(NEWID())) AS randomTinyDec,
RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,
CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney
INTO #testEnvironment
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;
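For readers who haven't seen them before, the "classic random-number mathematical formulas" Jeff refers to above boil down to two one-liners. This is only an illustrative sketch (@Range is just an example parameter), not code from the article or the thread:
DECLARE @Range INT;
SET @Range = 100;
SELECT ABS(CHECKSUM(NEWID())) % @Range + 1 AS RandomInt, --Random integer from 1 to @Range
RAND(CHECKSUM(NEWID())) * @Range AS RandomFloat; --Random float from 0 up to (but not including) @Range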
April 26, 2012 at 5:58 pm
I see what you mean. Same principle... just harder for some folks to see. Thanks, Craig.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 26, 2012 at 6:01 pm
TheSQLGuru (4/25/2012)
Hey Jeff, can you put a downloadable file with the relevant operational code parts of the post? Thanks in advance, and wonderful stuff as always!
If I understand correctly, those are pretty well summarized in the last two sections of the article. Is that what you want as a downloadable file?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 28, 2012 at 2:26 pm
This two-part series was really useful. Thanks.
Any suggestions on how to generate test data for the following scenarios:
1. Two tables linked using a PK-FK relationship, e.g. Product Category and Product Subcategory
2. Self-referential tables, like an Employee table with EmployeeId, ManagerId, <Other employee details>
3. Using the master tables from 1 and 2, generate a table that has ProductFK, EmployeeFK, <Some data>, as in a data warehouse.
April 28, 2012 at 4:02 pm
Jeff Moden (4/26/2012)
TheSQLGuru (4/25/2012)
Hey Jeff, can you put a downloadable file with the relevant operational code parts of the post? Thanks in advance, and wonderful stuff as always!
If I understand correctly, those are pretty well summarized in the last two sections of the article. Is that what you want as a downloadable file?
Certainly; I'm too lazy to do a cut-and-paste to my Sandbox DB...
April 30, 2012 at 6:07 am
Samrat Bhatnagar (4/28/2012)
This two-part series was really useful. Thanks.
Any suggestions on how to generate test data for the following scenarios:
1. Two tables linked using a PK-FK relationship, e.g. Product Category and Product Subcategory
2. Self-referential tables, like an Employee table with EmployeeId, ManagerId, <Other employee details>
3. Using the master tables from 1 and 2, generate a table that has ProductFK, EmployeeFK, <Some data>, as in a data warehouse.
Sure. I might be able to include some of that in part 3.
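In the meantime, here is a rough sketch of one way to tackle scenario 1: build the parent table first, then give every child row a random existing parent key. The table and column names below are assumptions for illustration only, not anything from the article:
IF OBJECT_ID('tempdb..#Category') IS NOT NULL
BEGIN
DROP TABLE #Category;
END;
IF OBJECT_ID('tempdb..#SubCategory') IS NOT NULL
BEGIN
DROP TABLE #SubCategory;
END;
--10 parent rows
SELECT TOP 10 IDENTITY(INT,1,1) AS CategoryID
INTO #Category
FROM master.dbo.syscolumns;
--1,000 child rows, each pointing at a random one of the 10 parents
SELECT TOP 1000 IDENTITY(INT,1,1) AS SubCategoryID,
(ABS(CHECKSUM(NEWID())) % 10) + 1 AS CategoryID --Random FK into the 10 parents
INTO #SubCategory
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2;
Scenario 2 can follow the same idea: populate the EmployeeId values first, then set each row's ManagerId to a random EmployeeId lower than its own, so the hierarchy can't loop back on itself.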
--Jeff Moden
Change is inevitable... Change for the better is not.
April 30, 2012 at 10:00 pm
Jeff Moden (4/30/2012)
Samrat Bhatnagar (4/28/2012)
This two-part series was really useful. Thanks.
Any suggestions on how to generate test data for the following scenarios:
1. Two tables linked using a PK-FK relationship, e.g. Product Category and Product Subcategory
2. Self-referential tables, like an Employee table with EmployeeId, ManagerId, <Other employee details>
3. Using the master tables from 1 and 2, generate a table that has ProductFK, EmployeeFK, <Some data>, as in a data warehouse.
Sure. I might be able to include some of that in part 3.
How about something in part XXX about generating data with random gaps and islands? 🙂
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
May 1, 2012 at 5:04 am
Heh... XXX... did you really mean "Part 30"?
Random gaps and islands are easy, although known gaps and islands make life a whole lot easier test-wise. Just build a wad of sequential dates and randomly delete a percentage of them.
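A minimal sketch of that approach (the one-year date range and the roughly 20% deletion rate are just example values):
IF OBJECT_ID('tempdb..#Dates') IS NOT NULL
BEGIN
DROP TABLE #Dates;
END;
--366 sequential dates starting at 2000-01-01
SELECT TOP 366 DATEADD(dd, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1, CAST('2000' AS DATETIME)) AS TheDate
INTO #Dates
FROM master.dbo.syscolumns;
--Randomly delete about 20% of the rows to create the gaps (and, therefore, the islands)
DELETE FROM #Dates
WHERE ABS(CHECKSUM(NEWID())) % 100 < 20;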
--Jeff Moden
Change is inevitable... Change for the better is not.
May 4, 2012 at 4:08 pm
Jeff,
I may have missed it glancing through all of the posts. When do you plan to produce Part 3?
Thanks for the effort you expended on such excellent documentation.
LC
May 19, 2012 at 6:33 pm
Lee Crain (5/4/2012)
Jeff, I may have missed it glancing through all of the posts. When do you plan to produce Part 3?
Thanks for the effort you expended on such excellent documentation.
LC
Apologies for the late response. It's one of those things where you say to yourself that you'll answer that one "tomorrow".
I'd intended to be done with Part 3 by now but have barely scratched the surface of it. I'm going to try to get it done this week and submit it next weekend. It takes about 4 to 6 weeks after submittal for an article to come out.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 19, 2012 at 7:41 pm
Thanks. I'm looking forward to it, as is my company's software development staff.
LC
May 19, 2012 at 8:54 pm
Now that we're all up-to-speed on generating test data, 😉 you know what would be trez cool (not suggesting that Jeff should have to do it, but it would be great if it existed)? A step-by-step guide to setting up an empirical test environment. It seems to me that there are too many traps for us new players that will lead us to incorrectly conclude that method A is better/worse/no different from method B.
Issues to consider, for example:
:: I've read recently that you shouldn't conclude that, because a query takes X seconds to bring the data to your screen, it's a fair representation of how long the query took to run. Most of the "execution" time could simply be shipping a million rows of data over the network. The workaround might be to run the data into a temp table or a table variable... or... something (he said, knowing he was out of his depth).
:: How should we treat the cache and buffers?
:: How do we set up a timer? I simply set variables to GETDATE() at the start and end of what I'm trying to test and DATEDIFF them. Is this reasonable? (See the sketch after this list.)
:: What are the pitfalls of taking the execution plan's "cost relative to batch" and "estimated subtree cost" literally? I've seen them be wildly inaccurate, and not just because of outdated statistics and so on. It's often because the optimizer can't accurately estimate the cost of a scalar function, for example.
:: I've used Adam Machanic's SQLQueryStress tool before because it seems that it can give you an idea of how the query will perform with multiple threads and iterations with a variety of parameters.
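On the timer question, here is a minimal sketch of that GETDATE()/DATEDIFF pattern, with the result dumped into a throwaway variable so that display/network time isn't part of what gets measured (the query and table name are only placeholders):
DECLARE @StartTime DATETIME,
@Bitbucket INT; --Throwaway target so the rows never have to be displayed
SET @StartTime = GETDATE();
SELECT @Bitbucket = someColumn --Placeholder; replace with the query under test
FROM dbo.SomeTable;
SELECT DATEDIFF(ms, @StartTime, GETDATE()) AS DurationMS;
On the cache/buffer question, DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS clear the plan cache and the clean buffer pages respectively, but they belong on a test box only.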
...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important... Bertrand Russell
May 20, 2012 at 4:32 am
Nice article, Jeff!!!
May 20, 2012 at 3:34 pm
GPO (5/19/2012)
Now that we're all up-to-speed on generating test data, 😉 you know what would be trez cool (not suggesting that Jeff should have to do it, but it would be great if it existed)? A step-by-step guide to setting up an empirical test environment. It seems to me that there are too many traps for us new players that will lead us to incorrectly conclude that method A is better/worse/no different from method B.
Issues to consider, for example:
:: I've read recently that you shouldn't conclude that, because a query takes X seconds to bring the data to your screen, it's a fair representation of how long the query took to run. Most of the "execution" time could simply be shipping a million rows of data over the network. The workaround might be to run the data into a temp table or a table variable... or... something (he said, knowing he was out of his depth).
:: How should we treat the cache and buffers?
:: How do we set up a timer? I simply set variables to GETDATE() at the start and end of what I'm trying to test and DATEDIFF them. Is this reasonable?
:: What are the pitfalls of taking the execution plan's "cost relative to batch" and "estimated subtree cost" literally? I've seen them be wildly inaccurate, and not just because of outdated statistics and so on. It's often because the optimizer can't accurately estimate the cost of a scalar function, for example.
:: I've used Adam Machanic's SQLQueryStress tool before because it seems that it can give you an idea of how the query will perform with multiple threads and iterations with a variety of parameters.
You should see the nasty problems that come up with supposedly reliable things like SET STATISTICS TIME ON. I'm mostly convinced that the errors there are the reason the supposed best practice of avoiding scalar UDFs exists. I say, "It Depends," and I have the "guts" of an "SQL Spackle" article set up for it.
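As a rough illustration of the kind of comparison being described (the UDF and table names below are placeholders, not anything from the article), the two measurements can disagree badly for scalar UDFs:
--Measurement 1: SET STATISTICS TIME (the figures can be heavily inflated by the measurement itself for scalar UDFs)
SET STATISTICS TIME ON;
SELECT dbo.SomeScalarUDF(someColumn) AS Result
FROM dbo.SomeTable;
SET STATISTICS TIME OFF;
--Measurement 2: plain duration of the same statement
DECLARE @StartTime DATETIME;
SET @StartTime = GETDATE();
SELECT dbo.SomeScalarUDF(someColumn) AS Result
FROM dbo.SomeTable;
SELECT DATEDIFF(ms, @StartTime, GETDATE()) AS DurationMS;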
--Jeff Moden
Change is inevitable... Change for the better is not.