Generating Dummy Data for Testing
I have worked on a number of projects where there hasn’t been enough data to test reports and extracts and other database applications. The database was a third party database and the vendor wouldn’t allow data to be entered via the back end. We didn’t have enough data to test response times or to tune SQL until after GO-Live. This caused a lot of rework after GO-Live which in turn affected the perception of quality of the project as a whole.
The purpose of this article is to provide a means of generating random data which can be inserted into a database for testing purposes.
The scripts and methodology allow sample data to be generated. The scripts below create a number of tables for storing different data descriptors, e.g. name, street type, number, transactions. Some tables are reused to generate street names and post codes.
If you wanted data that was only based on data type and had no other meaning you could classify the tables by data type; e.g. datetime, character and numeric.
Data is generated by creating Cartesian joins on the tables. Using the generated dataset this gives 8,000,000 (10 x 10 x 1000 x 4 x 20 x 1) rows of data. Id values are generated using a sequence. Child – Parent relationships are created by including the parent id in a Cartesian product with the child data (See below).
Diagram 1 - Populating the Application database
Diagram 2 - Use of Synonyms where changes to the application database are not allowed.
Architecture
The solution is broken into a number of layers. They are:
- the Base Data Layer
- the Reporting Data Layer
These layers can be within different schemas within the database.
Base Data Layer
The base data layer contains the dummy data tables, any data generation scripts, and a sequence or sequences for generating Ids.
Reporting Data Layer
This layer contains derived tables and views which create data similar to what would be seen in the application that you are testing. There may be a number of steps involved in getting data from the base tables into a format that can be imported into your application database.
If data cannot be created in the application database, create tables in the Reporting Data Layer and then create synonyms to them, populate the tables and use the synonyms for reporting. When the system goes live, point the synonyms to the application database. In this case, the Reporting Data Layer and Base Data Layers may be in a separate database to the application database.
How Cartesian joins work
A Cartesian join multiplies every row in a dataset with every row in the joined dataset. Imagine I have two tables with data as shown below:
First_Name Last_Name
=========== ===============
Bill S Preston Esq
Ted Theodore Logan
Attila the Hun Attila the Hun
SQL Server TSQL provides the CROSS JOIN which operates as a Cartesian join, unless you include a WHERE clause which links the two tables; e.g.
Case 1 Cartesian join
Select First_Name.Name + ' ' + Last_Name.Name Surname From First_name Cross Join Last_Name
Result set
Surname
============
Bill S Preston Esq
Ted S Preston Esq
Attila the Hun S Preston Esq
Bill Theodore Logan
Ted Theodore Logan
Attila the Hun Theodore Logan
Bill Attila the Hun
Ted Attila the Hun
Attila the Hun Attila the Hun
Case 2 Inner join
Select First_Name.Name + ' ' + Last_Name.Name Surname From First_name Cross Join Last_Name Where First_Name.Name = Last_Name.Name
Result set
Surname
========
Attila the Hun Attila the Hun
Case 3 Cartesian join with where clause
Select First_Name.Name + ' ' + Last_Name.Name Surname From First_name Cross Join Last_Name Where First_Name.Name = ‘Attila the Hun’
Result set
Surname
===============================
Attila the Hun S Preston Esq
Attila the Hun Theodore Logan
Attila the Hun Attila the Hun
Creating the data
The attached scripts will generate the following database items:
Tables
There are 5 tables created with the names and rowcounts shown below.
Name | Rows |
First_Name | 10 |
Last_Name | 10 |
Numbers | 1000 |
Street_Types | 4 |
Transactions | 20 |
In addition, there is a function and a sequence needed.
Function
- Random_Number
Sequences
- dbo.Dummy
Step 1 Create a new database.
No TSQL is included here as nearly everyone’s hardware setup will be different. The database in this example is called Dummy_Data.
Step 2 Create Database Objects
These scripts create the objects listed above.
USE Dummy_Data GO CREATE TABLE dbo.First_Names( Name VARCHAR(250) NOT NULL PRIMARY KEY ) GO CREATE TABLE dbo.Last_Names( Name VARCHAR(250) NOT NULL PRIMARY KEY ) GO CREATE TABLE dbo.Numbers( Num SMALLINT NOT NULL PRIMARY KEY ) GO CREATE TABLE dbo.Street_Type( Street_Type VARCHAR(20) NOT NULL PRIMARY KEY ) GO CREATE TABLE dbo.Transactions( Date1 DATETIME NULL, Date2 DATETIME NULL, Amount1 MONEY NULL, Amount2 MONEY NULL, Amount3 MONEY NULL ) GO create sequence dbo.Dummy start with 1 increment by 1 GO CREATE FUNCTION Random_Number(@Lower INT, @Upper INT, @seed FLOAT) RETURNS INT AS BEGIN DECLARE @Random INT; SELECT @Random = ROUND(((@Upper - @Lower -1) * @Seed + @Lower), 0) RETURN @Random END GO
Step 3 Populate Names Data
Next we need to insert data into these tables that will be used to generate our other data. Please feel free to modify these values as needed.
INSERT INTO First_Names VALUES('Bill') INSERT INTO First_Names VALUES('Bob') INSERT INTO First_Names VALUES('Buffy') INSERT INTO First_Names VALUES('Ethan') INSERT INTO First_Names VALUES('Jamie') INSERT INTO First_Names VALUES('John') INSERT INTO First_Names VALUES('Kevin') INSERT INTO First_Names VALUES('Lee') INSERT INTO First_Names VALUES('Lincoln') INSERT INTO First_Names VALUES('Megan') GO INSERT INTO Last_Names VALUES('Blutofski') INSERT INTO Last_Names VALUES('Brown') INSERT INTO Last_Names VALUES('Fitz-Waller') INSERT INTO Last_Names VALUES('Fluffy') INSERT INTO Last_Names VALUES('Grey') INSERT INTO Last_Names VALUES('Guinea-Pig') INSERT INTO Last_Names VALUES('Holmes') INSERT INTO Last_Names VALUES('Whalloper') INSERT INTO Last_Names VALUES('Ingram') INSERT INTO Last_Names VALUES('Bonkers') GO INSERT INTO street_type VALUES ('Drive') INSERT INTO street_type VALUES ('Place') INSERT INTO street_type VALUES ('Road') INSERT INTO street_type VALUES ('Street') GO
Step 4 Populate Numbers Table
The numbers table can fulfil a number of uses. The data in this table has been limited to 1000 entries, but you can change that if you feel the need. This script populates the table with consecutive numbers
DECLARE @iVal INT BEGIN SET @iVal = 1; WHILE @iVal <= 1000 BEGIN INSERT INTO Numbers VALUES(@iVal) SET @iVal = @iVal + 1 END END GO
Step 5 Populate Transaction Data
The transaction data consists of two dates and three amounts. The first date occurs in the first half of the month and the second date occurs in the last half of the month. The three amounts have been determined at random. The second amount is 10% of the third amount, and the first amount = the second amount + the third amount. These numbers can be used to emulate sales values, profits etc.
The script to create this data is below.
BEGIN DECLARE @ddate DATETIME DECLARE @ddate2 DATETIME DECLARE @amt MONEY DECLARE @amt2 MONEY DECLARE @amt3 MONEY DECLARE @dollars INT DECLARE @cents INT DECLARE @year INT DECLARE @month INT DECLARE @day INT DECLARE @iLoop INT SELECT @iLoop = 1 WHILE (@iLoop <= 20) BEGIN --dates SELECT @day = dbo.Random_Number(1,14,Rand()) SELECT @month = dbo.Random_Number(1,12,Rand()) SELECT @year = dbo.Random_Number(2010,2013,Rand()) SELECT @ddate = cast(cast(@month AS VARCHAR(2)) + '/' + CAST(@day AS VARCHAR(2)) + '/' + CAST(@year AS VARCHAR(4)) AS DATETIME) SELECT @day = dbo.Random_Number(15,28,Rand()) SELECT @ddate2 = CAST(CAST(@month AS VARCHAR(2)) + '/' + CAST(@day as VARCHAR(2)) + '/' + CAST(@year AS VARCHAR(4)) AS DATETIME) --amounts SELECT @dollars = dbo.Random_Number(15,1000,RAND()) SELECT @cents = dbo.Random_Number(0,99,RAND()) SELECT @amt = CONVERT(MONEY,CAST(@dollars AS VARCHAR(4)) + '.' + RIGHT('00' + CAST(@cents AS VARCHAR(2)),2)) SELECT @dollars = @amt /1.1 SELECT @amt3 = CONVERT(MONEY,CAST(@dollars AS VARCHAR(4)) + '.' + RIGHT('00' + CAST(@cents AS VARCHAR(2)),2)) SELECT @amt2 = @amt - @amt3 INSERT INTO Transactions VALUES(@ddate,@ddate2,@amt,@amt2,@amt3) SELECT @iLoop = @iLoop + 1 END --loop END GO
Step 6 Creating Names
This step uses a CROSS JOIN to create 100 names. The id is generated using the sequence.
SELECT NEXT VALUE FOR dbo.Dummy Id, fn.name First_Name, ln.name Last_Name, nm.Num Street_no, sn.Name as street_name, st.Street_Type INTO Customer_Data FROM first_names fn CROSS JOIN last_names ln CROSS JOIN numbers nm CROSS JOIN last_names sn CROSS JOIN street_type st
Step 7 Creating Transactions
This step uses a CROSS JOIN to populate a new table with transactions and IDs. The Ids will match the customer Ids. Every customer will end up with a full set of 20 transactions.
SELECT nd.Id, tr.* INTO Transaction_Data FROM Customer_Data nd CROSS JOIN Transactions tr
Conclusion
This article has highlighted three things:
- The use of CROSS JOIN
- Without a where clause table A is multiplied by table B
- With a where clause that links the two tables the result is the same as an inner join
- With a where clause that limits the data in either table
- Generating sample data by using CROSS JOIN
- The use of synonyms to provide a level of abstraction to your reporting layer, remembering that any optimisation via indexes will still need to occur on the application database.
I hope this helps some of you.