Introduction
This is the first installment in a series of articles that looks at performance tips applicable to the day-to-day work of most SQL Server programmers. Most would agree that performance is key to the success of any application. In this series, I will present some common scenarios and explain alternate approaches that provide better performance.
This morning I read the article by Wayne Fillis in which he presents an excellent explanation of the NOLOCK query optimizer hint. To demonstrate the performance benefits, he populates a table with 5 million records and runs the queries with and without the NOLOCK hint. He warns that the TSQL code that generates the 5 million rows will take more than an hour on a high-end machine and may take a few hours on a machine with a lower configuration.
We often need to generate a large number of rows for various performance tests, and most of the time it takes a few hours to produce the data we need. It is worth looking into alternate ways to generate data quickly and efficiently. This article presents a very efficient approach to generating a large number of rows.
Most of the time, I have seen people write a loop that runs N times and executes an INSERT statement on each iteration. To generate 5 million rows, you end up executing 5 million INSERT statements, and that accounts for the major part of the delay we experience while generating data. If we could use a single INSERT statement that inserts all 5 million records in one go, we could gain impressive performance benefits.
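To make the contrast concrete, here is a rough sketch of that loop-based pattern. The Numbers table and its single integer column are hypothetical placeholders, not taken from any particular article.

-- Illustrative only: the row-by-row pattern this article tries to avoid.
-- The Numbers table and its column are hypothetical placeholders.
DECLARE @i INT
SET @i = 0

WHILE @i < 5000000
BEGIN
    INSERT INTO Numbers (Number) VALUES (@i)
    SET @i = @i + 1
END

Each iteration is a separate statement (and, under the default autocommit behavior, a separate transaction), which is exactly where the hours go.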
Approach
As I mentioned above, we will use a different approach to generate records: a single INSERT statement that inserts 5 million records into the target table. To do that, we essentially need a SELECT statement that returns the desired number of rows. How do we get there? The steps below demonstrate it.
The first step is to write a query that returns 10 rows containing the numbers 0 to 9. The following TSQL query does that.
SELECT 0 AS Number
UNION SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
Now let us write a query that uses the digits above and generates a result set containing 100 records (0 to 99).
WITH digits AS (
    SELECT 0 AS Number
    UNION SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
)
SELECT (tens.Number * 10) + ones.Number AS Number
FROM digits AS tens
CROSS JOIN digits AS ones
I am using a CTE (Common Table Expression) to simplify the code above. A CROSS JOIN of the digits CTE with itself generates the 100 records on the fly. CTEs were introduced in SQL Server 2005; you can get the same results in SQL Server 2000 by running the following query.
SELECT (tens.Number * 10) + ones.Number AS Number
FROM (
    SELECT 0 AS Number
    UNION SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
) AS tens
CROSS JOIN (
    SELECT 0 AS Number
    UNION SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
) AS ones
The above comparison clearly shows how much a CTE simplifies the code. As the next step, let us generate 10 million records.
WITH digits AS (
    SELECT 0 AS Number
    UNION SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
)
SELECT
    (millions.Number * 1000000)
    + (hThousands.Number * 100000)
    + (tThousands.Number * 10000)
    + (thousands.Number * 1000)
    + (hundreds.Number * 100)
    + (tens.Number * 10)
    + ones.Number AS Number
FROM digits AS ones
CROSS JOIN digits AS tens
CROSS JOIN digits AS hundreds
CROSS JOIN digits AS thousands
CROSS JOIN digits AS tThousands
CROSS JOIN digits AS hThousands
CROSS JOIN digits AS millions
The above TSQL generates 10 million records on the fly; it may take one to two minutes depending on the configuration of your computer. Now that we can produce the required number of rows, we can write the INSERT statement to populate the table we need.
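If you want to run the example below end to end, you need an ORDERS table to insert into. Wayne Fillis's original schema is not reproduced here, so the following is only an assumed minimal definition covering the two columns the INSERT statement uses.

-- Assumed minimal schema for the target table; adjust to match your own test setup.
CREATE TABLE Orders (
    OrderID   INT IDENTITY(1, 1) PRIMARY KEY,
    ProductID INT NOT NULL,
    OrderDate DATETIME NOT NULL
)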
For this demonstration, I will reuse the scenario Wayne Fillis presented, rewriting his first example using the approach we discussed above. The following query generates 5 million records and inserts them into the ORDERS table.
WITH
-- first CTE which returns 10 rows (0-9)
digits AS (
    SELECT 0 AS Number
    UNION SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
)
-- second CTE which returns 10 million rows by using
-- a CROSS JOIN on the first CTE
, dig AS (
    SELECT
        (millions.Number * 1000000)
        + (hThousands.Number * 100000)
        + (tThousands.Number * 10000)
        + (thousands.Number * 1000)
        + (hundreds.Number * 100)
        + (tens.Number * 10)
        + ones.Number AS Number
    FROM digits AS ones
    CROSS JOIN digits AS tens
    CROSS JOIN digits AS hundreds
    CROSS JOIN digits AS thousands
    CROSS JOIN digits AS tThousands
    CROSS JOIN digits AS hThousands
    CROSS JOIN digits AS millions
)
-- third CTE which generates a "Product ID" and "Order Date".
-- Product ID is repeated after 500,000 records.
-- After every 1,000 records, "Order Date" moves one day backward.
, prod AS (
    SELECT Number,
        (Number % 500000) + 1 AS ProductID,
        GETDATE() - (Number / 1000) AS OrderDate
    FROM dig
    WHERE Number < 5000000 -- we need only 5 million records 🙂
)
-- the INSERT statement goes here
INSERT INTO Orders (ProductID, OrderDate)
SELECT ProductID,
    DATEADD(dd, 0, DATEDIFF(dd, 0, OrderDate)) -- strips off "time" from the date value
FROM prod
The above code runs in 2.45 minutes on my laptop (Toshiba Satellite Pro, single core, 1 GB RAM). That is much faster than the loop-based approach, which takes close to 90 minutes to insert the same number of records.
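Your timings will differ, and it is easy to measure them on your own machine. One simple way, sketched below, is to turn on SET STATISTICS TIME around the INSERT and then run a quick sanity check on the generated data; the check assumes the ORDERS table shown earlier.

-- Report CPU and elapsed time for each statement in this session.
SET STATISTICS TIME ON

-- Run the WITH ... INSERT ... Orders statement from the previous section here.

SET STATISTICS TIME OFF

-- Sanity check: row count and the range of generated order dates.
SELECT COUNT(*)       AS TotalRows,
       MIN(OrderDate) AS EarliestOrder,
       MAX(OrderDate) AS LatestOrder
FROM Orders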
Conclusions
I referred to Wayne's example for demonstration purposes only. His article presents a practical scenario in which a large number of records must be generated for various performance tests. This article presents a more efficient TSQL approach to generating that data.