Question regarding the difference between two different ways to populate a new table in TSQL

Question

Question regarding the difference between two different ways to populate a new table in TSQL

cathyDePaolo

SSC Eights!

Points: 989
More actions
August 22, 2014 at 6:20 am

#99925

Hi. I was recently told that the following block of code:
select a.customerid,
a.customerName
into NewTable
from(
select customerid,
customername
from database.dbo.customer
union
select customerid,
customername
from database_1.dbo.customer
) a
is better performing and preferred to:
select customerid,
customername
into NewTable
from database.dbo.customer
union
select customerid,
customername
from database_1.dbo.customer
Can anyone comment on this and help me understand why putting the query in a subquery is better performing or preferable to just putting the results into a table? Thank you.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Luis Cazares SSC Guru Points: 183706 More actions · Answer 1

How did you test the preformance?

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

cathyDePaolo SSC Eights! Points: 989 More actions · Answer 2

I checked the execution plans of each, and they are the same. I am of the opinion that the subquery is not going to improve performance of the query, and the execution plans seem to prove me correct. I prefer the straight "select into" version because it is less code and, in my opinion, is a cleaner version.

I was wondering if anyone in the community knows why I would be told, by a database architect, that the subquery version of this query would be better performing. I was forced to change all of my queries to use the subquery version, rather than a straight select into. I was not given a reason that satisfied me, so I am hoping someone in this community can tell me why I would be directed in this manner.

Thank you for your time.

Sean Lange SSC Guru Points: 286590 More actions · Answer 3

Cathy DePaolo (8/22/2014)
I checked the execution plans of each, and they are the same. I am of the opinion that the subquery is not going to improve performance of the query, and the execution plans seem to prove me correct. I prefer the straight "select into" version because it is less code and, in my opinion, is a cleaner version.
I was wondering if anyone in the community knows why I would be told, by a database architect, that the subquery version of this query would be better performing. I was forced to change all of my queries to use the subquery version, rather than a straight select into. I was not given a reason that satisfied me, so I am hoping someone in this community can tell me why I would be directed in this manner.
Thank you for your time.

Because that is the syntax the architect is comfortable with. There is no valid reason that one is preferred over the other. Those two queries are functionally equivalent and it is no surprise that the execution plans are identical.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 4

Cathy DePaolo (8/22/2014)
I checked the execution plans of each, and they are the same. I am of the opinion that the subquery is not going to improve performance of the query, and the execution plans seem to prove me correct. I prefer the straight "select into" version because it is less code and, in my opinion, is a cleaner version.
I was wondering if anyone in the community knows why I would be told, by a database architect, that the subquery version of this query would be better performing. I was forced to change all of my queries to use the subquery version, rather than a straight select into. I was not given a reason that satisfied me, so I am hoping someone in this community can tell me why I would be directed in this manner.
Thank you for your time.

Your architect is incorrect as others have pointed out and the time you spent changing all of your relevant queries was pointlessly wasted.

The exercise isn't a complete waste of time however - you're hopefully encouraged by the responses here to ask the next self-appointed expert who comes along to prove their crazy-*** theory before you waste good company time implementing it. You might also question what significant opportunities for improvement this particular database architect was failing to see.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Luis Cazares SSC Guru Points: 183706 More actions · Answer 5

ChrisM@Work (8/22/2014)
Cathy DePaolo (8/22/2014)
I checked the execution plans of each, and they are the same. I am of the opinion that the subquery is not going to improve performance of the query, and the execution plans seem to prove me correct. I prefer the straight "select into" version because it is less code and, in my opinion, is a cleaner version.
I was wondering if anyone in the community knows why I would be told, by a database architect, that the subquery version of this query would be better performing. I was forced to change all of my queries to use the subquery version, rather than a straight select into. I was not given a reason that satisfied me, so I am hoping someone in this community can tell me why I would be directed in this manner.
Thank you for your time.
Your architect is incorrect as others have pointed out and the time you spent changing all of your relevant queries was pointlessly wasted.
The exercise isn't a complete waste of time however - you're hopefully encouraged by the responses here to ask the next self-appointed expert who comes along to prove their crazy-*** theory before you waste good company time implementing it. You might also question what significant opportunities for improvement this particular database architect was failing to see.

+1000

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Jeff Moden SSC Guru Points: 1004361 More actions · Answer 6

Cathy DePaolo (8/22/2014)
I checked the execution plans of each, and they are the same. I am of the opinion that the subquery is not going to improve performance of the query, [font="Arial Black"]and the execution plans seem to prove me correct.[/font] I prefer the straight "select into" version because it is less code and, in my opinion, is a cleaner version.
I was wondering if anyone in the community knows why I would be told, by a database architect, that the subquery version of this query would be better performing. I was forced to change all of my queries to use the subquery version, rather than a straight select into. I was not given a reason that satisfied me, so I am hoping someone in this community can tell me why I would be directed in this manner.
Thank you for your time.

Gosh, be careful now. In this case, it works out, but the Execution plans never [font="Arial Black"]prove [/font]anything having to do with performance. Not even the ACTUAL execution plans. The numbers that show up in an execution plan are frequently just estimates and the "% of batch" number can be 100% "wrong" (if not wrong, then very deceiving.

First, build the following test table so we have some known example data to work with...

/**********************************************************************************************************************

Purpose:

Create a voluminous test table with various types of highly randomized data.

--Jeff Moden

**********************************************************************************************************************/

--===== Conditionally drop the test table to make reruns easier

IF OBJECT_ID('tempdb..#JBMTest','U') IS NOT NULL

DROP TABLE #JBMTest

;

--===== Create and populate a 1,000,000 row test table.

-- "SomeID" has a range of 1 to 1,000,000 unique numbers

-- "SomeInt" has a range of 1 to 50,000 non-unique numbers

-- "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

-- "SomeMoney has a range of 0.00 to 100.00 non-unique numbers

-- "SomeDateTime" has a range of >=01/01/2000 and <01/01/2020 non-unique date/times

-- "SomeDate" has a range of >=01/01/2000 and <01/01/2020 non-unique "whole dates"

-- "SomeName" contains random characters at random lengths from 2 to 20 characters

SELECT TOP 1000000

SomeID = IDENTITY(INT,1,1),

SomeInt = ABS(CHECKSUM(NEWID())) % 50000 + 1,

SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))

+ CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')),

SomeMoney = CAST(RAND(CHECKSUM(NEWID())) * 100 AS DECIMAL(9,2)), --Note rounding

SomeDateTime = RAND(CHECKSUM(NEWID())) * DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),

SomeDate = ABS (CHECKSUM(NEWID())) % DATEDIFF(dd,'2000','2020') + CAST('2000' AS DATETIME),

SomeName = RIGHT(NEWID(),ABS(CHECKSUM(NEWID())) % 19 + 2)

INTO #JBMTest

FROM sys.all_columns ac1

CROSS JOIN sys.all_columns ac2

;

ALTER TABLE #JBMTest

ADD CONSTRAINT PK_#JBMTest PRIMARY KEY CLUSTERED (SomeID)

;

Then, from the same window, run this code with the actual execution plan turned on where the second snippet actually takes twice as long as the first.

--===== Execution plan says these are the same...

-- Are they?

SET STATISTICS TIME,IO ON;

SELECT TOP 10

*

FROM #JBMTest

ORDER BY SomeDate;

SET STATISTICS TIME OFF;

PRINT '=================================================='

SET STATISTICS TIME,IO ON;

SET ROWCOUNT 10

SELECT *

FROM #JBMTest

ORDER BY SomeDate

SET ROWCOUNT 0

SET STATISTICS TIME OFF;

PRINT '=================================================='

Both pieces of code return exactly the same information from the test table. The Actual Execution plan says that both will take "50% of batch". But if you look at the "Messages" tab, you see the true story.

(10 row(s) affected)

Table '#JBMTest____________________________________________________________________________________________________________000000018739'. Scan count 5, logical reads 6831, physical reads 0, read-ahead reads 12, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

[font="Arial Narrow"] CPU time = 764 ms, elapsed time = 291 ms.[/font]

==================================================

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

(10 row(s) affected)

Table '#JBMTest____________________________________________________________________________________________________________000000018739'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

[font="Arial Black"] CPU time = 1716 ms, elapsed time = 1584 ms.[/font]

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

==================================================

The "% of batch% can actually be 100% wrong as in the following code example. If you don't already have a Tally table, this code will build one for you. If you already have one, then skip that part...

This is to build the Tally table...

--===================================================================

-- Create a Tally table from 1 to 11000

--===================================================================

--===== Create and populate the Tally table on the fly.

SELECT TOP 11000

IDENTITY(INT,1,1) AS N

INTO dbo.Tally

FROM Master.sys.ALL_Columns ac1

CROSS JOIN Master.sys.ALL_Columns ac2

;

--===== Add a CLUSTERED Primary Key to maximize performance

ALTER TABLE dbo.Tally

ADD CONSTRAINT PK_Tally_N

PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

;

--===== Allow the general public to use it

GRANT SELECT ON dbo.Tally TO PUBLIC

;

Using the Tally table above, this next bit of code shows that the execution plan can be quite deceiving. It appears to say that the first query will take 0% of the batch and the second query takes 100% when nearly the exact opposite is true.

/****************************************************************************************

Purpose:

This code demonstrates that the estimated and actual execution plans in SQL Server can

be 100% INCORRECT and that the execution plan should only be relied on to provide hints

as to what may be wrong with a query rather than an absolute indication. This code runs

in SQL Server 2005 and up only.

The code creates a temp table for 10 years worth of dates starting with 2000-01-01 using

two different methods. The first method uses a recursive CTE and the second method uses

an on-the-fly creation of a "Tally" table. The output of each method is directed to a

"throw-away" variable to take display delays out of the picture.

Please see the separate documentation included in this package for addition explanations

and a copy of the execution actual execution plan.

--Jeff Moden (28 Sep 2009)

****************************************************************************************/

SET NOCOUNT ON

--=======================================================================================

-- Recursive method shown by (Name with-held)

--=======================================================================================

PRINT '========== Recursive method =========='

--===== Turn on some performance counters ===============================================

SET STATISTICS IO ON

SET STATISTICS TIME ON

DECLARE @Bitbucket DATETIME --Holds display output so display times aren't measured.

--===== Execute the code being tested ===================================================

DECLARE @DateVal DATETIME

SET @DateVal = '2000-01-01'

;with mycte as

(

select @DateVal AS DateVal

union all

select DateVal + 1

from mycte

where DateVal + 1 < DATEADD(yy, 30, @DateVal)

)

select @Bitbucket = d.dateval

from mycte d

OPTION (MAXRECURSION 0)

--===== Turn off the performance counters and print a separator =========================

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

PRINT REPLICATE('=',90)

GO

--=======================================================================================

-- Tally table method by Jeff Moden

--=======================================================================================

PRINT '========== Tally table method =========='

--===== Turn on some performance counters ===============================================

SET STATISTICS IO ON

SET STATISTICS TIME ON

DECLARE @Bitbucket DATETIME --Holds display output so display times aren't measured.

--===== Execute the code being tested ===================================================

DECLARE @StartDate AS DATETIME

SET @StartDate = '2000-01-01'

SELECT TOP (DATEDIFF(dd,@StartDate,DATEADD(yy,30,@StartDate)))

@Bitbucket = @StartDate-1+t.N

FROM dbo.Tally t

ORDER BY N

--===== Turn off the performance counters and print a separator =========================

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

PRINT REPLICATE('=',90)

GO

Here are the run results. Notice that what was marked as 0% of batch takes almost all of the time and that which was marked as 100% of batch takes almost zero time.

========== Recursive method ==========

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

Table 'Worktable'. Scan count 2, logical reads 65749, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

[font="Arial Black"] CPU time = 93 ms, elapsed time = 109 ms.[/font]

==========================================================================================

========== Tally table method ==========

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

Table 'Tally'. Scan count 1, logical reads 20, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

[font="Arial Black"] CPU time = 0 ms, elapsed time = 2 ms.[/font]

==========================================================================================

The reason for that disparity in that last example is that the execution plan is for just the first iteration of the Recursive CTE.

The ONLY way to prove that one method is better than the other is to run them and measure them.

As to the examples you gave, I join the others in asking what proof the DA has than one should be better than the other. As a man much wiser than I once said, "A single test is worth a thousand expert opinions". With that in mind, here's a test for what you posted...

First, here's the code to build 2 test DBs and populate them with test data...

--=====================================================================================================================

-- Create two test DBs. These may be dropped in their entirety when done testing

--=====================================================================================================================

/***********************

DROP DATABASE TestDB;

DROP DATABASE TestDB1

DROP TABLE #TestData

***********************/

CREATE DATABASE TestDB;

CREATE DATABASE TestDB1;

GO

--=====================================================================================================================

-- Create the test data in each DB

--=====================================================================================================================

--===== Create a common collection of data so that the random names are the same for each ID later on.

SELECT TOP 1000000

CustomerID = IDENTITY(INT,1,1)

,CustomerName = CAST(NEWID() AS VARCHAR(36))

INTO #TestData

FROM master.sys.all_columns ac1

CROSS JOIN master.sys.all_columns ac2

;

--===== Randomly distribute half into one database

SELECT TOP 500000

*

INTO TestDB.dbo.Customer

FROM #TestData

ORDER BY NEWID()

;

--===== Randomly distribute half into the other database.

-- Note that this will be some overlap with the other database.

SELECT TOP 500000

*

INTO TestDB1.dbo.Customer

FROM #TestData

ORDER BY NEWID()

;

--===== Do a little cleanup

DROP TABLE #TestData

;

And here's the code to test the two methods...

--=====================================================================================================================

-- Test each method from the original post for performance

--=====================================================================================================================

SET STATISTICS TIME,IO ON

;

--===== The altered method

select a.customerid,

a.customerName

into NewTable

from(

select customerid,

customername

from TestDB.dbo.customer

union

select customerid,

customername

from TestDB1.dbo.customer

) a

;

SET STATISTICS TIME,IO OFF

;

DROP TABLE dbo.NewTable

SET STATISTICS TIME,IO ON

;

--===== Original method

select customerid,

customername

into NewTable1

from TestDB.dbo.customer

union

select customerid,

customername

from TestDB1.dbo.customer

;

SET STATISTICS TIME,IO OFF

;

DROP TABLE dbo.NewTable1

;

PRINT '============================================================================================='

GO 5

What I get on my machine is two nearly identical scripts taking turns winning except for the first iteration which doesn't count in this kind of testing.

Beginning execution loop