Have you ever run into an error and been puzzled as to why that error occurred?
Recently I have been working on a project to backfill a development and QA environment. These environments are essential in any database environment to help ensure the production database will continue to work after changes have been made. These environments existed once upon a time but had been sorely neglected and process was not followed.
For this process, I am using some software to help generate diff scripts between the environments. Once the script has been generated I will execute that script in the appropriate environment (DEV or QA). I am doing this process for all tables, procs, views, and functions. This is also done iteratively database by database.
All-Stars
While performing this routine, occasionally an error might be encountered due to an order of processing that may need to be performed. In cases such as that, I would typically alter my process to account for this error. Usually that is not an issue if the tables are first created, then the views, and then the procs and functions. Other times, the create/alter script must be evaluated to determine what is causing the failure.
One of these failures that caused me a lot of angst was producing the following error in a few stored procedures.
Msg 205, Level 16, State 1, Line 4
All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists.
That error message is fair enough to evaluate. Based on the error one should expect that the stored procedure contains one of those keywords. I picked a stored procedure out of the mix that was generating this error and checked the code for it. Looking through the entire proc, there was no such instance of those keywords. This was temporarily puzzling. I continued to investigate and found that there was a view being used in that stored procedure. I decided to double-check the view and see if it might be the cause. Inside the view I see something like the following:
SELECT *
FROM dbo.Table1 WITH (nolock)
WHERE AccrueThruDate between '2010/01/31' and '2010/12/31'
UNION ALL
SELECT *
FROM dbo.Table2 WITH (nolock)
WHERE AccrueThruDate between '2010/01/31' and '2010/12/31'
UNION ALL
SELECT *
FROM Db2.dbo.Table1 WITH (nolock)
At first sight, I see a few things I don’t like but didn’t immediately pick up on the table in the third select. I saw the table name and moved on thinking it must be ok. Then I decided to sanity check more precisely what was there and saw that it was pointing to the same table name in a different database. Low and behold in the other database the table did not have the same number of columns. These tables are all pretty wide at 100+ columns each. The designer thought that every single column from each of these tables was needed and decided to take a shortcut. Had the columns been explicitly named, I would have gotten a more useful error message, in this case, such as the following:
Msg 207, Level 16, State 1, Line 3
Invalid column name 'Column102'.
That little change in message could save time when troubleshooting a problem such as what I have described. Granted, it does not tell me which database or table but at least I know which column is missing. This is one little reason why I like to explicitly name the columns in my Select statements.
Another reason why I like to use the column names instead of a ” * ” is that I have seen queries improve dramatically in performance speed when compared. In this little case, even though there are 100+ columns from each table to list out and all of the data would be returned from each of these, I see a 700 ms improvement in return time on just the “top 1000″ records. When I bump that up to the full result set required by the view, I tire of waiting for the results from either query. Both are terrificly slow (takes longer than 15 minutes to return the full result set).
But But But…
So why do people use Select * in their code? It is a lot easier to type than having to type all of the column names. Because it is fewer keystrokes it would also save time when writing code. Many times, we may also use it when trying to get a glimpse of the system really quick. It is also oftentimes used when doing a demo.
Should it be used? There are cases for using that kind of code such as in an “if exists” block. There are cases where the performance impact is minimal or just as good as listing the columns so it is fine. I use it when I am trying to do a quick and dirty statement while troubleshooting. I don’t like to put that into code to be consumed by applications or reused. I like to know exactly what columns are being returned by my queries and I don’t feel it necessary to return more data than necessary (and consequently consume more resources).
What are some cases you have found a select * to be better than explicitly naming the columns?