In Vs. Inner Join
This is tightly related to another of my forays into tuning some slowly/poorly performing processes. This process came across my desk due to some reporting inaccuracies. The report is derived from data in a downstream process that is based on the data provided from this top level process. More precisely, this is an ETL job that pumps data into the warehouse for consumption by several other downstream ETL jobs and reports. In order to effectively troubleshoot, I wanted to follow the data all the way back to the source. I had to verify that all was working correctly from end to end in the process chain for this report. I ended up tuning a script that was a critical piece of this ETL process. (Oh and the data inconsistency was also corrected here and in the downstream report – there were a couple of consistency problems).
So, let’s start with examples of some of the problem areas with the script. First, it took an average of 12 minutes to execute. This slowness was consistent whether executed from the dts package or from SSMS. Second, it was not a stored proc. This particular script is used in more than one location. If any changes need to be made to the logic, one must find all locations rather than edit the stored proc. Third, it relied on sub-queries and correlated sub-queries. Fourth, the use of date comparison operators in the Join statements. These were additional Join conditions that I needed to test for relevance and impact on the result set. That should be enough to discuss about this script. Fifth, persistent use of the NOLOCK hint. In this scenario, it was used as an optimization ploy. There were other “problems” with the rest of the process related to this script, suffice it to say I corrected some of those issues too and am working to correct the rest.
I am going begin with the third problem stated. The first issue will lend itself to being resolved by the end of this article as a part of the overall process. The second issue will be the last item treated and should just be understood by the end that I put the query into a proc and altered the dts package to call the proc instead of running the script.
Sub-Queries
What is a sub-query? A sub-query is a query that is found attached to a Join clause or in the Where clause. It is, as the name implies, a query that is run as a part of an outer/parent query. From msdn, we learn that:
“many queries can be evaluated by executing the subquery once and substituting the resulting value or values into the WHERE clause of the outer query.”
This doesn’t sound so bad. The query could be executed once and then return the necessary results to the where or join conditions of the outer query. In my scenario, the sub-queries are run twice each. They are a part of a Union select statement and the same query will appear on both sides of the Union.
LEFT JOIN (SELECT SSN, avg(PBO) PBO, min(ConDate) ConDate
, cast(convert(varchar(8),min(RecordDate),112) AS smalldatetime) RecordDate
FROM dbo.DOE_CONCampaign_Funded (nolock)
WHERE ServicerID = 1234
GROUP BY SSN) f2 ON t.SSN = f2.SSN AND ss.SubmitDate < f2.ConDate
In this example we see a few of the items that I listed. Since I am running this on a SQL Server 2005 box, I decided to move each of those sub-queries to a CTE. This change gave me an improvement in time of about three minutes. (Even as I test currently, I am seeing that the original query is now taking in excess of 23 minutes – ouch!)
Now on to that
Union statement. I modified the Union to be a Union All. Without specifying the All, a Distinct is performed on the query being performed. I needed to determine first if I truly needed to perform a Distinct operation, or if I could use the Union All statement without altering the result-set. Testing both versions, I ended up with the same result set. Based on this I am pretty confident that I can proceed with using Union ALL and producing another time savings of about 1 minute on this query. This is good news.
Correlated Sub-Queries
Next up is the use of the correlated subqueries. This query uses a correlated subquery on each side of the union statement. The correlated sub-query looks like the following for the top half of the statement.
WHEREc.CampaignID IN (1, 2, 3, 4, 5, 6, 7, 8, 9)
I recently had to add the last 5 campaigns to this list. I see a couple of problems with this. First off, it is cumbersome to edit code to fix these hard-coded values. Some code somewhere will likely get missed. The second pain point was the slowness this caused my query. The resolution was to create a “Control” table to which these values would be added. The code would be corrected in all locations and would use a Join instead of the correlated sub-query. What does this buy me? My performance now becomes very consistent and is less than 20 seconds without any further indexing. One more thing to note here is that the IO Stats are a going to be different for multiple reasons at this point. The original IO Stats above only show the first four campaigns listed in the correlated Subquery. I am currently working with the entire set of Campaigns therein listed.
The replacement for the correlated sub-query is an Inner Join back to the control table.
INNER JOIN dbo.LCComm_CampaignControl cc
ON c.campaignid = cc.campaignid
AND cc.campaigntype = 1
This is the top half of the Union Select statement. The bottom half is slightly different. The table has an additional field to help designate type of campaign. This means less code rework and easier maintainability on my part. If I want to exclude a campaign from the query – remove it from the table.
Extraneous Join Conditions
In this query there is the use of datefields in the Outer Join clauses. The dates used are comparisons only and needed to be tested for contribution to the query. At this point the testing is simple – run the query as is and then rerun with each of those uses of date comparisons commented out in the Join Clauses.
As-is query
LEFT OUTER JOINq
ON t.SSN = q.accstrSSN
AND ss.SubmitDate < q.LVCRequestDate
Changed query
LEFT OUTER JOINq
ON t.SSN = q.accstrSSN
--And ss.SubmitDate < q.LVCRequestDate
I was a little surprised to see that both forms of the query returned the same result set. In light of that, the fact that there appears to be no visible performance gain by eliminating these Join Conditions is not astonishing to me. Thus I can leave them or eliminate them – they don’t add or subtract from the query performance for my current purposes. I decided to remove them anyway. At this point, the IO Stats remain unchanged from the previous sampling. I would now need to examine indexing to improve performance.
Conclusion
This exercise is yet another one of those opportunities to performance tune a process. This one had been left alone because it “was working” and people wanted to leave well enough alone. When processing takes that long, well enough is not good enough. The systems need to be able to move on to other processing tasks and use the resources in a more efficient manner than to run the same process for an excess of 12 minutes (24 minutes today during testing to get data for this article). This is only a piece of the puzzle to tuning this particular process. This query was tuned, altered, put into a stored proc, and the nolock hints removed as some of the adjustments. In the end, it returns the correct data set and does it in 1/72nd of the time it took earlier today and 1/36th the average time. I would say that is substantial cost savings.