November 1, 2010 at 8:40 am
Greetings All,
Ok, some of you are complaining about the use of the LEFT JOIN. Please allow me to explain. When I run reports on this data, I am asked to show which records fall into certain date ranges and which fall into other date ranges. The final set of records ends up carrying some flags for reporting purposes. None of the records are to be excluded just because they don't fall within the date check, so I can't just throw them away in the script by doing a regular JOIN. A record that does not match one criterion may match another and so must still be included.
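Here is a rough sketch of the flagging idea, with made-up table and column names (NOT my real schema), just to show why every record has to survive the join:
-- Sketch only: illustrative table and column names, not my real schema.
-- Every order row is kept; the LEFT JOIN against the reporting range just sets a flag.
SELECT
    o.OrderID,
    o.OrderDate,
    CASE WHEN rr.RangeID IS NOT NULL THEN 1 ELSE 0 END AS InReportRange
FROM dbo.Orders o
LEFT JOIN dbo.ReportRanges rr
    ON o.OrderDate >= rr.StartDate
    AND o.OrderDate <= rr.EndDate;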
This example used temporary tables and created a calendar table. My regular script DOES NOT!!! That was done only to show the logic behind the idea and to create some test data. If you want to test this more thoroughly, you will need to create your own permanent table that has a start and end date in it and a ton of records. I also have to be careful what I post for scripts and data. As many of you already know, security and privacy are very important, so I need to show this with completely different tables and names while still showing the logic. I can't show you the real scripts or real data; it has to be completely different or else I get into serious trouble (losing my job and/or jail time). I really don't want to have to create a whole test database just to show one idea. If I did that, it would make the attachments on my articles a lot larger.
I am also dealing with 4 or 5 different tables that are linked together for my reports. These tables use "primary keys" that range from 4 to 7 fields in size. Not a lot of fun to write scripts against, in my book. But I inherited this database when I took this job from the prior DBA. I can't just change the layout of the database as I wish; I have to take what exists and make it work. Sort of like when the street crew has your main highway closed for construction and you have to find a way through the back roads to get to work. Your boss usually doesn't care about the transportation hurdles, he just wants you at your desk on time. You can't use the fast highway you want, so you have to be inventive with what you have.
This logic was only a small portion of a much larger script I had to write for a major report. But it was also the biggest slowdown in that report (the BETWEEN on the dates). This one change made the report run a lot faster; everything else I had tried, including adding several different indexes, did not. The report had to analyze the data on several levels, so sometimes it's not easy to find just the right index. When I did find a way that worked very well, I wanted to share that idea with others to help them too. Showing the original script would not have been useful, as this one idea would have been lost in the complexity of the entire script. So splitting it out into a smaller script to highlight the one idea, and adding some test data, was hopefully a bit better. If you have some ideas on how to write this better, then fine, please share. I will always be a learning DBA and am willing to accept constructive advice.
Have a good day.
Terry Steadman
November 1, 2010 at 8:44 am
GPO (11/1/2010)
If it was your job to design these tables, would you be tempted to have an end_date_time column that could not be null, and instead substitute a ridiculous future date for all events that have not yet ended (like 22000101)? Are there indexing or other performance advantages to making your end_date_time column not null?
I would not be the least bit tempted to do that. Fake data is the bane of my existence.
November 1, 2010 at 9:04 am
terrance.steadman (11/1/2010)
Ok, some of you are complaining about the use of the LEFT JOIN. Please allow me to explain. When I am running some reports on the data, I am requested to show which ones fall into certain date ranges and which ones fall into different date ranges. The final set of records will end up having some flags on them for reporting purposes. None of the records are to be excluded just because they don't fall within the date check. So, I can't just throw them away in the script by doing a regular JOIN.
But you ARE doing a 'regular' join in your "new" query. You're LEFT JOINing on one end of the date range and effectively INNER JOINing on the other. That's the entire point. The two queries don't yield identical results in all cases. And you're surprised that they have differing run times? The entire test (and the story resulting from it) is meaningless.
Does anyone verify these stories before they're posted?
November 1, 2010 at 9:05 am
Hi Terry,
Thanks for the article!
In a data warehousing environment with larger data volumes I am always looking for ways to enhance query performance. I'll have to put your approach to the test and see how it works.
Thanks again.
Moe M
Database Consultant
http://www.cubeangle.com
November 1, 2010 at 9:10 am
I know it's a big no-no, but I have a few tables for logging/admin duties, and I just created the clustered index on the date column for those. Most of the toughest queries on those tables select the last few days of data or data between two dates. Even when that isn't the pattern, putting the clustered index on the date column also causes records that are joined between two tables to land on the same pages, making the join faster as well.
Why not just cluster on the date column instead of creating a date table? It doesn't work in all cases, like when you need to calculate specific weeks and days, but if you need to select a few days of data it's a pretty good solution.
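A rough sketch of the kind of setup I mean (example names only, not a real system):
-- Sketch only: an example log table clustered on its date column.
CREATE TABLE dbo.AppLog
(
    LogDate datetime NOT NULL,
    LogID int IDENTITY NOT NULL,
    Message varchar(500) NOT NULL
);
CREATE CLUSTERED INDEX CIX_AppLog_LogDate ON dbo.AppLog (LogDate);
-- Typical query: the last few days come back as one contiguous range scan.
SELECT LogDate, Message
FROM dbo.AppLog
WHERE LogDate >= DATEADD(day, -3, GETDATE());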
November 1, 2010 at 9:14 am
If it was your job to design these tables, would you be tempted to have an end_date_time column that could not be null, and instead substitute a ridiculous future date for all events that have not yet ended (like 22000101)?
I would consider using a sentinel date such as the maximum T-SQL datetime (9999-12-31) for two reasons: 1) your queries become simpler since you don't always need that "or end_date is null" part of the clause, and 2) you can create an index on end_date since it is not nullable.
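For example, the difference in query shape is roughly this (table and column names are illustrative only):
-- Illustrative names only.
DECLARE @asOfDate datetime;
SET @asOfDate = GETDATE();
-- With a nullable end_date, open-ended rows need an extra OR:
SELECT *
FROM dbo.Memberships
WHERE start_date <= @asOfDate
AND (end_date >= @asOfDate OR end_date IS NULL);
-- With a NOT NULL end_date defaulted to a sentinel such as '9999-12-31',
-- the predicate stays a plain range check:
SELECT *
FROM dbo.Memberships
WHERE start_date <= @asOfDate
AND end_date >= @asOfDate;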
November 1, 2010 at 9:18 am
Terry,
Here's a short, simple script that shows your "improved" query doesn't yield the same results as the original. The first query returns 3 rows; the second only two, for the reasons I explained in my previous post.
CREATE TABLE Test1
(
Date datetime
)
INSERT INTO Test1 SELECT '01/01/2010'
UNION ALL SELECT '02/01/2010'
UNION ALL SELECT '03/01/2010'
CREATE TABLE Test2
(
startDate datetime,
endDate datetime
)
INSERT INTO Test2 SELECT '01/01/2010', '02/01/2010'
-- Query 1: Returns 3 rows
SELECT *
FROM Test1 LEFT JOIN Test2 ON Date BETWEEN startDate AND endDate
-- Query 2: Returns only 2 rows: a portion of the join predicate is now an inner join, not a left join
SELECT *
FROM Test1 LEFT JOIN Test2 ON Date >= startDate
WHERE
(
CASE
WHEN Date <= endDate THEN 1
ELSE 0
END
) = 1
November 1, 2010 at 9:24 am
david.avraamides (11/1/2010)
I would consider using a sentinel date such as the maximum T-SQL datetime (9999-12-31) for two reasons: 1) your queries become simpler since you don't always need that "or end_date is null" part of the clause, and 2) you can create an index on end_date since it is not nullable.
Eh? You can index on a nullable column.
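For instance, both of these are perfectly legal even though end_date allows NULLs (illustrative names; on 2008 a filtered index can even target just the open-ended rows):
-- Illustrative names only: a plain index on a nullable column works fine.
CREATE INDEX IX_Memberships_end_date ON dbo.Memberships (end_date);
-- SQL Server 2008+ also allows a filtered index for just the open-ended rows.
CREATE INDEX IX_Memberships_open ON dbo.Memberships (start_date)
WHERE end_date IS NULL;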
November 1, 2010 at 11:09 am
I too had a process that took 16 minutes using a standard BETWEEN. Using your method resulted in 34 minutes (using 2008).
As an example, I had a script that was taking over 5 minutes to process. Adding this one change to the script reduced its processing time down to 30 seconds. Definitely a happy increase in speed. When I ran this test script 3 times, I had process times of 5 seconds, 3 seconds, and 3 seconds respectively.
Thank you for your alternative method, but I might suggest you display execution plans in the future to prove your findings. Also, your quote above mentions going from 30 seconds, then to 5, 3, 3. I'm sure you're aware that you likely have cached results there.
November 1, 2010 at 11:13 am
I agree with the previous comments: your second query is no longer a real LEFT JOIN and won't always return the same results as the first query. Maybe in your case it does, but given your comment about not leaving rows out, you might have a problem.
You might as well make the query an INNER join so that it is more readable; you get the same results:
SELECT
*
FROM Test1
INNER JOIN Test2
ON Date >= startDate
AND CASE
WHEN Date <= endDate THEN 1
ELSE 0
END = 1
If you really want the left join you could try this:
SELECT
*
FROM Test1
LEFT JOIN Test2
ON Date >= startDate
AND CASE
WHEN Date <= endDate THEN 1
ELSE 0
END = 1
But I suspect it won't get the speed benefit you want.
November 1, 2010 at 12:32 pm
UMG Developer (11/1/2010)
You might as well make the query an INNER join so that it is more readable, you get the same results:
Actually, that doesn't give the same results either. He's split the join predicate into two halves; one of which uses an outer join, the other an inner. Clearly not the results he wanted, and one that invalidates the "optimization" itself. Moving to an inner join is the entire reason the query runs faster; it has nothing at all to do with him splitting the BETWEEN into two checks (SQL Server's engine does that internally anyway).
I'll also add that the boolean substitution he performs is entirely spurious, and further serves to cloud the issues here.
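To be clear about that last point (illustrative table name, not from the article), BETWEEN is just shorthand for the pair of comparisons, so splitting it buys nothing on its own:
-- Illustrative only: these two predicates are interchangeable; the optimizer
-- expands BETWEEN into the pair of range comparisons by itself.
DECLARE @startDate datetime, @endDate datetime;
SELECT @startDate = '01/01/2010', @endDate = '02/01/2010';
SELECT * FROM dbo.Orders WHERE OrderDate BETWEEN @startDate AND @endDate;
SELECT * FROM dbo.Orders WHERE OrderDate >= @startDate AND OrderDate <= @endDate;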
November 1, 2010 at 12:36 pm
ebaya (11/1/2010)
UMG Developer (11/1/2010)
You might as well make the query an INNER join so that it is more readable, you get the same results:
Actually, that doesn't give the same results either. He's split the join predicate into two halves; one of which uses an outer join, the other an inner. Clearly not the results he wanted, and one that invalidates the "optimization" itself. Moving to an inner join is the entire reason the query runs faster; it has nothing at all to do with him splitting the BETWEEN into two checks (SQL Server's engine does that internally anyway).
I'll also add that the boolean substitution he performs is entirely spurious, and further serves to cloud the issues here.
I was trying to say that the INNER JOIN version is the same as his second/alternate query and gives the same result as that. It certainly doesn't give the same results as the original query.
November 1, 2010 at 1:31 pm
Did you clear the buffers between tests? Using: DBCC DROPCLEANBUFFERS
If not, it would explain your "performance gain", as the result is already in memory. Your "faster" query would perform a table scan every time in real life, as your condition is an expression... and a CASE statement at that.
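Something along these lines before each timed run keeps the comparison honest (a sketch only; run it on a test server, since dropping the buffers affects everyone on the instance):
-- Flush dirty pages, then empty the buffer pool and plan cache so each
-- timed run starts cold. Test servers only.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- ...run the query being measured here...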
I agree with the many other replies; with proper indexing, between/and or a simple >=/<= to test the range is extremely performant. We have a table with 80,000,000 rows and growing that is always queried by a date range, and the queries execute in 1-3ms. If partitioned, it's less than 1ms.
November 1, 2010 at 1:42 pm
I would agree with earlier comments that if your table is indexed properly, dates shouldn't be an issue. Our main order table has many tens of millions of rows, and the Order Date field is one of the indexed columns we use in many reports for BETWEEN checks - and speed is never an issue.
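Roughly what that looks like (names invented here, not our real schema); with the index in place the BETWEEN resolves to an index seek on the date range:
-- Illustrative names only.
CREATE INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);
SELECT OrderID, OrderDate, CustomerID
FROM dbo.Orders
WHERE OrderDate BETWEEN '2010-01-01' AND '2010-01-31';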
A novel approach perhaps, but I would rather have to debug a standard BETWEEN statement than something more unusual.
November 1, 2010 at 2:34 pm
jpenniman (11/1/2010)
Did you clear the buffers between tests? Using: DBCC DROPCLEANBUFFERS. If not, it would explain your "performance gain"
His illusory performance gain is explained entirely by the fact that it's a different query, one that will return different results. You don't need to invoke any buffering or anything else to explain it.
...as the result is already in memory. Your "faster" query would perform a table scan every time in real life, as your condition is an expression... and a case statement at that.
Eh? You can buffer an expression as you can anything else. You simply can't index off it (in technical terms, it's not SARGable).
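To illustrate the distinction (made-up table, not from the article): the first predicate buries the column inside an expression, so an index on OrderDate can't be used for a seek; the second says the same thing in a SARGable form:
-- Illustrative names only.
DECLARE @endDate datetime;
SET @endDate = '02/01/2010';
-- Non-SARGable: the CASE expression hides the column from the index.
SELECT * FROM dbo.Orders
WHERE CASE WHEN OrderDate <= @endDate THEN 1 ELSE 0 END = 1;
-- SARGable: a bare comparison on the column lets the optimizer seek an index.
SELECT * FROM dbo.Orders
WHERE OrderDate <= @endDate;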