T-SQL Tuesday #37
This month please join us in the TSQL blog party that happens on the second tuesday of the month. It is hosted this month by Sebastian Meine (@sqlity).
Sebastien has a month long blog going this month all about Joins. You can read all about that fun in his post titled A Join A Day – Introduction.
This is a good topic. And I pondered what to write this month for a bit. I immediately went to a topic that I had on my toblog list. Unfortunately I had already covered that topic once upon a time. But with it being at the first thing my mind went to when thinking of this topic, I feel it would be good to re-hash it. I may change it up a bit, I may not – we’ll just have to see where this post takes us. So jump on board the Crazy Train for a little Salt N Peppa remix – Let’s Talk About Joins.
Business Requirement
I have some tables that I need to query. One of the tables has lookup information with a bitmask applied to the id field. Another table references this table but the ids can be a direct match or an indirect match to the id of the lookup table. For this case, they will always only differ by the value of 1 if the two values do not directly match. No other match possibility is considered for this example.
Based on this need, the solution dictates some sort of bitwise math. I have several examples of how a join could be written to accomplish this primary objective. I am only going to show the possible ways of performing this. In my environment these all yield the same results and the data is unique and large enough (4.2 million records). I will compare performance of these different queries in a later post as I demonstrate a query tuning method to drive the query down from nine seconds to 100ms or less. For the record, I would choose any of queries 5, 6, or 7 for this particular need based on plan cost and performance.
The JOINS
First up is the ANSI style INNER Join using addition in one of the conditions as well as an OR to fulfill the second portion of the business requirement.
PRINT 'Query 1 -- Join with Or and source+1'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV
INNER Join SourceType US
ON (PPV.SourceID = US.SourceID
Or PPV.SourceID = US.SourceID+1)
This is probably the easiest to understand and it performs well enough. Until running into this business requirement, I hadn’t considered putting an OR in the JOIN conditions. But it makes sense considering that an AND can be used there.
Next is a NON-ANSI style of JOIN.
PRINT 'Query 2 -- Non-Ansi Join with Or and source+1'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV, SourceType US
WHERE (PPV.SourceID = US.SourceID
Or PPV.SourceID = US.SourceID+1)
Through 2008 R2, this works just as well as the ANSI JOIN already shown. I haven’t tested in SQL 2012 but I do know that the NON-ANSI syntax of *= (for example) no longer works. I am not a big fan of this style JOIN because it is far too easy to end up with a Cartesian product.
Another type of JOIN that I like is the use of APPLY.
PRINT 'Query 3 -- Cross Apply with Or and source+1'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV
Cross Apply SourceType US
WHERE (PPV.SourceID = US.SourceID
Or PPV.SourceID = US.SourceID+1)
This particular code segment is the equivalent of the first query shown. This is the last in the set of using basic math and an OR in the JOIN conditions. The remaining queries all rely on bitwise operations to perform the JOIN. Again, until this particular need, I had never even considered using a bitwise operation in a JOIN. First in this series is the NON-ANSI style JOIN.
PRINT 'Query 4 -- Non-Ansi with COALESCE and Bit compare'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV, SourceType US
WHERE (PPV.SourceID|1 = COALESCE(US.SourceID|1,Us.SourceID))
The big change here is in the where clause. Notice the use of COALESCE and the first comparison value in that COALESCE. This is called a BITWISE OR. From MSDN: “The bits in the result are set to 1 if either or both bits (for the current bit being resolved) in the input expressions have a value of 1; if neither bit in the input expressions is 1, the bit in the result is set to 0.”
So I am comparing the bit values of 1 and the SourceID. The SourceID from RumorView will create a match meeting the requirements put forth thanks in large part to the BIT OR operation being performed on both sides of the equality in the WHERE clause. It is also worth mentioning that the COALESCE is completely unnecessary in this query but it I am leaving it as a pseudo reference point for the performance tuning article that will be based on these same queries.
Next on tap is the CROSS Apply version.
PRINT 'Query 5 -- Cross with Bit compare'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV
Cross Apply SourceType US
WHERE (PPV.SourceID|1 = US.SourceID|1)
And the last two queries that the optimizer equate to the same query.
PRINT 'Query 6 -- Join with Bit compare'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV
INNER Join SourceType US
ON (PPV.SourceID|1 = US.SourceID|1)
------
PRINT 'Query 7 -- Join with ISNULL and Bit compare'
------
SELECT TOP 1000
PPV.RumorID,PPV.PersonRumorID,PPV.PersonID
,US.Source,Us.SourceID, PPV.SourceID
FROMRumorView PPV
INNER Join SourceType US
ON (PPV.SourceID|1 = ISNULL(US.SourceID|1,Us.SourceID))
The query optimizer in this case is smart and eliminates the ISNULL. These two queries use the same exact plan, have the same cost and the same execution statistics. The version with COALESCE is considered more expensive and takes longer to run than these queries. It is also important to note that the Cross Apply Join also produces the exact same plan as these two queries.
Conclusion
So there you have it. Many different ways to write the JOIN for this little query. Performance and results may vary. It is good to have a few different ways of writing this particular JOIN. During my testing, it was evident that various methods performed better under different circumstances (such as how the indexes were configured – which will be discussed in the follow-up article).
Yes, I did re-use my previous post on this particular topic. That said, I want to add another tidbit.
This Post just Goes On and On
When Joining objects, you should take care as to the data type used in the Join. Without going into implicit conversions in the Joins, I just want to discuss briefly the impact of data type choice in a Join. This is where knowing your Data, the workload, and usage patterns is going to be an absolute necessity.
That said, you should not be surprised by a change in the performance of your queries if you were to change from a varchar() datatype to a varbinary() datatype. I’m not saying that the query is going to perform better – but that the performance may change. As an example, I was effectively able to turn a well performing query into a very poor performing query by changing from varchar to varbinary. On the flipside, I have seen the reverse also become true. It all boils down to proper data types for the data.