November 15, 2007 at 9:49 am
The following 'treatise' comes about as a result of a gut instinct that the mechanism employed by a developer (name withheld to protect the guilty) was suboptimal, and that there must be a better, faster, stronger way to do the same thing.
The premise:
Create a single query with multiple filter criteria that returns rows matching the criteria that have specified values, while treating any criterion whose parameter is NULL as "match any value."
If a table has columns A & B and the filter is A=@A, B=@B:
When @A=123, @B=456: returns rows WHERE A=123 AND B=456
When @A=123, @B=NULL: returns rows WHERE A=123 AND B=any value
It is a type of generic query that is useful for generating result sets for reports, lists, etc. when multiple criteria MAY be used but not necessarily, and it precludes coding separate queries for each possible combination of filter criteria.
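As a quick illustration of the pattern before the full test script below (the table name here is hypothetical; the parameter names mirror the premise above):

```sql
-- Catch-all pattern: a NULL parameter means "match any value"
SELECT *
FROM dbo.SomeTable              -- hypothetical table with columns A and B
WHERE (A = @A OR @A IS NULL)
  AND (B = @B OR @B IS NULL)
```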
A little web searching and posts within this forum produced four basic static (i.e. not dynamic SQL) queries that generated the desired results. I am sure there are more variations, algorithms and mechanisms. I am sure they will magically find their way here, too! 😉
Disclaimer: This is a quick & dirty. No warranty written or implied. Demo use only. Don't try this at home. No user serviceable parts inside -- get the idea?
Build the test table:
-- Run this as is on a NON-PRODUCTION server
-- Use the SELECT'd values of A, B & C as the variable values for @varA, @varB & @varC, respectively, in the second code section.
-- NULL each variable and combinations of variables to alter the query plan.
-- Compare the estimated query plans - note the index used in each
-- Drop indexes (commands at end of this script) to alter the plans & rerun queries
-- This may/should be executed several times to generate new values
-- The value of @maxRows may be altered to generate larger/smaller tables
USE tempdb
go
RAISERROR('This may take a moment...', 10, 1) WITH NOWAIT
go
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
SET NOCOUNT ON
IF OBJECT_ID('dbo.tempTbl') IS NOT NULL
DROP TABLE dbo.tempTbl
CREATE TABLE dbo.tempTbl
(
PK int IDENTITY(1,1) NOT NULL,
A int NOT NULL,
B int NOT NULL,
C int NOT NULL,
CONSTRAINT PK_tempTbl
PRIMARY KEY CLUSTERED (PK ASC)
)
DECLARE
@maxRows int,
@rndA int,
@rndB int,
@rndC int,
@cnt int
SET @cnt = 0
-- Alter @maxrows to change the table row count
SET @maxRows = 50000
-- slap some values in the table - not efficient but it works
WHILE @cnt < @maxRows
BEGIN
SET @rndA = CAST(RAND() * 101010.0 AS int)
SET @rndB = CAST(RAND() * RAND(@rndA) * 3633027.3 AS int)
SET @rndC = CAST(RAND() * RAND(@rndB) * 211.7 AS int)
INSERT INTO dbo.tempTbl
(
A,
B,
C
)
VALUES
(
@rndA,
@rndB,
@rndC
)
SET @cnt = @cnt + 1
END
-- Index this puppy six ways from sunday!
CREATE NONCLUSTERED INDEX A
ON TempTbl (A)
CREATE NONCLUSTERED INDEX B
ON TempTbl (B)
CREATE NONCLUSTERED INDEX C
ON TempTbl (C)
CREATE NONCLUSTERED INDEX AB
ON TempTbl (A, B)
CREATE NONCLUSTERED INDEX AC
ON TempTbl (A, C)
CREATE NONCLUSTERED INDEX BA
ON TempTbl (B, A)
CREATE NONCLUSTERED INDEX BC
ON TempTbl (B, C)
CREATE NONCLUSTERED INDEX CA
ON TempTbl (C, A)
CREATE NONCLUSTERED INDEX CB
ON TempTbl (C, B)
CREATE NONCLUSTERED INDEX ABC
ON TempTbl (A, B, C)
CREATE NONCLUSTERED INDEX ACB
ON TempTbl (A, C, B)
CREATE NONCLUSTERED INDEX BCA
ON TempTbl (B, C, A)
CREATE NONCLUSTERED INDEX BAC
ON TempTbl (B, A, C)
CREATE NONCLUSTERED INDEX CAB
ON TempTbl (C, A, B)
CREATE NONCLUSTERED INDEX CBA
ON TempTbl (C, B, A)
SELECT TOP 1 *
FROM tempTbl
WHERE pk = CAST(RAND() * 50000.0 AS int)
-- Drop indexes individually & rerun query
-- DROP INDEX dbo.tempTbl.A
-- DROP INDEX dbo.tempTbl.B
-- DROP INDEX dbo.tempTbl.C
-- DROP INDEX dbo.tempTbl.AB
-- DROP INDEX dbo.tempTbl.AC
-- DROP INDEX dbo.tempTbl.BA
-- DROP INDEX dbo.tempTbl.BC
-- DROP INDEX dbo.tempTbl.CA
-- DROP INDEX dbo.tempTbl.CB
-- DROP INDEX dbo.tempTbl.ABC
-- DROP INDEX dbo.tempTbl.ACB
-- DROP INDEX dbo.tempTbl.BAC
-- DROP INDEX dbo.tempTbl.BCA
-- DROP INDEX dbo.tempTbl.CAB
-- DROP INDEX dbo.tempTbl.CBA
---------------------------
-- End of test table build
---------------------------
OK, run the above to generate the test table. While that's going, you can copy the below into a new Query Analyzer window. When the table generation is complete, it will return a row. Use the values (or rerun the SELECT at the bottom for new values) to assign to the variables below:
-- NULL each variable and combinations of variables to alter the query plan.
-- Compare the estimated query plans (note: a third option was added just for kicks)
-- Also execute the queries proper & note the IO statistics on the messages tab
--
SET NOCOUNT OFF
USE tempdb
DECLARE
@varA int,
@varB int,
@varC int
-- replace the values below with those returned by the table generation script
SET @varA = 94240
SET @varB = 1486844
SET @varC = 14
SELECT
*
FROM tempTbl
WHERE
(A = @varA OR @varA IS NULL) AND
(B = @varB OR @varB IS NULL) AND
(C = @varC OR @varC IS NULL)
SELECT
*
FROM tempTbl
WHERE
(A = @varA AND @varA IS NOT NULL) AND
(B = @varB AND @varB IS NOT NULL) AND
(C = @varC AND @varC IS NOT NULL)
SELECT
*
FROM tempTbl
WHERE
A = COALESCE(@varA, A) AND
B = COALESCE(@varB, B) AND
C = COALESCE(@varC, C)
SELECT
*
FROM tempTbl
WHERE
A = ISNULL(@varA, A) AND
B = ISNULL(@varB, B) AND
C = ISNULL(@varC, C)
-----------------------------------
-- End of test script
-----------------------------------
OK, the table has been generated and the values returned copied to their respective variables, right?
Now click the "Display Estimated Execution Plan" button and review the results. Note the index used by each query. Note the differences in the execution paths. Note the difference in the COALESCE() vs the ISNULL() versions - not terribly surprising, I guess...
Now start playing with the indices. Drop the index used by the 'most efficient' query and check the plans again. Note any trends in the query plans. Repeat this delete-rerun process until you are exhausted, lunch time or the boss wants you to give an end-user the sa password (whichever comes first).
In my fiddling with it, one of the queries consistently beat the others by approximately 20x until all of the indices were dropped. The IS NOT NULL version...
I did not try sp_executesql or EXEC query versions, but it would be interesting to see how they fared.
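For anyone who wants to try that comparison, a minimal sp_executesql sketch might look like the following (untested here; it assumes the tempTbl table and @varA/@varB/@varC variables from the test script above). It builds only the predicates that are actually needed, so each parameter combination gets its own plan:

```sql
-- Hypothetical dynamic-SQL version of the catch-all query
DECLARE @sql nvarchar(4000)
SET @sql = N'SELECT * FROM dbo.tempTbl WHERE 1 = 1'
-- Append a predicate only for the parameters that were supplied
IF @varA IS NOT NULL SET @sql = @sql + N' AND A = @varA'
IF @varB IS NOT NULL SET @sql = @sql + N' AND B = @varB'
IF @varC IS NOT NULL SET @sql = @sql + N' AND C = @varC'
-- Parameterised execution, so each distinct WHERE shape gets its own cached plan
EXEC sp_executesql @sql,
    N'@varA int, @varB int, @varC int',
    @varA, @varB, @varC
```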
Play on!
Art
November 16, 2007 at 2:13 am
Most interesting ...
I guess the most likely indexing regime would be a single index per column (i.e. indexes A, B and C). In that mode, the IS NOT NULL method's cost is only about 1.4% of the batch in my testing, and the IS NULL and IsNull() methods are almost identical (a 0.01% difference, which could well be a rounding issue). COALESCE was marginally worse. Looking at the detail of the clustered index scan that still takes place on the PK, it would seem that this might be because T-SQL does not natively understand COALESCE: it seems to have translated it into IFs, whereas IsNull() is shown as just that. That would fit with MS's view of the world: why natively support part of the SQL standard rather than your own proprietary function!
But the result is that the conveniently neat shorthand of the IsNull() method must be replaced by the wordier but massively more efficient IS NOT NULL one.
Thanks for taking the time to set up the tests.
James Horsley
Workflow Consulting Limited
November 16, 2007 at 7:00 am
The problem with all of those 'all-in-one' type queries is that they make query plan reuse a very dodgy affair. What may be suitable for one set of parameters often isn't for another set.
The construct (ColA = @a OR @a IS NULL) AND (ColB = @b OR @b IS NULL) AND ... and similar ones cannot be properly analysed by the optimiser, and it tends to assume that the second half of the OR is true for all rows, or false for all. As you can imagine, this doesn't lend itself well to optimal plans.
(btw, the second query gives a different result from the others for the case where @varA and @varB have values and @varC is null. It returns 0 rows whereas the other 3 return 310 in my test)
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 16, 2007 at 9:02 am
There is another issue with the IS NOT NULL version too. NULLing all variables returns an empty set.
Back to the drawing board...glad I didn't erase it!
November 16, 2007 at 9:28 am
No wonder it was faster ....
If any of the variables are NULL, it will be doing a compare with a NULL, which is a no-no -- perhaps this simple switch would make a difference (I have trashed the test table already my end...)
SELECT
*
FROM tempTbl
WHERE
(@varA IS NOT NULL AND A = @varA) AND
(@varB IS NOT NULL AND B = @varB) AND
(@varC IS NOT NULL AND C = @varC)
As that would possibly not bother doing the A = @varA comparison if it has already seen that @varA was NULL.
James Horsley
Workflow Consulting Limited
November 19, 2007 at 12:01 am
James Horsley (11/16/2007)
No wonder it was faster ... If any of the variables are NULL, it will be doing a compare with a NULL, which is a no-no -- perhaps this simple switch would make a difference (I have trashed the test table already my end...)
Same thing. If any of the variables are NULL, the resultset has no records.
As that would possibly not bother doing the A=@VarA if it has already seen that @VarA was Null
SQL doesn't use short-circuit evaluation of the form that some programming languages do. The order that criteria are specified in the where clause has no effect on the order in which they are evaluated. The evaluation order depends on the available indexes, the distribution of data and the plan that the optimiser comes up with, based on those.
It's a fairly common thing I hear, about putting constant expressions first in the where clause, or putting the null comparisons first. I wrote a more detailed piece on this, complete with exec plans: Order of execution
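A quick way to see this for yourself with the test table above: compare the estimated plans for two logically identical queries whose predicates are written in opposite order. The literal values here are arbitrary; the point is that the optimiser is free to evaluate either predicate first in both cases, so the plans should match.

```sql
-- Predicate order reversed between the two queries; plans should be identical
SELECT * FROM dbo.tempTbl WHERE A = 123 AND B = 456
SELECT * FROM dbo.tempTbl WHERE B = 456 AND A = 123
```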
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 19, 2007 at 3:28 am
So MS could add a performance boost in SQL2009 by short circuiting ...
James Horsley
Workflow Consulting Limited
November 21, 2007 at 9:36 am
I haven't given up but it appears my gut instinct cannot be satisfied (yet)...
I snagged the 'IS NOT NULL' from the URL mentioned above (http://www.sommarskog.se/dyn-search.html). Two problems surfaced with that. The first is that the code provided required at least one of the filter parameters to be populated. I was playing with that code in order to allow it to return all rows if all of the parameters were NULL.
The second problem was that the code I snipped for the test script was the wrong code (doh!). In order to make the query work the WHERE clause needs to take into consideration, via a separate comparison, the condition when all parameters are NULL. The corrected clause is:
WHERE
(A = @varA AND @varA IS NOT NULL) OR
(B = @varB AND @varB IS NOT NULL) OR
(C = @varC AND @varC IS NOT NULL) OR
(@varA IS NULL AND @varB IS NULL AND @varC IS NULL)
On SQL2000, this change made the query fall behind the original IS NULL version. Interestingly, on SQL2005, all four versions were equal regardless of indexing.
Another irritating but probably not as critical issue is the distribution of values within the test table. For any value in column A, the values in B and C would be the same for all rows.
The new testTbl row generation I'm using produces a better distribution:
DECLARE
@maxRows int,
@rndA int,
@rndB int,
@rndC int,
@incVal int
SET @incVal = 1
-- Alter @maxrows to change the table row count
SET @maxRows = 50000
-- slap some values in the table
INSERT INTO dbo.tempTbl (A, B, C)
SELECT 0, 0, 0
WHILE @incVal <= @maxRows
BEGIN
INSERT
INTO dbo.tempTbl (A, B, C)
SELECT 0, 0, 0
FROM
(
SELECT PK + 1 dtCol
FROM dbo.tempTbl
WHERE PK <= @maxRows - @incVal) dt
SET @incVal = @incVal * 2
END
UPDATE dbo.tempTbl
SET
A = FLOOR(RAND(PK) * 633027.3),
B = FLOOR(1 / (RAND(PK) / PK)),
C = FLOOR((RAND() / PK) * 1111111.1)
Of course, I have not beaten this horse sufficiently and will continue to do so until I either have a better solution or dog food...
Art
November 21, 2007 at 11:15 am
R2ro (11/21/2007)
...WHERE
(A = @varA AND @varA IS NOT NULL) OR --line1
(B = @varB AND @varB IS NOT NULL) OR --line2
(C = @varC AND @varC IS NOT NULL) OR
(@varA IS NULL AND @varB IS NULL AND @varC IS NULL)
....
Art
This STILL doesn't meet your initial criteria because the logic is entirely wrong. What happens when only @varC is null? You get all records that match either line1 or line2, but not necessarily both. This particular horse is lame there, doc; time to apply "equine medicine", to quote a Far Side strip of yore....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 21, 2007 at 11:34 pm
James Horsley (11/19/2007)
So MS could add a performance boost in SQL2009 by short circuiting ...
I don't think so.
Bear in mind that SQL is a declarative language, not a procedural one. You state what you want and the engine takes care of the details of how. The order of evaluation is part of the how. In a procedural language, that's the responsibility of the programmer. In a declarative language it is not.
The query engine does do a form of short circuiting, evaluating some criteria before others. Only rows that satisfy the earlier applied conditions are evaluated for later ones. It's just that the query writer has little control over which criteria get evaluated in which order.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 22, 2007 at 2:19 am
I don't think so.
Surely it isn't beyond the realms of what the optimiser is doing ... but as databases are one area where MS does have real competition, I guess if it was easy it would have been done!
James Horsley
Workflow Consulting Limited
November 24, 2007 at 5:24 am
So that we're on the same page, what are you thinking about when you say short-circuit?
What C++ (and other C-like languages) have, where if the first expression in a condition determines the outcome, the second is not evaluated?
Something else?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 24, 2007 at 5:42 pm
Yes, that's just what I mean - so if you have, say, IF (A!=B OR C!=D), then there is no need to evaluate C!=D if A did not equal B, because we are OR'ing.
James Horsley
Workflow Consulting Limited
November 25, 2007 at 12:15 pm
But that means that the order of statements in the where clause now becomes important. That will absolutely cripple the optimiser as it will not be allowed to look for plans that evaluate the third statement (say of 4) in the where clause first, in the case that there is an effective index available on the third but none on the others.
The way the optimiser currently works is it will check which of the conditions in the where clause (or in the from clause) it can use to quickly reduce the number of rows involved in latter operations. A table scan followed by a filter is less desired (and has a higher cost) than an index seek
At the moment, the optimiser is a little shaky with constant evaluations in OR statements, where the other part of the OR references a column. Maybe that will change in future versions.
Remember as I said before. SQL's a declarative language, not a procedural language. In a declarative language (like prolog as well) you don't tell the execution engine what to do. You just tell it what you want out.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability