Large query with logic and filtering: SP, UDF, or View?

  • Grr...I typed out the whole post and hit Preview and it ate it.    Here we go again...

    I am on Server 2000 and I have a puzzling situation:

    The gist is I have a query that returns a rather large set of records (~ 4 million) on which I will perform aggregate functions and other statistical analyses later on.  The query uses some logic in several fields, in which one case pulls a value into the field from one table, but in which another case pulls a value into the field from a different table.  Here is the general idea, but the actual query is a bit longer:

    SELECT col1, (CASE WHEN firstTable.logicValue = 1 THEN firstTable.val ELSE secondTable.val END) AS col2, firstTable.startDate, firstTable.stopDate FROM firstTable LEFT OUTER JOIN secondTable ON firstTable.ID = secondTable.ID

    Let's call the set of records this returns "BigTable".  I have this query implemented in a view right now--let's call it vwBigTable--and it works fine.  The problem, though, is that this takes several seconds to run due to all the logic and the large number of records.  So what I want to do is allow the user to subset this query by date in hopes of speeding it up a bit, which would entail passing a the start and stop dates as parameters to the view/SP/UDF.  But I need to be able to run queries like "SELECT * FROM vwBigTable", so a stored procedure won't work in this context ("SELECT * FROM stprocBigTable" bombs out), but I can't pass my date parameters into a view.  So I'm stuck.

    My rationale is this:  I *could* just do "SELECT * FROM vwBigTable WHERE startDate >= '7/5/2005'" but it's instinctive to me that it would be faster if I could do something more like "SELECT * FROM fnBigTable('7/5/2005')" (which is what I want).  In the former case, I suspect that vwBigTable has to completely run and therefore do all the CASE comparisons and THEN subset the whole thing by date.  I would think in the latter case that the date filtering would be done *before* the CASE logic was applied to *all* the records.  

    I'm relatively new to SQL Server so my instincts here may be wrong, but in any event I would really appreciate any suggestions you can offer.

    Thank you!

  • Have you tested each case (view/function)?? The optimizer should come up with the same plan.

    Since you have a left join am indexed view is out of the question.

    I think that the fasted plan you can come up with would be to have select * from Table where datecol between @Start and @End, assuming that DateCol is the clustered index. I can't tell you more without having the actual DDL with indexes.

  • besides Remi's suggestions wich you must implement I would advice you not to write select * from vwBig... just request the fields that are going to be used and not a single more. If the number of records retured are too much you shoud be looking at some offline consolidation strategy also

    Just my $0.02

     


    * Noel

  • My use of * in the examples was purely for illustrative purposes and brevity.

    The number of records returned is not too great.  The problem is the execution time.

  • OK, thanks very much.  I didn't think the execution plan would be the same for both, but it would make me happy if it is.  I may test it out and see if they are indeed the same.

    Unfortunately I am writing this code for a database I didn't design, and I have a feeling the big problem here is that the database is not making good use of indices.

  • It has been my experience that views are usually ok and fuctions sometimes are sometimes aren't so you will need to test both!!

    As for indexes yes you have to make sure you are getting seeks and not scans, as much as possible

     


    * Noel

  • On 4M rows, scans must be out of the question and stay that way...

    Can you post the execution plan of the query you think works the best? Maybe we can send some pointers on it.

  • ON 4M rows yes but not on : SELECT * FROM vwBigTable WHERE ...


    * Noel

  • The gist is I have a query that returns a rather large set of records (~ 4 million) on which I will perform aggregate functions and other statistical analyses later on. The query uses some logic in several fields, in which one case pulls a value into the field from one table, but in which another case pulls a value into the field from a different table. Here is the general idea, but the actual query is a bit longer:

    The where better gives a clustered index seek and no scan index scan on further indexes .

  • I can't post that due to the sensitive nature of the data here but I can say you guys were correct that it is doing clustered index scans, so clearly some work is needed.  I will talk with the DBA and see if we can get the indexes up to snuff.

    Thanks a lot for your advice; it's greatly appreciated.

  • If you post the query and the execution plan we can give a hand... if he's too busy.

  • Okay here's what I've got.  I've changed the table and field names to protect the innocent, so I hope this still makes sense.  The simple query query:

    SELECT * FROM vwBigTable where StartTime > '4/4/05' AND StopTime < '7/6/05'

    Here is the view query with a heavily shortened SELECT clause (the actual clause has about 22 fields, 9 of which are decided by logic which chooses which table to pull those 9 values from):

    CREATE VIEW dbo.vwBigTable AS

    SELECT Orig.FileID, Orig.Filename, Orig.FileSize, Orig.NumberOfRecords, CASE WHEN Cor.FileID IS NULL THEN Orig.StartTime ELSE Cor.StartTime END AS StartTime, CASE WHEN Cor.FileID IS NULL THEN Orig.StopTime ELSE Cor.StopTime END AS StopTime, FROM dbo.tblOriginal Orig LEFT OUTER JOIN dbo.tblCorrections Cor ON Orig.FileID = Cor.FileID

    Here is the point of this mess:  tblOriginal has about 63,000 (NOT 4M as I previously said; some reworking has been done so let's say 63K for now, sorry about that) records or so, while tblCorrections has about 12.5K.  The Corrections table contains a select 9 fields that we know are sometimes wrong in the Original table (e.g. StartTime and StopTime in this example) for some FileIDs, and what I'm trying to do with my view is look at the "merged, corrected" table.  The tblCorrections structure does *not* mirror tblOriginal--it only contains 9 of the fields that are in tblOriginal, which is why I have to do all this absurd field-by-field logic.

    So let's say tblOriginal has FileIDs from 1 to 63,000, and tblCorrections has 12.5K FileIDs sparsely spread between 1 and 63,000.  I do a left join between tblOriginal and tblCorrections, and wherever tblCorrections.FileID isn't null, I know there is a Corrections entry for that fileID, and I want to pull each of my 9 fields into the view from tblCorrections; if tblCorrections.FileID IS null, there are no available corrections so I pull into the view the 9 values from tblOriginal.

    Currently this query takes 16 seconds and returns 62K rows.  Execution plan (is there a way to get this from Server in text?):

    1a) Clustered Index Scan: tblOriginal.PK_tblOriginal (75%), Row Count 62803, Estim Row Size 105

    1b) Clustered Index Scan: tblCorrections.PK_tblCorrections (9%), Row Count 12452, Estim Row Size 207

    2) Merge Join / Right Outer Join (10%), Row Count 62803, Estim Row Size 287

    3) Filter (7%), Row Count 62800, Extim Row Size 287

    4) Compute Scalar (0%), Row Count 62800, Estim Row Size 184

    5) SELECT (0%), Row Count 62800

     

    Steve

  • SET SHOWPLAN_TEXT ON

    GO

    Select id, name from dbo.SysObjects

    GO

    SET SHOWPLAN_TEXT OFF

    >>>

    StmtText

    ----------------------------------------------------------------------------

    |--Index Scan(OBJECT[Documentation].[dbo].[sysobjects].[ncsysobjects]))

    I'll wait for the actual plan to make a final verdict but I think this can be optimized a lot still.

  • OK here's all I get out of it.  Sadly a lot of it gets cut off but you get the idea.  Again, names were changed where necessary.

    SELECT * FROM vwBigTable where starttime > '4/4/05' AND stoptime < '7/6/05'

      |--Compute Scalar(DEFINE: ([Expr1004]=If ([tblCorrections].[FileID]=NULL) then [tblOriginal].[SomeField] else [tblCorrections].[SomeField], [Expr1005]=If ([tblCorrections].[FileID]=NULL) then [tblOriginal].[StartTime] else [tblFileID...

    |--Filter(WHERE: (If ([tblCorrections].[FileID]=NULL) then [tblOriginal].[StartTime] else [tblCorrections].[StartTime]>Convert([@1]) AND If ([tblCorrections].[FileID]=NULL) then [tblOriginal].[StopTime] else [tblBlo...

    |--Merge Join(Right Outer Join, MERGE: ([tblCorrections].[FileID])=([tblOriginal].[FileID]), RESIDUAL: ([tblOriginal].[FileID]=[tblCorrections].[FileID]))

    |--Clustered Index Scan(OBJECT: ([MyDatabase].[dbo].[tblCorrections].[PK_tblCorrections]), ORDERED FORWARD)

    |--Clustered Index Scan(OBJECT: ([MyDatabase].[dbo].[tblOriginal].[PK_tblOriginal]), ORDERED FORWARD)

  • I can't help you if you don't show me the whole thing. The only information I can give you is that it would probabely be faster to switch the clustered index to the date columns on each table and filter on that first (use 1 big query instead of a query on the view to filter directly).

    Can you tell us what columns are indexed?

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply