Index Tuning Wizard On Multiple Databases, and Joins From Calculations

Question

Index Tuning Wizard On Multiple Databases, and Joins From Calculations

Steve Fontaine

SSC-Addicted

Points: 415
More actions
August 15, 2005 at 2:14 pm

#110808

I'm having trouble running the Index Tuning Wizard for the first time. Let me try to explain. I have a large table (~200K rows) and a corrections table that contains corrected fields for a subset of the original table's rows. I have to merge the original table with the corrections, then I have to perform many queries (as UDFs) on the merged table. I -was- using a UDF to merge the two tables (let's call it fn_tblMerged) in all of my queries, but it's just too slow to do the merge every single time I run a query on it. So I started using an overnight job to merge the tables and write it out to a new table--call it tblMerged--and now I run my queries on tblMerged instead of fn_tblMerged.
Well now the table is large enough again that my queries are too slow, even without using the UDF for merging on the fly, and I'm pretty confident it's an indexing problem. So I tried to run the index tuner. I created a workload file in 2 ways and had problems with each:
1. I wrote out about 10 queries in the form of "SELECT * FROM fn_WhateverUDF(params)". The wizard ran but when it was finished it had no suggestions for index changes.
2. Second, I copied the actual SQL code of the UDFs into the workload file. Then the wizard said, "The workload does not contain any events or queries that can be tuned against current database. Check to see that the trace contains valid events or that the SQL script contains valid queries and you are tuning against the right database."
Now the little caveat here, I think, is that many of my UDFs are joining tables in OTHER databases (same computer), and I'm beginning to think that Tuning Wizard is having trouble with this. Does anyone know if it's possible to run the tuning wizard on queries that join tables across more than one database?

Secondly: I have a join optimization question. "SELECT * FROM Tbl1 INNER JOIN Tbl2 ON Tbl1.Name = LEFT(Tbl2.Name, 6)". Do I lose benefits from indexing by joining on LEFT(Tbl2.Name,6) because this is calculated? The problem is I have external data that is not allowed to be changed, and in addition to joining on Tbl1.Name = Tbl2.Name, I have to ALSO join on Tbl1.Name=LEFT(Tbl2.Name,6). This is because sometimes the data I receive contains either the full Name or only the first 6 characters of Name, so I have to check both cases on my join. Is there a smarter/faster way of acheiving this?
Thank you!

Viewing 15 posts - 1 through 15 (of 42 total)

You must be logged in to reply to this topic. Login to reply

Ian Yates SSCoach Points: 19738 More actions · Answer 1

For your first problem, take the 10 queries, put them into a query analyser window one by one and run them, but make sure you goto Query->Show Execution Plan.

Then look at the graphical plan - if you see any "table scans", you should have an index. If you see clustered index scans, then it depends on whether your query is getting data from the particular table in a definite range (in which case a scan is ok) or if it is scattered data from the table (in which case a scan is probably not ok and you should create a more particular index).

As for your second problem, again, put it into query analyser with and without the LEFT function. If one uses an index and the other does not, then you are losing the benefits of any indices you have. I think however, it should be ok as you still get the benefits of indices using the like operator - eg like 'abcdef%' with the % sign at the end...

If LEFT does penalise you, consider rewriting it to use the like operator instead? Not sure how you might do this without LEFT though

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 2

I understand what you mean about the table scans. In some instances I am getting table scans, and in others I get index scans. Never index seeks. Frankly, I am self-taught on this stuff and I have yet to see a good writeup online about how to choose indexes and why the good ones work so well. If you have any ideas, please let me know.

On the second thing, thanks, I'll try that. I actually originally had LIKE there but I figured there was no way it would be faster than LEFT. Thanks.

Still, if anybody has any idea why I'm having trouble with the Index Tuner--particularly across tables in 2 databases--please let me know. Thank you.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 3

Convert the udf into a single master query and it'll solve your performance problem. It'll still be somewhat slow but it will take only a few minutes at worst.

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 4

Remi, I would have done that but the merge query is rather lengthy and I have to propogate this change into a LOT of of other UDFs. (I must add, one thing I've found in my database work is that good OO programming does NOT make for good queries.) So I chose to just run the job overnight to create the merged table.

Essentially, I am trying to convert all my queries so that none of them call any other functions. Just pure SQL with no UDF or view or SP calls. It's still too slow. That's why I'm asking about the indexing.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 5

You're working with 200k lines, it won't take 1ms to run the query .

can you send the code of the udf to see if it can be included in the main select (send the main select of course)?

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 6

Here is the code from one of my UDFs. Changes have been made to the table and database names to protect the innocent (and my job ) but otherwise it's identical to my UDF. When I put this text in a workload file and replace @StartTime and @StopTime with NULLs, I get the message I talked about in item #2 in my first post in this thread. It runs in Query Analyzer.

The problem is that tbl and tblMerged are going to get significantly larger as the coming months pass. Now, I know this isn't going to run in 1ms as you said, but I would ideally like to see this happen in under 30 seconds with, say, 1mil-2mil records, if such a thing is possible. This is running on a Dell server machine with RAID and dual Xeons. Currently, the following query takes about 40 seconds to run; I've broken it up for easier reading.

Notice the outer join at one point; unless there is a better way, I have to use this, as all VehicleIDs from {tblVehicle inner join tblMerged} must be included in the results whether or not there is a match in tblVehicleStart. Then, if there is NOT a match in tblVehicleStart, I have to do a calculation for that field (MilesDriven). I apologize for the complexity of this, but here's what I'm facing...

*Also*: Notice the [Common] prefix on many fields and tables. Common is the external database (same computer) from which I'm pulling data, so anywhere that you don't see a [Common] indicates that the data is being pulled from the same database in which this UDF is located.

SELECT [Common].dbo.tblVehicle.VehicleID,

SUM(tblMerged.StopOdometer - tblMerged.StartOdometer) AS MilesCollected,

MAX(tblMerged.StopOdometer) -

(CASE WHEN ((@StopTime IS NULL) AND (NOT tblVehicleStart.StartMileage IS NULL))

THEN dbo.tblVehicleStart.StartMileage

ELSE MIN(tblMerged.StartOdometer) END) AS MilesDriven,

MAX(tblMerged.StopOdometer) AS LastOdometer

FROM tblMerged

INNER JOIN [Common].dbo.tblVehicle ON (tblMerged.VIN = [Common].dbo.tblVehicle.VIN OR

tblMerged.VIN = RIGHT([Common].dbo.tblVehicle.VIN, 6) OR

RIGHT(tblMerged.VIN, 6) = [Common].dbo.tblVehicle.VehicleID)

LEFT OUTER JOIN dbo.tblVehicleStart ON [Common].dbo.tblVehicle.VIN = dbo.tblVehicleStart.VIN

WHERE (tblMerged.StartOdometer <> 0) AND (tblMerged.StopOdometer <> 0)

AND (tblMerged.StartTime >= @StartTime OR @StartTime IS NULL)

AND (tblMerged.StopTime <= @StopTime OR @StopTime IS NULL)

GROUP BY [Common].dbo.tblVehicle.VehicleID, dbo.tblVehicleStart.StartMileage

I'm not posting the query that generates the merged table because at this point I am not interested in putting that code into all my UDFs. It involves about 20 fields with around 10 CASE statements. Instead I am creating that merged table every night and trying to index it appropriately. Thanks.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 7

You run this query for every line of the main query??????? no wonder it's so slow.

WHERE (tblMerged.StartOdometer 0) AND (tblMerged.StopOdometer 0)

AND (tblMerged.StartTime >= @StartTime)

AND (tblMerged.StopTime <= @StopTime)

just set @StartTime to 0 and @EndTime to GetDate() when they are null

(tblMerged.VIN = [Common].dbo.tblVehicle.VIN OR

tblMerged.VIN = RIGHT([Common].dbo.tblVehicle.VIN, 6) OR

RIGHT(tblMerged.VIN, 6) = [Common].dbo.tblVehicle.VehicleID)

anyway you can go around the left/right use on the VIN??

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 8

I think we're not on the same page with this "main query" idea. All I'm saying is, the query I posted above takes about 40 seconds to run currently, by itself. For the time being, let's just say I'm focused on making that query run fast.

I was under the impression that using the WHERE clause like I had it was faster: WHERE (StartTime >= @StartTime OR @StartTime IS NULL). My expectation was that SQL Server would realize that @StartTime is NULL and just say "OK, this clause is true for every record, so I don't have to check StartTime >= @StartTime anymore". Is it not this smart? On the other hand, I expected that if @starttime is NOT NULL, then it would automatically know it must check StartTime >= @StartTime for every row.

The other thing is, @StopTime could be greater than GetDate() (but it would be safe to set @StartTime=0).

Sadly there is no way to get around the left/right on the VIN. I voted for using the full VIN when we propogate the data tables, but they refuse to change the original data. I am tempted to also store the Right 6 as a separate field in tblMerged and then do this:

tblMerged.Right6VIN = RIGHT([Common].dbo.tblVehicle.Right6VIN, 6)

tblMerged.Right6VIN = [Common].dbo.tblVehicle.VehicleID

In terms of normalization this is not great, but I suspect it would be a bit faster. I know this is a funky where clause, but my superiors refuse to store the data differently. I am happy--I think--with using RIGHT(tblVehicle,6) because tblVehicle's size is fixed at only around 80 rows.

Any other suggestions? Again, particularly I'm concerned with the indexing. Thanks as usual for your help!

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 9

Well my first concern is making sure that the provided query can even use indexes. As it is written you'll get scans on any index (assuming you have the correct ones), even if you didn't have indexes it probabely wouldn't make much of a difference because you'd probabely get clustered index scans anyways since it'd be faster than do multiple index scans + merge join (or index intersection) + bookmark lookups.

Can you ship the text plan of the query?

SET SHOWPLAN_TEXT ON

GO

--Select query here

GO

SET SHOWPLAN_TEXT OFF

Paste the results here.

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 10

Certainly. This is the workplan of the original query I gave a couple posts back (without the changes we talked about).

|--Compute Scalar(DEFINE: ([Expr1009]=[Expr1007]-If ([tblVehicleStart].[StartMileage]<>NULL) then [tblVehicleStart].[StartMileage] else [Expr1008]))

|--Parallelism(Gather Streams)

|--Stream Aggregate(GROUP BY: ([tblVehicleStart].[StartMileage], [tblVehicle].[VehicleID]) DEFINE: ([Expr1006]=SUM([partialagg1015]), [Expr1007]=MAX([partialagg1016]), [Expr1008]=MIN([partialagg1017])))

|--Sort(ORDER BY: ([tblVehicleStart].[StartMileage] ASC, [tblVehicle].[VehicleID] ASC))

|--Parallelism(Repartition Streams, PARTITION COLUMNS: ([tblVehicleStart].[StartMileage], [tblVehicle].[VehicleID]))

|--Hash Match(Partial Aggregate, HASH: ([tblVehicle].[VehicleID], [tblVehicleStart].[StartMileage]), RESIDUAL: ([tblVehicle].[VehicleID]=[tblVehicle].[VehicleID] AND [tblVehicleStart].[StartMileage]=[tblVehicleStart].[StartMileage]) DEFINE: ([partialagg1015]=SUM([tblMerged].[StopOdometer]-[tblMerged].[StartOdometer]), [partialagg1016]=MAX([tblMerged].[StopOdometer]), [partialagg1017]=MIN([tblMerged].[StartOdometer])))

|--Hash Match(Right Outer Join, HASH: ([tblVehicleStart].[VIN])=([tblVehicle].[VIN]), RESIDUAL: ([tblVehicle].[VIN]=[tblVehicleStart].[VIN]))

|--Parallelism(Distribute Streams, PARTITION COLUMNS: ([tblVehicleStart].[VIN]))

| |--Clustered Index Scan(OBJECT: ([].[dbo].[tblVehicleStart].[PK_tblVehicleStart]))

|--Parallelism(Repartition Streams, PARTITION COLUMNS: ([tblVehicle].[VIN]))

|--Nested Loops(Inner Join, WHERE: (([tblMerged].[VIN]=[tblVehicle].[VIN] OR [tblMerged].[VIN]=right([tblVehicle].[VIN], 6)) OR right([tblMerged].[VIN], 6)=[tblVehicle].[VehicleID]))

|--Clustered Index Scan(OBJECT: ([].[dbo].[tblMerged].[PK_tblMerged]), WHERE: ([tblMerged].[StartOdometer]<>0 AND [tblMerged].[StopOdometer]<>0))

|--Table Spool

|--Index Scan(OBJECT: ([Common].[dbo].[tblVehicle].[IX_tblVehicle_VIN]))

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 11

Hmm.. may I ask for the ddl of the tables.

Can you change the clustered index on the tables?

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 12

Pardon the newbness but you'll have to define "ddl," sorry. And yeah I probably could change the clustered index as long as it doesn't destroy anyone else's query efficiency.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 13

ddl = data definition langage. the script to create the table. If you could switch the clustered index to starttime, endtime, vehiculeid, starttime, endtime. That way you could probabely get index seeks.

Also for the vin, you can always add a calculated column, index it and do the join on that. So instead of a full scan, you'd get index seek+bookmark lookup which would be much faster.

Steve Fontaine SSC-Addicted Points: 415 More actions · Answer 14

Here is, I believe, all that pertains to this query:

CREATE TABLE [tblMerged] (

[ID] [int] IDENTITY (1, 1) NOT NULL ,

[VIN] [varchar] (17) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

[StartTime] [datetime] NULL ,

[StopTime] [datetime] NULL ,

[StartOdometer] [float] NULL ,

[StopOdometer] [float] NULL ,

CONSTRAINT [PK_tblMerged] PRIMARY KEY CLUSTERED

(

[ID]

&nbsp ON [PRIMARY]

) ON [PRIMARY]

GO

CREATE TABLE [tblVehicleStart] (

[VIN] [varchar] (17) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,

[StartMileage] [float] NULL ,

CONSTRAINT [PK_tblVehicleStart] PRIMARY KEY CLUSTERED

(

[VIN]

&nbsp ON [PRIMARY]

) ON [PRIMARY]

GO

CREATE TABLE [tblVehicle] (

[VehicleID] [varchar] (6) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,

[VIN] [varchar] (17) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,

CONSTRAINT [PK_tblTruck] PRIMARY KEY CLUSTERED

(

[VehicleID]

&nbsp ON [PRIMARY] ,

) ON [PRIMARY]

Basically, what you last said was what I was curious about. The RIGHT( ,6) operation, I imagine, is expensive to do a few hundred thousand times. I have not experimented with covering indexes, which is what I suspect I need here.