January 31, 2008 at 2:54 pm
I have spent the last few weeks trying to improve a web based systems performance by
looking at badly performing queries (suffering from blocking chains / DB timeouts etc) and
tweaking the SQL and adding/editing the indexes.
I have managed to cure about 90% of all the error reports we have been getting so its down to a trickle rather than a torrential down pour but today I have been looking at a particular error that is causing DB Timeout expired errors. The offending code is a stored procedure that carries out 2 select staments.
The first looks up a particular customers featured job allowance and then the second works out that customers current number of live featured jobs. It then does some tests on the local variables to see whether they are still within or over their allowance etc.
the 2 main parts of the proc are:
SELECT@FeaturedAllowance = FeaturedAllowance
FROMCLIENTS
WHERESiteFK = @SitePK AND
ClientPk = @ClientPK
SELECT@NoFeaturedJobs = count(JobPK)
FROMJOBS
WHEREClientFk= @ClientPK AND
SiteFk= @SitePK AND
FeatureScore BETWEEN 1 AND 5AND
Live= 1
I ran both statements together with execution plans showing and statement 1 is 1% cost and statement 2 is 99% relative to the batch.
Now I currently have 5 indexes on the JOBS table which is constantly updated/inserted as well as being selected from throughout the day.
The existing indexes on the table before I made any changes were the following:
1: PK_JOBS (JobPK) clustered
2: nclidx_JOBS_JobPK_ClientFK_Live
3: nclidx_JOBS_Live_TempMode_FeatureScore
4: nclidx_JOBS_SiteFK_ClientFK_UserFK
5: nclidx_JOBS_CreateDate_LiveDate_ExpiryDate
I can't remember exactly which indexes the query plan was originally using before I changed anything but I know it had 3 parts (clustered index seek and 2 scans I think)
So the 2nd select was taking anywhere between 10 and 30 seconds which wasn't good. I decided that rather than create a totally new index for this query to instead add the columns (FeatureScore and SiteFK) to index (2 = nclidx_JOBS_JobPK_ClientFK_Live) which I had created the other day
for another query that resolved a different timeout issue. So I added in those 2 columns and then rebuild the index. I also updated statistics for the whole database. When I next ran the query though the costs & speed was exactly the same (10-30 seconds, 99% cost), and although the query plan
had changed it hadn't changed to use the index I expected but was now consisting of 2 parts using a combination of index 1 PK_JOB and doing a lookup and index 4 nclidx_JOBS_SiteFK_ClientFK_UserFK.
The plan is here.
StmtText
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1012],0)))
|--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
|--Filter(WHERE:([XXXXX].[dbo].[JOBS].[featureScore]>=(1) AND [XXX].[dbo].[JOBS].[featureScore]<=(5) AND [XXX].[dbo].[JOBS].[live]=(1)))
|--Nested Loops(Inner Join, OUTER REFERENCES:([XXXX].[dbo].[JOBS].[JobPK], [Expr1011]) OPTIMIZED WITH UNORDERED PREFETCH)
|--Index Seek(OBJECT:([XXX].[dbo].[JOBS].[nclidx_JOBS_SiteFK_ClientFK_UserFK]), SEEK:([XXXX].[dbo].[JOBS].[SiteFK]=(59) AND [XXX].[dbo].[JOBS].[ClientFK]=(3184)) ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([XXXX].[dbo].[JOBS].[PK_JOBS]), SEEK:([XXXX].[dbo].[JOBS].[JobPK]=[XXXX].[dbo].[JOBS].[JobPK]) LOOKUP ORDERED FORWARD)
(6 row(s) affected)
I have also put the query into the query optimiser which said it could improve it by 98% by creating the following index:
CREATE NONCLUSTERED INDEX [_dta_index_JOBS_5_930154409__K2_K3_K18_K16_K1] ON [dbo].[JOBS]
(
[SiteFK] ASC,
[ClientFK] ASC,
[featureScore] ASC,
[live] ASC,
[JobPK] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
Which is almost identical to the index I created earlier except in mine I put the columns in order of most selective column first. The definition of my index is:
CREATE UNIQUE NONCLUSTERED INDEX [nclidx_JOBS_JobPK_ClientFK_SiteFK_FeatureScore_Live] ON [dbo].[JOBS]
(
[JobPK] ASC,
[ClientFK] ASC,
[SiteFK] ASC,
[featureScore] ASC,
[live] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
So the questions are:
1. Why isn't the query optimiser using the covering index that I would expect it to. Am I missing something or should I be doing something else like add index hints to the SELECT?
2. When you use the query optimiser tool is it clever enough to know that there are already indexes on the table in question and therefore it wouldn't suggest an index when that index already existed on the table. Or does it know that there is an almost identical index on the table but still thinks
that changing the order of the columns would speed it up by 95% >?
3. At what number of indexes on a table does it become total overkill. Should I create an index per query even if the columns in existing indexes are very similar. E.G before I touched the existing index all the columns in the SELECT could be found in 3 of the existing indexes on that table.
I thought that the optimiser could use multiple indexes to calculate a plan and so trying to create a covering index per query was not needed. Someone here tells me they have 28 non clustered indexes on their main table but I cannot believe that is a good idea.
The JOBS table is always getting updated and I need the save process to be as fast as possible so creating more indexes is more time to save.
4. Does anyone have any other suggestions of how I can speed this query up?
From sys.dm_db_index_usage_stats for the table in question
NameType_Descuser_Seeksuser_scansuser_lookups
PK_JOBSCLUSTERED81152154652326
nclidx_JOBS_JobPK_ClientFK_SiteFK_FeatureScore_LiveNONCLUSTERED10833980
nclidx_JOBS_SiteFK_ClientFK_UserFKNONCLUSTERED4633245270360
nclidx_JOBS_Live_TempMode_FeatureScoreNONCLUSTERED11807801140
nclidx_JOBS_CreateDate_LiveDate_ExpiryDateNONCLUSTERED24469136070
Thanks in advance for any help.
January 31, 2008 at 3:26 pm
I have actually just dropped the index I created and implemented the index suggested by the query optimiser wizard which had exactly the same columns but in a different order and low and behold it has speeded up dramatically, the cost has dropped from 99 to 50% of the batch and the query plan has changed to use that 1 index e.g.
StmtText
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1011],0)))
|--Stream Aggregate(DEFINE:([Expr1011]=Count(*)))
|--Index Seek(OBJECT:([XXX].[dbo].[JOBS].[JOBS_SiteFK_ClientFK_FeatureScore_Live_JobPK]), SEEK:([XXX].[dbo].[JOBS].[SiteFK]=(59) AND [XXX].[dbo].[JOBS].[ClientFK]=(3184) AND ([XXX].[dbo].[JOBS].[featureScore], [XXX].[dbo].[JOBS].[live]) >= ((1), (1)) AND ([XXX].[dbo].[JOBS].[featureScore], [XXX].[dbo].[JOBS].[live]) <= ((5), (1))), WHERE:([XXX].[dbo].[JOBS].[live]=(1)) ORDERED FORWARD)
Now I always thought and have read that when creating indexes you should add the most selective columns first which is what I did with my index and thats the only difference between the two.
Have I missed something??
January 31, 2008 at 4:39 pm
A simple (but usually accurate) way of thinking about an index is to compare it to the index in a book or a phone book. In this case, think of it like a phone book. Let me change your select slightly to the phone book analogy:
SELECT count(PhoneNumber)
FROM Book
WHERE LastName = @LastName AND
FirstName = @FirstName AND
ZipCode BETWEEN 10000 AND 50000
The index you created would be similar to PhoneNumber, Lastname, Firstname, Zipcode
Now... the query you send says "Give me the number of phone numbers for Smith, Michael with a zipcode between 10000 AND 50000". If the phone book is sorted by phone number, then by last name, then by firstname.... you have to read the whole book to find out how many lines have the right lastname and firstname and zipcode. But if the phonebook were reordered to have last name then first name then zip, you could quickly turn to the Smiths, find the group of Michaels, look at each of the zip codes and when you were done with that section, you wouldn't have to go through the rest of the book. Much faster.
The selectivity comment you made was correct. If there are fewer "Michaels" than "Smiths", the phone book would be better ordered by firstname, lastname... but ordering it by phone number when phone number isn't part of the criteria to reduce rows will not help....
Hope this makes sense,
Chad
January 31, 2008 at 11:24 pm
SELECT @NoFeaturedJobs = count(JobPK)
FROM JOBS
WHERE ClientFk = @ClientPK AND
SiteFk = @SitePK AND
FeatureScore BETWEEN 1 AND 5 AND
Live = 1
In general, for covering indexes, you want the columns in the following order in the index. (NB - General. There are many cases where you would want a different order)
1) equalities in the where and join clause
2) inequalities in the where and join clause
3) Other columns references, either as index keys, or as include columns.
Since (I assume) you're using SQL 2005. This index is prbably best for that query.
Create index idx_JOBS_ClientKKSiteFKLiveFeatureScore ON Jobs
(ClientFK, SiteFK, Live, FeatureScore)
INCLUDE (JobPK)
btw, since JobPK is the primary key, and hence the column is not nullable, Count(JobPK) is equivalent to Count(*) and you can rewrite that query as follows.
SELECT @NoFeaturedJobs = count(*)
FROM JOBS
WHERE ClientFk = @ClientPK AND
SiteFk = @SitePK AND
FeatureScore BETWEEN 1 AND 5 AND
Live = 1
In that case, you don't need the JobPK as an include column in the index at all.
Does that help?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 1, 2008 at 4:47 am
Yes thanks for both those replies they really explained the issue very well.
So even though JobPk is the most selective item in that table (a different value per record) it doesn't help at all to have the index ordered by that column first as the values for that column are not needed to be known to satisfy the query. So the optimisers choice of putting it as the last ordered column was correct and if I rewrite the query to use Select(*) instead of select(jobPK) I could in fact remove it from the index completely.
February 1, 2008 at 5:07 am
Yup.
The whole issue of 'most selective column first in an index' is a bit misleading. SQL can, and does, do multi-column seeks as one operation. What you want is the columns left-most in the index to reduce the rows that need to be onsidered as fast as possible.
To use a quick made up example...
Create table Testing (
ID int identity primary key, -- the clustered index by default
SomeDate DATETIME,
Status tinyint
)
Let's say there are 10 day's data in there, 1000 rows per day. Since the date column stores date and time, it's very, very selective. (almost unique)
There are 10 possible statuses evenly distributed. so each status appears about 1000 times.
If we do a query
Select count(*) from Testing
where Status = 7 and SomeDate between '2008/01/31' and '2008/01/31 23:59:59.997'
There are 2 possible indexes we could consider for that query. SomeDte, Status and Status, someDate.
In this case, the index with status first is more appropriate, even though SomeDate is more selective than Status.
Using the index with Status first, SQL scan seek directly to the first row it wants - Status = 7 and the first Somedate > '2008/01/31' and then read all the rows in the index until it reaches the last date in the range that it needs. Total rows touched, about 100.
If, however, the index was on Somedate and status, SQL would have to seek to the first row of the date range, read all the rows of the date range it needs, then filter out the unwanted statuses. Total rows touched - 1000.
Does that help?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 1, 2008 at 5:39 am
Gail , Chad
Superb explanation about indexes , especially Gails example.
Rob:
You said you dropped a index you had created to solve a time out issue and instead created the index suggested by the optimizer. Did your other time out issue (not the issue asked in this forum) come up. Was there any difference in that other query after creating the new index as suggested by the optimizer.
"Keep Trying"
February 1, 2008 at 6:07 am
Rob Reid (1/31/2008)
3. At what number of indexes on a table does it become total overkill.
There's no hard and fast number. It depends on the datba and the data. If I have a tables that's mainky used for reporting and updated once a month, I'll probably index it far more thn a table that's getting updated every minute.
Should I create an index per query even if the columns in existing indexes are very similar.
No. What you want is the smallest possible set of indexes that satisfy as many queries as possible, either completely or partially.
Queries can use any index where the columns that are filtered on (in where or join) are a left-based set of the indexes columns.
eg. Consider an index Col1, Col2, col3
A query that filters on Col1 and Col2 can do a seek on that index.
A query that filters on Col2 and Col3 cannot. It can at best do a scan.
(Think of a telephone book if you're confused as to why, and consider that a telephone book is an index of the people in an area, on columns (surname, firstname, middle initial, address, phone number))
I thought that the optimiser could use multiple indexes to calculate a plan and so trying to create a covering index per query was not needed.
It can, but it's quite unlikely to do so. More likely, the optimiser will pick an index that is the best of what's there, then lookup back to the cluster for missing columns, then filter unnecessary rows out.
Someone here tells me they have 28 non clustered indexes on their main table but I cannot believe that is a good idea.
Probably not, unless it's a data warehouse or similar. I would hope that table has at least 28 columns.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 1, 2008 at 6:47 am
Another good explaination Gila about the ordering of columns in the index. Very valuable info which I will put to good practise in future now that I understand a bit more about the subject.
What happened Chirag was that I had created an index to solve a timeout issue and with this problem I didn't want to create another index so I added the 2 extra columns into the existing index.
This didn't work so I dropped that index and recreated the original index for the original problem and implemented the optimisers suggestion for the new problem. So at the moment I have 2 separate indexes for 2 queries. However with the info I've just gained from Gila I reckon that I could rewrite that first query as well so that they both use the same indexes.
February 4, 2008 at 5:13 am
Rob
Thanks for the info. Yes rewriting the query to use the same index will be a good idea. Keep us posted.
"Keep Trying"
February 4, 2008 at 5:18 am
Could you post that other query please Rob?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 4, 2008 at 5:46 am
The original queries were
SELECTCount(JobPK)
FROMJOBS
WHEREClientFK in(X,X,X) AND
Live=1
OR
SELECTCount(JobPK)
FROMJOBS
WHEREClientFK = X AND
Liive=1
which I created a non clustered index with ClientFK,JobFk,Live (in that order)
I then replaced the COUNT(JobPk) with COUNT(*) and dropped the JobPk
however I have found this cool piece of code on MS KB that uses the new SQL 2005 missing
index functions and an MS Agent job every 30 mins that populates a table with recommendations
for either new indexes or to drop existing indexes.
http://msdn2.microsoft.com/en-us/library/ms345524.aspx
http://msdn2.microsoft.com/en-us/library/ms345407.aspx
http://msdn2.microsoft.com/en-us/library/ms345421.aspx
http://blogs.msdn.com/queryoptteam/archive/2006/04/06/570176.aspx
It came up with a suggestion for a totally new index that as well as ClientFK, Live had
4 included columns (JobPk, JobTitle,ExpiryDate,Email)
So rather than create that as well as the existing index I built the recommended index and dropped the old one and it seems to be working pretty well as if I query the system function under user_Seeks it has 1481 with 0 in user_scans and 0 in user_lookups.
select i.Name, i.Type_Desc, i.Is_unique, d.*
from sys.dm_db_index_usage_stats as d
join sys.indexes as i
on i.object_id = d.Object_id and
i.index_id = d.Index_id
where database_id=db_id() and
d.object_Id=object_id('JOBS')
order by name
I don't know what you think about those missing index system functions but from the recommendations they have been making it seems to be coming up with some relevant indexes. There is an option to actually have the system self-optimise by not just populating the recommendation table but actually creating/dropping the indexes it comes up with automatically but I haven't gone that far! Although I can see in the not to distant future some sort of auto-optimise tick box option that would put some DBAs out of a job.
February 4, 2008 at 6:09 am
Rob Reid (2/4/2008)
which I created a non clustered index with ClientFK,JobFk,Live (in that order)I then replaced the COUNT(JobPk) with COUNT(*) and dropped the JobPk
Looks pretty good. If you had left the JobPK, the best order would be ClientFK, Live, JobPK, since the filter is on the ClientFK and Live, and columns used in filters should be the left-most coumns in the index
What are the indexes you have on the table now?
I don't know what you think about those missing index system functions
The missing index DMVs are useful, but their suggestions should not be implemented without investgation. They are populated by the query optimiser and it only considers 1 query at a time. They do not make suggestions on clustere/nonclustered. They do not consider existing indexes and sometimes the suggested order of columns is not 100%.
If you implement all the suggestions of the missing index DMVs without consideration, you can easily end up with redundent indexes.
I often see things like this suggested:
1) Equality: Col1, Col2. Inequality: <None> Include: Col3, Col5
2) Equality: Col1, Col2. Inequality: Col3. Include: Col4
Now those can both be the same index. Key columns (in order) Col1, Col2, Col3. Include columns: Col4, Col5
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 4, 2008 at 6:43 am
Yes I have noticed very similar indexes being suggested. What the DB generated from the script does though is increment a counter per index so you can order by most occurances descending to see the queries that have been suggested the most times.
At the moment the indexes on the table are:
PK_JOBS (Clustered) JobPK
nclidx_JOBS_SiteFK_ClientFK_FeatureScore_Live_JobPk (optimiser suggested)
nclidx_JOBS_ClientFK_Live inc columns(JobPK,JobTitle,ExpiryDate,Email) (suggested by missing index function which I replaced the older nclidx_ClientFK_Live as I said earlier)
I also have these 3 indexes on the table which are a lot older and I haven't had a time to look at but from a cursory look they don't look very good and were probably created in haste without any real thought.
nclidx_JOBS_SiteFK_ClientFK_UserFK
nclidx_JOBS_Live_TempMode_FeatureScore
nclidx_JOBS_CreateDate_LiveDate_ExpiryDate
The table has been denormalised a bit and therefore contains the SiteFk and ClientFK which in reality could both be accessed by joining to the Client table on UserFK.
The createDate is the date the job was first added into the system.
The liveDate is first set to the same value as CreateDate and then when a job is edited / renewed it
will get updated to the date of that action. The liveDate is used a lot to determine which jobs to show on featured sections like Latest Vacancies & Featured Jobs etc
SELECT Top 5 JobPk, JobTitle, JobDesc FROM JOBS WHERE Live=1 And SiteFk=X AND TempMode=0 ORDER BY LiveDate DESC
or for featured jobs (with a featureScore > 0)
SELECT Top 5 JobPk, JobTitle, JobDesc FROM JOBS WHERE WHERE Live=1 And SiteFk=X AND FeatureScore> 0 AND TempMode=0 ORDER BY LiveDate DESC
The expiryDate is the date that the job will expire and come off the site. A job runs at night setting any jobs with an ExpiryDate<=Getdate() to have a live status of Live= -1.
Live=1 (live visible job), Live=0 (deleted), Live=-1 (expired)
TempMode is a bit flag set when the job is in the process of being saved but not fully completed yet eg TempMode=1. Once the jobs is saved correctly then tempMode=0.
There are lots and lots of queries that use many combinations of these columns in the job table. However I think all the columns that are either in a where,order,join clause do appear in at least one existing index. There are lots of columns that are only ever returned in the SELECT for display on the job view, job edit, job preview page e.g JobTitle, jobDescription, SalaryMin SalaryMax, StartDate etc etc. These are usually all accessed by something like SELECT JobTitle,JobDescription,StartDate FROM JOBS where JobPK=X AND SiteFk=X
so theres as much info as I can give you on my lunch break.
Thanks for helping!
February 4, 2008 at 9:05 am
Looks like you've mostly got it under control.
Regarding those 3 older indexes, and looking at what you've said, I'd suggest the following.
nclidx_JOBS_Live_TempMode_FeatureScore - Change to Live, TempMode, SiteFK, LiveDate, FeatureScore Include (JobTitle, JobDesc)
nclidx_JOBS_CreateDate_LiveDate_ExpiryDate - looks fairly useless as is. Maybe better just on LiveDate or just on Expiry date, though I'm unsure if either would be used.
nclidx_JOBS_SiteFK_ClientFK_UserFK - If you have any queries that filter on UserFK, maybe switch this around so that it has UserFK as the first column. Since you have another index starting with SiteFK, but none with UserFK.
Regarding the query
SELECT JobTitle,JobDescription,StartDate FROM JOBS where JobPK=X AND SiteFk=X
Is the filter on SiteFK necessary? Since JobPK is the primary key and, hence, unique.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
Viewing 15 posts - 1 through 15 (of 24 total)
You must be logged in to reply to this topic. Login to reply