September 24, 2010 at 3:20 pm
Hi folks,
I'm trying to optimize a query written by a previous developer, against a huge table (not my design), and I need some advice.
The table has over 5 million rows, and is around 1200 bytes wide. The query looks at 13 of this table's columns (in either the WHERE clause or SELECT list). However, instead of posting all that code & schema, I will distill this down to a conceptual example.
Let's say you have a poorly-designed table called Widgets, with the following columns: ID, Attrib_A thru Attrib_Z, plus Widget_Date. Keep in mind that Widget_Date can be changed at any time, so it's not a good candidate for a clustered index.
Let's say that the table has 5 million widget rows. And someone wrote a giant query against this table, joining against other tables, but most importantly doing something like:
SELECT w.Attrib_A, w.Attrib_B, w.Attrib_C, w.Attrib_D
FROM Widgets w
JOIN OtherTable o ON (o.ID = w.Attrib_E AND o.Misc = w.Attrib_F)
WHERE w.Attrib_G = 1234
AND w.Attrib_H = @SprocParam1
AND (w.Attrib_I = 'abcd' OR w.Attrib_J = @SprocParam2)
AND w.Attrib_K = @SprocParam3
AND w.Attrib_L is not NULL
AND w.Attrib_M = @SprocParam4
AND w.Attrib_N = @SprocParam5
AND (w.WidgetDate >= @StartDate AND w.WidgetDate <= @EndDate)
So after some research, I find out that the most pertinent columns are WidgetDate, Attrib_K and Attrib_M. That is to say, these columns will narrow down the results more than the other ones. If I were to query on only those 3 columns, I would get a reasonably small number of rows for most use cases (let's say 3000 rows). The other columns will narrow my results slightly further, and are necessary parts of the query, but make less of a difference in the number of rows (once I've narrowed using the first three).
I also notice that there are no composite or covering indices on the Widgets table. There are individual column indexes (e.g. nonclustered index for ID), and no clustered index.
So, my idea is to query on the 3 most pertinent columns, put those IDs into a temp table, then change the main query to inner-join against the temp table when it retrieves the rest of the columns (or uses them in the WHERE clause). Something along these lines:
CREATE NONCLUSTERED INDEX idxWidgetSearch ON Widgets (WidgetDate, Attrib_K, Attrib_M, ID) -- not sure if the order matters?
CREATE TABLE #tmpIDs (ID int)
-- Find the Widget IDs with specified date, K and M values
INSERT #tmpIDs
SELECT w.ID
FROM Widgets
WHERE (w.WidgetDate >= @StartDate AND w.WidgetDate <= @EndDate)
AND w.Attrib_K = @SprocParam3
AND w.Attrib_M = @SprocParam4
-- Join against temp table to retrieve/compare further information on above widgets
SELECT w.Attrib_A, w.Attrib_B, w.Attrib_C, w.Attrib_D
FROM #tmpIDs tmp
JOIN Widgets w ON (w.ID = tmp.ID)
JOIN OtherTable o ON (o.ID = w.Attrib_E AND o.Misc = w.Attrib_F)
WHERE w.Attrib_G = 1234
AND w.Attrib_H = @SprocParam1
AND (w.Attrib_I = 'abcd' OR w.Attrib_J = @SprocParam2)
-- AND w.Attrib_K = @SprocParam3 -- don't need this anymore; see above
AND w.Attrib_L is not NULL
-- AND w.Attrib_M = @SprocParam4 -- don't need this anymore; see above
AND w.Attrib_N = @SprocParam5
-- AND (w.WidgetDate >= @StartDate AND w.WidgetDate <= @EndDate) -- don't need anymore
So now that I've explained the issue and my proposed solution, here are my questions...
In the temp table version of my code, will SQL Server limit the rows it reads from Widgets to only those which are also in #tmpIDs BEFORE it tries to look for all the other parameters in the WHERE clause? (Attrib_G, Attrib_H, etc.) In other words, will it do any sort of table/index scans against a greater set than #tmpIDs, costing me more query run time? Or should the inner join on #tmpIDs prevent that?
Is the above a good strategy for optimizing this query? Or is there another way that I'm not thinking of?
Would it make any sense to create a huge covering index for columns Attrib_A thru Attrib_N, plus WidgetDate? (So far I'm avoiding this because it seems ridiculous.)
I hope this example is clear enough. Any thoughts are welcome.
Thanks for your time,
Jordan
p.s. Difficulty: I don't have the luxury to create/modify table structures or normalize the data at this point.
September 24, 2010 at 3:39 pm
jordantx (9/24/2010)
In the temp table version of my code, will SQL Server limit the rows it reads from Widgets to only those which are also in #tmpIDs BEFORE it tries to look for all the other parameters in the WHERE clause? (Attrib_G, Attrib_H, etc.) In other words, will it do any sort of table/index scans against a greater set than #tmpIDs, costing me more query run time? Or should the inner join on #tmpIDs prevent that?
That's going to depend on what the optimizer thinks is a good idea according to statistics and what indexes are available to it. 3000/5million is within selectivity (.06%), but the QO has some funky ideas on occassion. You could force an index selection via hints if you wanted, then have it bookmark back to the clustered, and finish the where clause. It's worth exploring and seeing the query plan results.
Is the above a good strategy for optimizing this query? Or is there another way that I'm not thinking of?
Would it make any sense to create a huge covering index for columns Attrib_A thru Attrib_N, plus WidgetDate? (So far I'm avoiding this because it seems ridiculous.)
Your huge covering index would depend a lot on if you could significantly narrow down selectivity more. My understanding, and someone will chime in if I'm wrong, I'm sure (at least I hope), is that you need to get under .03% selectivity to guarantee a seek. On the otherhand, it certainly couldn't hurt to test this out unless that table has extremely heavy churn from an existing record edit perspective.
Part of the problem is the simplification takes away a few methods folks here will look to tune the process.
What's your current clustered index look like?
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
September 24, 2010 at 3:58 pm
You would be much off if you could create a reasonable, relevant clustered index. As Craig noted, SQL can be quirky when it comes to nonclus index usage.
A clus index seek could greatly limit the total number of rows to be scanned, and a clus index can be used even if it only cuts the total rows to be searched by a small amount.
A non-clus index would be dropped from consideration at likely no more than 10% of rows, and possibly much less.
Scott Pletcher, SQL Server MVP 2008-2010
September 24, 2010 at 4:43 pm
Thanks for the replies so far...
The Widgets table has no clustered index. The WidgetDate field is the single most relevant column for searching/querying, however it can change often as I mentioned, so it's a bad candidate for clustered.
The only field I could imagine making a good clustered index might be Widgets.ID. But I'm not expert enough in SQL to understand how that would improve my query performance.
September 24, 2010 at 4:53 pm
jordantx (9/24/2010)
Thanks for the replies so far...The Widgets table has no clustered index. The WidgetDate field is the single most relevant column for searching/querying, however it can change often as I mentioned, so it's a bad candidate for clustered.
The only field I could imagine making a good clustered index might be Widgets.ID. But I'm not expert enough in SQL to understand how that would improve my query performance.
You're in a heap, with no non-clustered indexing? :crazy:
Alright, the clustered index is the lookup from the nonclustered (if the non-clustered doesn't cover all needed fields), and it helps to feed back to the 'leaf level' of the table for all data. Suffice it to say a small, very unique clustered index usually helps.
You mention a lot of churn on the date, and I can understand why you'd be loathe to use that. You end up churning the physical table all over the place, and insert/updates can take a lot of time that way. However, at this point, ANY index would be better then nothing, at least for this query. With no index, clustered or non, you will never get any selectivity against the data in the table. You will always scan it end to end.
Any index you build with your date field involved, however, is going to churn. In this case (and I would test in your dev/qa system for performance differences) I would recommend slapping a clustered against the ID field and the non-clustered you thought about (the smaller, the better, churn wise) with the two primary fields in question, and the date field. You will probably be best off using the date first in that index, it will help give other queries a fighting chance as well, even though it will cause the highest churn. Order matters.
Be aware you'll probably need to reorganize that index on weekly or monthly maintenance windows.
Test it with and without the clustered on the dev system. See what gets you better performance. The reason I mention this is because with all the other selections, the optimizer may decide to go after the clustered index anyway, which means we'll need to hint the query to the non-clustered, and see if there's some secondary recommendations we can add to the non-clustered to help it with selectivity.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
September 24, 2010 at 4:54 pm
Double post, sorry. Dunno how I pulled that off.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
September 24, 2010 at 7:35 pm
Thanks a lot Craig!
Just to clarify, there are a number of non-clustered indexes on the Widgets table. They are all single-column indexes, not composite. However, there is no clustered index.
I think it's a good suggestion to create a clustered index on ID, and the non-clustered as you said, with date first. (I was thinking I should add the ID into the non-clustered one also, as you see from my CREATE INDEX statement example above. My thinking was that SQL Server would look up by date, then by the other two parameters, and the ID would be read back as a "covered" field preventing further reads. But maybe that's not necessary?)
Are you also suggesting that I go ahead with my idea of a 2-step process, selecting IDs into a temp table and then joining against it? Since posting this, I did try that strategy and it runs a lot faster than before (even without adding new indexes). But I can probably get even better performance with the clustered & composite non-clustered (plus index hints as you mentioned).
By the way, this was my first post, and I'm already impressed with the level of expertise on this board. Looking forward to participating further.
Cheers,
Jordan
September 27, 2010 at 9:39 am
I wouldn't create a clus index on an ident column. That will not help your queries. In fact, you will reduce the chance of using a nonclus index.
You need to pick a clus index that will be useful to your queries, and, if possible, one that reduces churn. If you can't avoid churn, you'll have to reduce the FILLFACTOR and reorg / rebuild more frequently.
Scott Pletcher, SQL Server MVP 2008-2010
September 27, 2010 at 10:00 am
scott.pletcher (9/27/2010)
I wouldn't create a clus index on an ident column. That will not help your queries. In fact, you will reduce the chance of using a nonclus index...
You will find quite a lot people here who would disagree with the above statement...
Personally I have nothing againt creating the clustered index on identity column and I do it most of the time in OLTP and OLAP.
May be the link in one of the post under the following topic will help you to deside...:
http://www.sqlservercentral.com/Forums/Topic434378-360-1.aspx#bm435117
September 27, 2010 at 10:02 am
one more url:
September 27, 2010 at 10:33 am
I don't object to an IDENTITY column in tables per se, and they have very legitimate uses.
But I dislike the idea that the default for a clus index should be an ident.
You need to consider the data very carefully first. Your one and only clus index should be very carefully chosen.
If you have a natural key that benefits from a range search, or for a value you (almost) always specify in a query, that will make a much better clus index.
Often a date or another increasing number, such as an invoice number, make a much more useful clus index key and also don't cause page splits (at least significant amounts of them).
For example, on some tables, it helps to cluster by column(s) that are always sorted on for output. If you don't, every query re-sorts the data and you kill your performance.
Scott Pletcher, SQL Server MVP 2008-2010
September 28, 2010 at 8:04 am
1) since you have no clustered index, you best check for forwarded records. See sys.dm_db_index_physical_stats on how to check for this. Note that you need to do DETAILED, which can hit server performance pretty hard. FRs can be disasterous from a performance perspective.
2) interim results can indeed improve performance of large/complex queries.
3) consider OPTION (RECOMPILE) or dynamic SQL to give the optimizer the best chance at creating the optimal plan for each set of calling parameters.
4) your statistics (you do have auto create/update stats on, right?) best be VERY good for such complex filtering/joining. Consider frequent FULLSCAN manual stats updates.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
September 28, 2010 at 11:50 am
scott.pletcher (9/27/2010)
I don't object to an IDENTITY column in tables per se, and they have very legitimate uses.But I dislike the idea that the default for a clus index should be an ident.
In general, Scott, I would agree with this. In this case, though, I believe it's warranted. Because the key's going to be carried through multiple subindexing, and because this table is very wide, I felt the ID field for this table was a good candidate. It helps keep down the number of problems, and the size, of sub-indexes, of which this table looks like it would need multiple ones.
Yes, if you can get away with a solid, single, generic purpose key to the table for the clustered, I'd agree with it. In this case his generic key is the date. It needs to be in nearly everything. Because he's got update churn on that, he's better off keeping that in a tight key, and using the ID to lookup back into the system.
For the interested: (from MS, so there's no concern of the source): http://msdn.microsoft.com/en-us/library/ms177484.aspx
The row locators in nonclustered index rows are either a pointer to a row or are a clustered index key for a row, as described in the following:
If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).
If the table has a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row. If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.
You're getting an ID on the table one way or another. If you'd like to avoid the the RID, you need a clustered index. If you'd like to avoid the additional mass to multiple sub-indexes, you need a tight clustered.
It all depends on the final purpose of the table. At the least with an identity clustered your RowLocator is tight and you're always doing an append to the tail of the leaf data, so it's not a bad 'normal newbie practice' and also still has a number of good reasons.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
September 28, 2010 at 12:18 pm
I see your points. But, don't overlook the fact that if you're going to create an index on the date anyway, you will churn that nonclus index instead of the clus one. Yes, that's way less overhead because the original table is so wide. That is another problematic feature here.
But you would get a big advantage of a clus index on date: now the queries can do a clus index seek, thus dramatically limiting the number of rows searched.
With date in a non clus index, SQL has to get the id (the clus index key) and do another index lookup via the clus index. Because of the extra overhead, you reduce the chance that a nonclus index will be used at all.
Unquestionally you have to consider just how much churn there is on the date column. If it's too excessive, the point is mute, and date shouldn't be a clus column.
But if the original poster's phrase "it can change at any time" means 1-2% of the records sometimes/occasionally change, it becomes a clear choice to clus on it anyway. If it's 98-99% that continually change, it's a clear choice not to cluster on it.
In between those extremes you have to review it carefully. I would prrefer to reduce the FILLFACTOR appropriately and use the date as the clus index lead column if reasonably feasible.
One could also partition the table to reduce the reorg/rebuild time. Presumably at some point older dates become stable and are no longer changed (?).
Scott Pletcher, SQL Server MVP 2008-2010
September 28, 2010 at 12:43 pm
With all this talk about the clustered index, and churn on the date field, don't forget that all of those numerous non-clustered indexes will be converted to carrying the clustered index columns in it (and in the case of the non-unique date field, a "uniqueifier" hidden "column") to be able to go to the proper row. Thus, all of this churn will also affect all of the NC indexes.
Edit: and after reading Craig's last post a little closer, I see that this is already covered in the MS quote.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
Viewing 15 posts - 1 through 15 (of 28 total)
You must be logged in to reply to this topic. Login to reply