February 10, 2010 at 8:47 am
The dreaded BIT field problem
Why wouldn't you just add a non-clustered index to the IsProcessed and IgnoreForProcessing columns? The long VARCHAR columns are not relevant. Given that you are writing this in 2010, you could also suggest a filtered index, which is probably the best solution in this case.
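To make the filtered-index suggestion concrete, here is a rough sketch. The table and column names are illustrative (the article's scenario has a large table with IsProcessed and IgnoreForProcessing BIT flags), and filtered indexes require SQL Server 2008 or later:

```sql
-- Hypothetical table name; only the rows still awaiting processing
-- are kept in the index, so it stays tiny even on a huge table.
CREATE NONCLUSTERED INDEX IX_Orders_Unprocessed
ON dbo.Orders (OrderID)
WHERE IsProcessed = 0
  AND IgnoreForProcessing = 0;
```

The polling query then gets a cheap seek on a small index, while the wide VARCHAR columns never enter the picture.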
February 10, 2010 at 8:47 am
Run DBCC FREEPROCCACHE when you want to clear all execution plans from the cache.
http://www.sqlmag.com/Article/ArticleID/47502/Using_Dynamic_Filters_in_Queries.html
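For reference, a hedged sketch of the command mentioned above. Note that clearing the whole cache is fine on a test box but forces recompilation of everything on a busy server:

```sql
-- Removes ALL plans from the plan cache; use with care outside of testing.
DBCC FREEPROCCACHE;

-- SQL Server 2008+ can instead clear a single plan, given its handle:
-- DBCC FREEPROCCACHE (plan_handle);
```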
February 10, 2010 at 8:58 am
Excellent job! Well written article. I was worried when I read the article's description in the e-mail summary: "Do your system contain a time bomb in the code?" 😀
February 10, 2010 at 8:58 am
Great article and I agree with almost all of the points made and the solutions offered. I would suggest using a nonclustered index with a WHERE clause to solve the BIT field problem. His suggested solution works (and I've used it myself) but it requires additional client-side coding and/or triggers on the base table -- both add additional load to an already stressed server. Changing the app may not be possible or within time constraints. Great article nonetheless. Thanks!
February 10, 2010 at 9:59 am
WOW David !
This should be required reading for all System Architects, Database Designers and SQL Programmers. It is a clear and succinct explanation of what being a DBA is all about. Thanks!
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
February 10, 2010 at 10:21 am
Fantastic article, thank you.
Related to your very first example, one should always think about scaling when writing code. Sometimes, especially when making prototypes, the deliberate plan is to rewrite for scaling and optimization later, but there should always be a consciousness of the need to plan for scaling.
---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
February 10, 2010 at 10:56 am
I'm adding this to my repository of excellent SQL articles. Very nice job!
February 10, 2010 at 10:59 am
I didn't use a filtered or nonclustered index on the base table for two reasons.
1. The original solution was on SQL2000.
2. The main table had incredibly heavy read activity on it.
The separate queue table allowed me to have the job quietly polling that very small table while the main read activity hit the large table doing a nice index seek in each case.
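A minimal sketch of the queue-table approach described above, with illustrative names (the original design may differ in detail):

```sql
-- Small work-queue table: rows are inserted as work arrives and
-- deleted once processed, so the table the job polls stays tiny.
CREATE TABLE dbo.ProcessingQueue
(
    RecordID INT NOT NULL PRIMARY KEY
    -- optionally: REFERENCES dbo.MainTable (RecordID)
);

-- The polling job touches the big table only through the queue,
-- producing an index seek per queued row:
SELECT TOP (100) m.*
FROM dbo.ProcessingQueue AS q
JOIN dbo.MainTable AS m
    ON m.RecordID = q.RecordID;
```

On SQL 2000, where filtered indexes do not exist, this keeps the heavy read traffic on the main table entirely undisturbed.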
I'm not sure that, given the choice, I would put bit flags in the main table in any case. Once you get into the 10s of millions of records even the storage of bit fields costs something. My career started with certain disciplines hammered into you.
1. A stitch in time saves nine.
2. Keep it simple stupid. KISS.
3. Attention to detail in all things
4. Store only what is necessary, when it is necessary.
Some of the latest thoughts on technical debt in agile development are bringing these ancient thoughts back into the forefront of developer thinking. Long may it last.
February 10, 2010 at 11:16 am
I missed the SQL 2000 limitation. Also, all too often we as DBAs need to solve these performance problems when changing the client app is not an option.
David.Poole (2/10/2010)
My career started with certain disciplines hammered into you.
Ain't that the truth! 🙂
February 10, 2010 at 12:04 pm
David.Poole (2/10/2010)
I didn't use a filtered or nonclustered index on the base table for two reasons.
1. The original solution was on SQL2000.
Given SQL 2005/2008 would you recommend the same solution, use a filtered index, or something else?
February 10, 2010 at 12:30 pm
Given SQL 2005/2008 would you recommend the same solution, use a filtered index, or something else?
I think I would still stick with the queue table for the reasons mentioned earlier.
My thoughts on filtered indexes are that they would be useful where an application needed to record many different possible values in a field but would only need to search on a few values. I'm thinking of a workflow scenario: I want to see records below a certain status, but beyond that I have no interest.
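The workflow case sketched above might look something like this (table, column, and threshold are all illustrative; SQL Server 2008+ only):

```sql
-- Many possible StatusID values exist, but queries only ever look
-- for "records below a certain status", so index just those rows.
CREATE NONCLUSTERED INDEX IX_WorkflowItem_Pending
ON dbo.WorkflowItem (StatusID, CreatedDate)
WHERE StatusID < 50;  -- hypothetical "still in progress" threshold
```

Completed items fall out of the index automatically as their status advances, which is exactly the property the queue table provides by hand on older versions.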
The key thing is that data generated in a back-end system can have a different structure from the data in the front-end application, which in turn could and should be different from the structure in a reporting environment. Indexing strategies in each environment can be very different. For that matter, referential integrity can be very different in each environment. Data in a reporting environment may not need referential integrity because the data is created somewhere else where DRI is enforced. Before anyone jumps in, yes, I do know that the query optimiser makes use of the various constraints to produce an execution plan.
I'm working on an agile project at the moment where we might use a filtered index in the short term but the longer term plan is to split the table into a queue table and a longer term table.
February 10, 2010 at 1:03 pm
If I remember correctly, SQL2000 and earlier had "issues" with BIT columns.
I also was dealing with 100+ million row tables that had many BIT columns, and the simple act of CASTing the literal to a BIT yielded incredible speed increases. For example:
SELECT *
FROM Foo
WHERE BitColumn = CAST(1 AS BIT)
February 10, 2010 at 1:05 pm
Excellent article. I have run into all of these things.
Another poor practice is using bloated datatypes. On a large database, using the smallest reasonable integer, date, and money types can save a lot of space. This translates to less reading to get the needed information. The problem is compounded when the bloated field is a primary key and used as a foreign key in a number of other tables.
A related matter is using a composite primary key, and then using that as a foreign key in multiple child tables. Using a right-sized integer field for a surrogate PK saves lots of space in the child tables.
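An illustrative sketch of the two points above (all names hypothetical): the natural key stays enforced on the parent via a unique constraint, while the child tables carry only a narrow, right-sized surrogate.

```sql
-- Parent keeps its composite natural key as a UNIQUE constraint,
-- but exposes a 4-byte surrogate for foreign keys.
CREATE TABLE dbo.Product
(
    ProductID    INT IDENTITY(1,1) PRIMARY KEY,
    SupplierCode CHAR(10)    NOT NULL,
    PartNumber   VARCHAR(30) NOT NULL,
    CONSTRAINT UQ_Product_Natural UNIQUE (SupplierCode, PartNumber)
);

-- Child rows store 4 bytes instead of the 40-byte composite key,
-- and use the smallest reasonable types elsewhere too.
CREATE TABLE dbo.OrderLine
(
    OrderLineID BIGINT IDENTITY(1,1) PRIMARY KEY,
    ProductID   INT      NOT NULL REFERENCES dbo.Product (ProductID),
    Quantity    SMALLINT NOT NULL
);
```

Multiply the per-row saving by a few child tables and a few hundred million rows, and the difference in I/O is substantial.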
I'm saving this article as a favorite. Thanks again!!
February 10, 2010 at 1:18 pm
Thanks David, I appreciate your follow up comments and great article.
February 10, 2010 at 2:54 pm
I think there's some misunderstanding about what the TOP statement does (on the part of some readers, not the author). Unless TOP is used with a PERCENT value or an ORDER BY clause, ordering is arbitrary. The first value found is the first value fetched. Much like with the EXISTS statement, the operation returns a value once the first value satisfying the condition is found (it achieves this differently, though, as EXISTS uses a constant scan). In each of the following scenarios:
1) SELECT 1 FROM TABLE WHERE [INDEXED_COLUMN] = 'QUERY_VALUE'
2) SELECT TOP 1 * FROM TABLE WHERE [INDEXED_COLUMN] = 'QUERY_VALUE'
3) SELECT TOP 1 [INDEXED_COLUMN] FROM TABLE WHERE [INDEXED_COLUMN] = 'QUERY_VALUE'
4) SELECT TOP 1 1 FROM TABLE WHERE [INDEXED_COLUMN] = 'QUERY_VALUE'
The first operation for all of these is an index seek. For item #1, there are no other operations (other than SELECT itself, obviously). For item #2, there will also be a Bookmark Lookup to identify the table row so that the other columns can be fetched and then the TOP operation, which returns the first hit. For item #3, the next operation is the TOP operation. For item #4, the reported operations are identical to those for item #3.
Operationally, item #1 is simplest, but the cost estimate of the TOP operation is 0%, and the execution plan doesn't tell the whole story. This is why you should also perform runs with STATISTICS IO and STATISTICS TIME on. As has been mentioned earlier, you will want to run DBCC FREEPROCCACHE between each run to clear the execution plan, but you will also want to run DBCC DROPCLEANBUFFERS to clear the results from the buffer cache in memory.
I should first mention that you should be careful here. If you just run the SELECT statements as above, you WILL NOT get the same results as you will if the SELECT statements are wrapped in an EXISTS statement. The reason for this should be obvious. Item #2 must return all the columns for the row found and item #1 must return a 1 for every row that matches the WHERE condition, whether this is one row or 500,000 rows. If EXISTS is used, only the existence of a value is queried, and the operation stops and returns its results. If you only run the SELECT statements by themselves, then item #2 will perform a massive number of logical reads and item #4 will perform a large number of logical reads.
Wrapped in an EXISTS statement, however, all of the above perform the same number of reads. Differences in CPU time and run-time are minute and inconsistent. Real distinctions can probably only be found by running several iterations with different tested values. As always, test these assumptions against real-world results. Remember that my examples are all comparisons of a single value against a single indexed column. If you CAST the value being compared (in order to perform calculations on it, for example), you will get different results.
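A sketch of the test harness described above (table and value names are placeholders; the DBCC commands belong on a test server only):

```sql
-- Clear the plan cache and the buffer pool between runs so each
-- measurement starts cold. Do NOT do this on a production server.
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;

-- Report logical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Compare the variants wrapped in EXISTS, as discussed:
IF EXISTS (SELECT 1 FROM dbo.SomeTable
           WHERE Indexed_Column = 'QUERY_VALUE')
    PRINT 'found';

IF EXISTS (SELECT TOP 1 * FROM dbo.SomeTable
           WHERE Indexed_Column = 'QUERY_VALUE')
    PRINT 'found';
```

Run each variant several times with different search values and compare the STATISTICS IO output rather than relying on the plan's cost percentages.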