June 28, 2015 at 9:29 pm
Comments posted to this topic are about the item Indexing Foreign Keys
June 29, 2015 at 3:59 am
Thorough, well structured and objective.
Nicely done!
June 29, 2015 at 5:59 am
It's interesting to note that the index usage pattern can change drastically in a case-by-case basis.
But I was also impressed with the usage of CTE's in this article. Great way to see it in use for different situations.
June 29, 2015 at 6:34 am
Interesting article though I wonder about a few things:
1. Using a recursive CTE to populate your test table is probably not the best idea. At least, I wouldn't want any readers to think that it's a good idea. Using a tally table or equivalent is the way to go here.
2. Did you check the query plan cache between tests? I wonder if SQL is reusing an existing plan when you thought it should build a new one.
3. Sometimes, a table scan really is the best way to go. It depends on the size of the table in pages vs the size of the index in pages with possible row lookups. Given that padding the rows as in your example caused the plan to change makes me think that SQL really did choose table scan because it was the best plan. It would be interesting to see the sizes of the table and index in pages.
Gerald Britton, Pluralsight courses
June 29, 2015 at 7:55 am
The largest tables in a database tend to be user "transactional" data. My rule is generally never allow deletes on these tables. So, I never create indexes on FK columns for these tables. An index could be created temporarily if doing archiving or mass deleting.
The proper terms in relational databases are not Parent and Child tables but Referenced and Referencing. You'll notice Microsoft does follow these proper terms in system tables and documentation.
June 29, 2015 at 10:44 am
mauriciorpp (6/29/2015)
It's interesting to note that the index usage pattern can change drastically in a case-by-case basis.
It makes it all that much more confusing to me.
June 30, 2015 at 1:14 am
Regarding "Parent" and "Child" definition, you are right; though I have used these definitions because they are often used in articles, blogs and videos (regarding FKs) like the one of Kendra Little I have suggested in the introduction. Thanks for your notes.
June 30, 2015 at 5:21 am
This article triggers a modeling thoughts in me.
Lets categorize some FK relationships first:
1. Collections such as order-lines in an order, or class members in a class.
Objectively we can say these are relationships where child's have equal lifespan as the parent.
2. References to locations or other static entities.
Objectively we can say these are relationships where child's have less or equal lifespan as the parent.
3. Contained references to entities that are closely related (point to a primary contact that is found in a collection of contacts).
Objectively we can say these are relationships in which a child is not part of a collection, but does reference one that is related! Lifespans are less clear here but they are typically correlated!
Collections
I usually model these typical parent-child relations in a way that the child's first PK field is actually the FK field to the parent. In which case there is never need for a separate index specifically for the FK. I do this to get a leaner indexed model that still does the job. The reasoning behind this is that properly knowing the ID of a child record implies knowing its context (the parent record ID).
The selectivity of the 1st PK field (used for statistics), might be a bit lower when there exist many childs per parent, but for most applications this will not be the case. And if one only knows the Parent ID, this modelling makes finding the child records is the least IO intensive as bookmark look-ups are avoided and the data is in order. For mutations, maintaining only the clustered index instead of also an additional FK index is better performing as well reducing possible lock issues.
References
For pure references it can be or cannot be beneficial to have index on a FK. If I want to get a list of all warehouses that are located in England, then obviously an index here is beneficial. But most likely I will not need this location context to do something with a specific warehouse and thus is will not become part of the primary key!
I will never delete a country just like that and have the delete cascade, so this case also will not demand in index on a foreign key.
Contained references
Assume we have a warehouse and related to that we have various collections for this warehouse (orders, contacts, etc.). These collections are likely to be modified and maybe even deleted.
Lets see how we deal with this for orders and contacts:
Lets assume orders are collections themselves here. Orders are unlikely be deleted, but merely marked as such by means of a status change! As such I would argue here the warehouse ID does not need to be part of the order PK, but an index on the warehouse FK is desired. It allows for a quick overview of all orders per warehouse and does not "burden" order-lines with the warehouse PK. And it is highly unlikely we ever need to ask the server to deliver all order-lines per warehouse!
Contacts is likely a list per warehouse and as such the warehouse is the "parent" in a collection. I would certainly have the warehouse ID as the first field in the contact PK (see collections). Now if we want to designate a primary contact for a warehouse and we know it will always be just one, it makes sense to do this on the warehouse level. This will form a reference into a child collection.
If we model the child collection PK as <Warehouse PK>+<Contact PK> then the reference into this child collection is automatically indexed trough the clustered index on warehouse!
My conclusion
By thoughtful modelling the need for indexes on foreign keys can be vastly reduced! In these examples there are just two. And with respect to collections just one because of a chain of collections (warehouses containing orders, containing order-lines). The longer a chain and/or the more such chains, the more indexes on foreign keys are required. But the majority can be cut off at the knees in the modeling phase (for the "leaf" collections).
Also consider the impact of indexes, that can be avoided!
1. They take up resources (CPU and storage)
2. They can cause locking issues when changes are applied at busy times.
3. They make the job of the query optimizer harder as more options have to be evaluated.
Indexes are wonderful, but be very selective when applying them!
June 30, 2015 at 11:01 pm
Mr or Mrs. 500,
first of all thanks for your considerations.
1. You are right; I used a recursive CTE in order to keep the code related to the article more compact (no script for tally table)
2. During all tests I have used the "OPTION (RECOMPILE)"; doing all the steps again without this option, I saw that Sql Server always recompiles the query, except after the script 11 (correctly)
September 16, 2016 at 6:52 am
Hi,
I have a suggestion regarding script 9...
"Script 9
Execute the test-query and filter for any customer you like. The index will not be used, no matter what your choice. It seems that Filtered Indexes do not help in this situation."
The filtered index isn't going to be selected unless you use exactly the same predicate in the test-query...
eg.
SELECT o.OrderId, o.CustomerId, o.OrderDate, c.ContactName
FROM Orders o
INNER JOIN Customers c ON o.CustomerId = c.CustomerId
WHERE c.ContactName = 'Cust n° 1'
AND c.CustomerId < 10 -- this predicate allows the query optimizer to consider the filtered index
AND c.CustomerId = 1
OPTION (RECOMPILE)
Secondly, I got a little lost half way through the article as we jumped into some scripts that increased the width of a column. I joined the dots back up further on. But I can't help feeling that this wasn't a great way to demonstrate the query optimizer evaluating IO costs when deciding to Scan or Seek. Increasing the number of rows is more closely aligned with real world production data growth, instead of increasing the width of columns over iterations.
But let me applaud you for writing the article and sharing your work.
September 16, 2016 at 9:40 am
I guess I am a bit 'perplexed' by this article. It was good. It was well written, and it did provide some useful information, but in the end the conclusion said (in my mind) if you create an index on a FK column(s), it might be used and it might not. It might be beneficial (less reads) if it is used and it might not be (more reads). And if it is not beneficial, wait a couple of days until more data is added to the table and then it might be.
peter-757102's comments were enligthening, and provided some insight that I am not sure the article did. If you use a warehouse code in your detail table (e.g. code 12 = Midwest) and you want to run a query on warehouse = 'Midwest' an index on the FK column warehouse_code is likely beneficial.
Other than that I am having an issue with resolving the conclusion to a real world application of the article other than, in the end, the conclusion is "it depends" and even then it can change. An index that is beneficial today might not be tomorrow. Which basically is going to be job security for the DBA that is monitoring performance stats.
September 17, 2016 at 7:19 am
The terms "parent" and "child" belong to network databases and not to SQL. We have "referenced table" and "referencing table"; a reference can be a self reference to the table or to other tables in the schema. The old terms like "link", "parent", "child" refer to pointer chains. SQL does not specify how something is done only what the results are
Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
September 19, 2016 at 6:40 am
Non-unique indexes are never "required" because they don't constrain nor define schemata. Also, I don't think DBAs want job security through babysitting indexing needs; we have more important things to do. Each release seems to add more insight into performance monitoring and we are at the point where indexing maintenance could become automated, dropping unused indexes and creating new ones based on insert and delete ratios and table sizes over time. The non-unique indexing can be left to the Production DBAs, especially for foreign keys since they're mostly used on cascade deletes of the referenced table.
It all depends on response time required, number of users impacted, and ease of adding more RAID or RAM. It just seems like some DBAs get addicted to indexing and are penny wise and pound foolish with their own time versus buying more hardware.
September 19, 2016 at 7:27 am
I've read this article before and each time I learn something new.
September 23, 2016 at 11:15 pm
Given things like your [Customers] and [Orders] tables you would have to have a unique index on [Customers].[CustomerID] in order to set up the FK from the [Orders] table. I would have set up an index on [Orders].[CustomerID] pretty much reflexively. Finding the customer from an order is one thing. Yes this does the referential integrity and is required. However the index on the [Orders] table I would do since that speeds finding all orders that belong to a selected customer.
I avoid "never", "always", and other imperatives like that. Turning on either UPDATE or DELETE cascading is something you would have to force me to do.
ATBCharles Kincaid
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply