June 19, 2016 at 2:11 pm
It has always been my opinion and understanding that the best way to build a table is to start with an integer identity column and make that the primary key and also the clustered index.
The logic, as I understand it, is that the primary key and clustered index, being fundamental to good performance, then work well with any non-clustered indexes added later. Because the clustered index key is always carried in every non-clustered index, two look-ups are required to find rows: the first against the non-clustered index to get the set of clustered index keys, and the second against the clustered index to return the rows.
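To make the pattern concrete, here is a rough sketch of what I mean (the table and column names are invented purely for illustration):

-- integer identity as the clustered primary key
CREATE TABLE dbo.Sales
(
    SalesID    int           IDENTITY(1,1) NOT NULL,
    SaleDate   date          NOT NULL,
    CustomerID int           NOT NULL,
    Amount     decimal(18,2) NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SalesID)
);

-- the clustering key (SalesID) is carried in every non-clustered index
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate ON dbo.Sales (SaleDate);

-- a query like this seeks IX_Sales_SaleDate to get the clustered keys,
-- then looks up Amount in the clustered index (the two look-ups described above)
SELECT SaleDate, Amount
FROM dbo.Sales
WHERE SaleDate = '20160601';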
Recently I have had that methodology questioned. The suggestion is that identity columns are a waste of space and that the clustered index should be as wide as it needs to be so that all rows in the table are ordered by that index. As long as the queries follow the same basic structure, no other indexes are required, which is claimed to be more efficient because any look-up is a single look-up against the clustered index. The suggestion also holds that primary keys are not essential, and therefore the clustered index does not need to be unique.
I should also point out that the institution I am working at at the moment uses SQL Server Standard Edition for both production and development. These are both data warehouse servers, so the databases are all running in the simple recovery model. The particular table that has raised this issue is very large, currently containing 1,060,167,492 rows, meaning that additional indexes create a large overhead in space and in the CPU time needed to maintain them. It would also be good to get any opinion on whether better performance could be expected by upgrading to Enterprise Edition, apart from the obvious option of partitioning that table.
I would really appreciate anyone else's opinion
June 19, 2016 at 2:45 pm
andymay-496937 (6/19/2016)
Because the clustered index key is always carried in every non-clustered index, two look-ups are required to find rows: the first against the non-clustered index to get the set of clustered index keys, and the second against the clustered index to return the rows.
Not necessarily; that's only the case when the query requires columns that are not in the nonclustered index. If the nonclustered index contains all the columns needed for the query, then the clustered index doesn't have to be touched at all.
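For example (a quick sketch against a hypothetical Sales table), an index that includes every column the query needs is 'covering', so the lookup disappears:

-- the included column means the query below is answered entirely from this index
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate_Covering
ON dbo.Sales (SaleDate)
INCLUDE (Amount);

SELECT SaleDate, Amount
FROM dbo.Sales
WHERE SaleDate = '20160601';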
The suggestion is that identity columns are a waste of space and that the clustered index should be as wide as it needs to be so that all rows in the table are ordered by that index. As long as the queries follow the same basic structure, no other indexes are required, which is claimed to be more efficient because any look-up is a single look-up against the clustered index. The suggestion also holds that primary keys are not essential, and therefore the clustered index does not need to be unique.
Few things there.
'As wide as it needs to be' maxes out at 900 bytes / 16 columns, as that's the limit for an index key.
There's nothing special about the clustered index being 'ordered'; all indexes are logically ordered by their key columns. No index enforces physical order (not that physical order matters for how queries use the index anyway).
For an index to be used as a seek, the query must filter on a left-based subset of the index key. Hence, unless ALL queries filter by a particular set of columns, you'll either need nonclustered indexes or you'll have to be content with table scans.
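A quick sketch of what 'left-based subset' means in practice (hypothetical index and columns):

CREATE NONCLUSTERED INDEX IX_Sales_Cust_Date ON dbo.Sales (CustomerID, SaleDate);

-- can seek: the filter covers the leading key column(s)
SELECT CustomerID, SaleDate FROM dbo.Sales WHERE CustomerID = 42;
SELECT CustomerID, SaleDate FROM dbo.Sales WHERE CustomerID = 42 AND SaleDate >= '20160101';

-- cannot seek on this index: SaleDate alone is not a left-based subset of the key,
-- so this either scans or needs a separate index keyed on SaleDate
SELECT CustomerID, SaleDate FROM dbo.Sales WHERE SaleDate >= '20160101';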
Primary keys are part of your database design. If you want a properly designed database with foreign keys enforcing integrity, then you need primary keys. If you have a schema-less, relationship-less set of data, may I suggest you consider a document database or a key-value store as they may suit your requirement better.
If you need to enforce uniqueness on columns other than the primary key, then you'll need nonclustered indexes, but again that's DB design not performance.
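For instance (a sketch only, with invented business-key columns), a unique constraint on a natural key alongside the surrogate primary key is just a unique nonclustered index under the covers:

ALTER TABLE dbo.Sales
ADD CONSTRAINT UQ_Sales_SourceRef UNIQUE NONCLUSTERED (SourceSystem, SourceRef);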
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 19, 2016 at 3:38 pm
Thank you for your comments, Gail. I always thought that good design was a basic requirement for good performance. Your comments suggest that good performance is not necessarily dependent on good design; if this is not the case, why do we bother with "good" design?
The other thing is that this table is largely a report source. While there are some look-ups that might be done from some column values, those columns are never likely to be used in foreign key constraints; as I say, they may be used as join keys in queries for reporting and analysis, but never for an actual foreign key constraint. This table is purely a data mart table used for data analysis/reporting.
The problems really happen when data is being added. My normal practice is to use merges to get changes into the data rather than deleting and re-populating, but merging really needs good indexing, and because this table is so large, maintaining indexes has become an enormous overhead. The normal practice here is instead to delete the previous records for the period being added and insert the new records. Again, because of the size of the table, this only really works if there are very few indexes, preferably just a clustered index that is wide enough to make it as unique as possible.
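To illustrate the delete-and-re-insert pattern I'm describing (simplified, with invented table and column names):

DECLARE @PeriodKey int = 201605;   -- the period being re-loaded

BEGIN TRANSACTION;

-- remove the previous version of the period
DELETE FROM dbo.FactSales
WHERE PeriodKey = @PeriodKey;

-- insert the refreshed rows for that period from staging
INSERT INTO dbo.FactSales (PeriodKey, CustomerKey, ProductKey, Amount)
SELECT PeriodKey, CustomerKey, ProductKey, Amount
FROM staging.FactSales
WHERE PeriodKey = @PeriodKey;

COMMIT;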
The next thing is: am I fooling myself that a table is best off having a primary key? Your comments suggest that the only reason for a primary key is to support foreign key constraints. If that is the case, then for data like this it would seem to be an unnecessary overhead?
Regards
Andy
June 20, 2016 at 5:31 am
andymay-496937 (6/19/2016)
Thank you for your comments, Gail. I always thought that good design was a basic requirement for good performance.
Yup, bad design will almost certainly result in sub-standard performance
Your comments suggest that good performance is not necessarily dependent on good design
Um, where did I say don't bother designing the DB properly?
The other thing is that this table is largely a report source. While there are some look-ups that might be done from some column values, those columns are never likely to be used in foreign key constraints
There's nothing about reporting systems that implies no foreign keys. Yes, I know people do it; that doesn't mean that 'reporting' = 'no integrity'.
The next thing is: am I fooling myself that a table is best off having a primary key? Your comments suggest that the only reason for a primary key is to support foreign key constraints.
I think you might want to re-read my comments.
Primary keys are part of your database design. If you want a properly designed database, with foreign keys enforcing integrity, then you need primary keys
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 20, 2016 at 6:52 am
andymay-496937 (6/19/2016)
I always thought that good design was a basic requirement for good performance. Your comments suggest that good performance is not necessarily dependent on good design; if this is not the case, why do we bother with "good" design?
Good design (table structure implied here) is always important to good performance, but 1) it's no guarantee if someone doesn't know how to write code properly, and 2) if someone does know how to write good code, they can frequently overcome some of the problems of bad design, although it does take extra time and effort to do so.
Worse yet, the people who initially create a bad design are frequently the same ones in 1) above. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
June 20, 2016 at 4:59 pm
Hi Gail
Thank you very much for your comments, they are very helpful. One little point: in a data warehousing situation, do you honestly believe that we're going to add referential integrity? These data marts are built by SSIS packages from sources that vary from other databases to files derived from all over the place. Referential integrity would just slow down the build of the data, and since there are no users inputting data directly, who would we inform when the integrity is broken?
While I agree that a well-designed transactional DB needs referential integrity, I question the logic of doing it in a data warehouse situation.
Regards
Andy
June 20, 2016 at 5:06 pm
Hi Jeff
I also believe that good design has its own pay-offs; in particular, down the track when something goes wrong, troubleshooting a well-designed database is much easier than troubleshooting one that is not. I think it is important to consider the differences when talking about databases designed as data marts. I know that the modern thinking is that all data marts should be star schemas and multi-dimensional cubes, but anyone who thinks that that is the only way it gets done is fooling themselves, I'm afraid.
I am a great believer in best practice, but it always must be tempered with business need
Regards
Andy
June 21, 2016 at 5:29 am
andymay-496937 (6/20/2016)
One little point: in a data warehousing situation, do you honestly believe that we're going to add referential integrity?
Given the phrasing of the question, I guess you're not going to. I hope your SSIS packages have no bugs and never fail and leave the data inconsistent.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 21, 2016 at 10:43 am
Hi Gail
So do I. Anyway, Microsoft would never give us buggy code, would they? 😉
While I applaud your belief in a purist approach to database design, I will not be adding foreign key constraints to my data marts. One of the reasons is that data sources are not always processed in the correct order for data integrity to be in place all of the time. Also, data in data marts can come from different sources, meaning that such constraints would only work against getting all the data in place, and the fact that dimension tables are quite commonly shared across multiple data sets means constraints would just make the whole process unwieldy at best.
I will continue to try to use good database design, where it is appropriate.
Regards
Andy
June 21, 2016 at 12:24 pm
Good database design is needed in data warehouse and data mart systems just as it is needed in OLTP systems.
Data Warehouses pulling data from multiple sources should cleanse and standardize (not the actual term I am looking for, but close) the data for reporting.
SSIS itself isn't what Gail was referring to regarding your ETL process leaving the data in an inconsistent state; it is the code you and your developers write in the packages.
June 21, 2016 at 12:32 pm
andymay-496937 (6/20/2016)
While I agree that a well-designed transactional DB needs referential integrity, I question the logic of doing it in a data warehouse situation.
Check this out. Be sure to read the example linked in the article as well.
http://www.scarydba.com/2015/09/09/yes-foreign-keys-help-performance/
If you have referential integrity in your transactional DB, I see no reason to avoid it in a DW.
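As a simple illustration of the kind of thing the article describes (hypothetical tables; the foreign key must be trusted, i.e. created or re-checked WITH CHECK, and the fact column NOT NULL), the optimizer can eliminate a join that the constraint already guarantees:

-- CustomerKey is NOT NULL and has a trusted FK to DimCustomer, so this inner join
-- neither adds nor removes rows; the optimizer can drop it from the plan entirely
SELECT f.Amount
FROM dbo.FactSales AS f
INNER JOIN dbo.DimCustomer AS c
    ON c.CustomerKey = f.CustomerKey;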
June 21, 2016 at 12:47 pm
andymay-496937 (6/21/2016)
So do I. Anyway, Microsoft would never give us buggy code, would they? 😉
I wasn't referring to Microsoft's code; I was referring to yours. I've seen enough databases with data integrity problems, OLTP and DW alike, to know that it happens.
You're dead set on this, so I'm not going to waste any more time trying to change your mind.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 22, 2016 at 7:40 am
One little point: in a data warehousing situation, do you honestly believe that we're going to add referential integrity? These data marts are built by SSIS packages from sources that vary from other databases to files derived from all over the place. Referential integrity would just slow down the build of the data, and since there are no users inputting data directly, who would we inform when the integrity is broken?
While I agree that a well-designed transactional DB needs referential integrity, I question the logic of doing it in a data warehouse situation.
Throwing the red flag on this comment. OLAP and OLTP design methods are not the same, but both require referential integrity. I've seen both types without it, and both end up with issues and incorrect data. In an OLAP environment you always have the unknown bucket in a dimension to assign rows to when the information isn't there. In my view most of these cases are due to developers taking shortcuts. Sometimes you have to take shortcuts, but in this case you seem to think it's normal. Whether it's "normal" or not, it's not correct.
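A sketch of what the unknown bucket looks like in practice (invented names; assumes the dimension key is not an identity so the reserved value can be inserted directly):

-- reserved 'unknown' member, key -1 by convention
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerCode, CustomerName)
VALUES (-1, 'N/A', 'Unknown');

-- during the load, any fact row whose business key can't be resolved
-- is assigned to the unknown member instead of violating RI or being silently dropped
SELECT s.DateKey,
       COALESCE(d.CustomerKey, -1) AS CustomerKey,
       s.Amount
FROM staging.FactSales AS s
LEFT JOIN dbo.DimCustomer AS d
    ON d.CustomerCode = s.CustomerCode;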
June 22, 2016 at 8:19 am
Thank you very much for your comments, they are very helpful. One little point: in a data warehousing situation, do you honestly believe that we're going to add referential integrity? These data marts are built by SSIS packages from sources that vary from other databases to files derived from all over the place. Referential integrity would just slow down the build of the data, and since there are no users inputting data directly, who would we inform when the integrity is broken?
While I agree that a well-designed transactional DB needs referential integrity, I question the logic of doing it in a data warehouse situation.
Do you have 100% control of the ETL process, the data sources, and the application(s) that will be consuming them? And do you have a 100% accurate understanding of the impact if said referential integrity is violated, and 100% confidence that said impact is acceptable to the end users?
If the answer to any of those is no, you likely want to have at least some basic constraints. If you're finding that simple things like primary keys or NOT NULL constraints (as appropriate) are slowing down your data load, something else is likely wrong as well.
June 22, 2016 at 12:23 pm
I'm with RonKyle on this one. I've built out many data marts, some small, some large. All of them have included RI constraints from the fact tables to the dimensions. Rows that cannot resolve a dimension member get lumped into the unknown member (which I put in all my dim tables in an OLAP design) and are later reported on for analysis, or they don't get loaded and go off for analysis/verification because they fell through the verifications in the ETL. So yes: foreign keys and indexes on your fact tables, and a clustered index. However, I do not create a primary key on fact tables unless I need to relate those facts to another fact table (in which case I add a surrogate key, an identity, to the fact table). The reason for no primary key is that each row in a fact table should be the (most of the time unique) combination of the dimension intersection; a fact table represents the many-to-many relationships between its dimensions.
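A rough sketch of the fact table design described above (names invented; RI back to the dimensions, a clustered index, no primary key):

CREATE TABLE dbo.FactSales
(
    DateKey     int           NOT NULL,
    CustomerKey int           NOT NULL,
    ProductKey  int           NOT NULL,
    Amount      decimal(18,2) NOT NULL,
    CONSTRAINT FK_FactSales_DimDate     FOREIGN KEY (DateKey)     REFERENCES dbo.DimDate (DateKey),
    CONSTRAINT FK_FactSales_DimCustomer FOREIGN KEY (CustomerKey) REFERENCES dbo.DimCustomer (CustomerKey),
    CONSTRAINT FK_FactSales_DimProduct  FOREIGN KEY (ProductKey)  REFERENCES dbo.DimProduct (ProductKey)
);

-- cluster on the column the loads and reports filter by, rather than on a surrogate key
CREATE CLUSTERED INDEX CIX_FactSales_DateKey ON dbo.FactSales (DateKey);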