February 18, 2011 at 6:29 am
I have a data warehouse that I inherited. When I inherited this data warehouse it had no indexes to speak of on 90% of the tables in the DB. As I worked on rectifying this I got some of the queries the users used on the DB and looked at the data relationships to best-guess where the indexes should be on the tables. I was allowed to redesign the tables and reload all of the data on a new server, from scratch.
The data came in text files from a mainframe that were in a format of one line has all parent and child records. There were varying multiple occurrences of child records. The parent record section had a field which told me the number of occurrences for a given child record and I parsed out the line, for X number of bytes to get each record. All child records went into separate tables with a key field to the parent table. In some cases there was a natural key, but in most cases I used a surrogate key.
I put a clustered index on everything, using what should be the key field between all of the tables in a given data set, whether it was a natural key or surrogate key. I put additional indexes on the tables according to my best guess from available usage data. The performance improvement was dramatic. Now that it has been in production a while, I am looking at the index usage statistics and SQL Server's recommendations for indexes to verify my guessing and improve the design, if possible.
What I have found is that SQL Server recommends some indexes that are a combination of certain fields plus all the rest of the fields in that table to be included, with the Include clause. Curiously, the surrogate key for a given table may not be included in this index, but it IS the 'Includes' section of said index. I realize that I should pay attention to this since it is using real stats of how the data is being used, but if I see and index recommendation like this, does that signify a need to change my clustered index to be this index of a combination of fields, less the includes?
I'm also trying to decide if I should be dropping some of my existing indexes, if they have low usage, to put in place the indexes SQL Server suggests. I need to watch how much space I am allocating to indexes, as, for instance, with a 22G table I am using 10G in indexes already, and if I were to create one of the indexes it suggests, with every field included, that index would match the size of the table.
Thanks,
Chris
Learning something new on every visit to SSC. Hoping to pass it on to someone else.
February 18, 2011 at 6:40 am
The Indexes that are recommended are not always right. This is something you will need to test. What I would suggest for you is to look for a blog by Glenn Berry where he talks about his 5 super DMV queries. One of them will show Bad Indexes, (Writes > Reads) and also identify your bottlenecks with other queries.
-Roy
February 18, 2011 at 8:16 am
I'll go Roy one step further and suggest that you have to approach the index hints with extreme caution, especially if it's adding a large list of INCLUDE columns.
It sounds like you're taking the right approach by checking the tables and adding a cluster as a starting point. A piece of advice, don't worry so much about putting the cluser on the key, but instead putting the cluster on the most frequently access path to the data. That is probably the key on most tables, but on some it might be a foreign key or even a different column or columns.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
February 18, 2011 at 11:40 am
The way I approached this was to look at the key between the various tables, which would be the field everyone joins on, thinking that would be the best field to have a clustered index on. I have 12 different data sets in this warehouse. When I got the project the data had no Identity fields, indexes, or what I call standard fields like date stamp and load ID (which load did this record come in on, for troubleshooting). Every now and then, a load is identified as having bad data after it has been loaded to Prod and having the load ID on each record allows us to cull that data from the DW more easily.
The first thing I did was add the "standard" fields to every table, like those mentioned above, then evaluate the key fields the data would join on. The parent table's key field, in a few cases, is the "internal control number" in the data from the source (I abandoned this once I found out that this number could be repeated in the data and management wants ALL data to be loaded, regardless of repetitions). I then began using a surrogate key, which gives me integrity over what may be poor data, once it is loaded into the DW.
I am not using primary keys or strict relationships, but I AM using unique clustered indexes on surrogate keyed tables and clustered indexes on the other parent keys. The parent key field is replicated to all of the child tables, as well as the other "standard" fields to maintain data integrity and in most cases, the child table has the clustered index on the foreign key field relating to back to the parent table. ID field for each table have a unique non-clustered index in them, to enforce data integrity.
I put other indexes on various fields in both parent and child tables, depending on the join or criteria fields used by the queries I got from the users. Some indexes have one field, some have multiples and includes as well. In some cases queries I needed to run against the DB allowed me to use the execution plan to suggest helpful indexes.
My capacity in this position is ETL developer, not DBA, so I do not have the run of the DB server like when I was a DBA, so I am trying to get permissions to get to these DMVs so I can do in-depth analysis. For the moment I have to submit my queries to the DBA to run for me, as his schedule allows. If I cannot get rights it will just take a bit longer to get done, no big deal.
As I said, my impression of SQL Server's suggestions for missing indexes, is that maybe, considering how the users are utilizing the data, a 5-field clustered index, in lieu of the 5 field non-clustered, with all other fields included, would be a better design than my current clustered index on the join field. I really need to plan this out correctly since it may mean rebuilding a clustered index on a table with 100-200M rows.
Thanks,
Chris
Learning something new on every visit to SSC. Hoping to pass it on to someone else.
February 18, 2011 at 4:23 pm
Please allow me to start one step before the indexing strategy.
What DWH model are you following? is it a Star schema or is it a Relational model?
For basic indexing, in general...
...for a star schema is pretty much a given to have one index on each FACT-DIM FK relationship plus PKs on FACT and DIM tables, usually relying on surrogate keys for DIM tables.
...for a relational schema indexing strategy usually follows navigation that becomes apparent in ER model.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.February 21, 2011 at 6:28 am
Relational Model.
Thanks,
Chris
Learning something new on every visit to SSC. Hoping to pass it on to someone else.
February 21, 2011 at 5:52 pm
Stamey (2/21/2011)
Relational Model.
Interesting choice for a new Data Warehouse - nothing wrong about it, just interesting.
In this particular case basic indexing should include...
1- Unique indexes on all PK constraints on entities defined during ER modeling.
2- Non-unique indexes serving all relationships defined during ER modeling.
As a second level of indexing I would ask each department served by the data warehouse to submit their five more important queries or reports; analyze predicates and add indexes not included during basic indexing.
Hope this helps.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.February 22, 2011 at 8:49 am
I have already done the bets guess indexing, of fields like you suggested and fields that are used for joins or criteria on some of the queries I have been able to look at. What surprised me was that SQL Server suggested some indexes that consist of multiple fields with ALL of the other fields in the table in the Include clause.
I take this to mean that SQL server thinks it can do a more efficient job of returned results if it has an index organized in this fashion. If that is the case, would it be wise to change my clustered index to look like the fields SQL Server suggested for this new index, and don't worry about the included fields, since 1. they cannot be a part of a clustered index, and 2, because the included fields will be in the same place in the table as they would be if I built this index as non-clustered.
What it seems to me is that it would make more sense, if I can trust SQL Server's recommendation, to change the clustered index instead of essentially duplicating my entire table for the sake of the index SQL Server suggests.
What I am trying to gather here, from people more knowledgeable about SQL Server indexing and how it suggests indexing, is a reality check. Is what I am thinking about this sensible?
Here is an example of what it suggests, for a table called InstMain. The FK field for all tables is called "ClaimID":
CREATE NONCLUSTERED INDEX IX_150 ON [FL_MPI].[dbo].[InstMAIN]
([ClaimStatusCode], [ClaimCodeType], [RecordStatusID], [AmountPaid], [FirstServiceDate])
INCLUDE ([MainID], [RecipientID], [ClaimID], [FormerClaimID], [NumDetails],
[ClaimStatusTypeCode], [ProviderPatientID], [BillTypeCode], [AdmissionDate], [AdmitHour],
[AdmitTypecode], [PatientStatusCode], [RecipientPayAmount], [DateBilled], [AmountBilled],
[LastServiceDate], [ProviderID], [AltProviderID], [AltProviderID2], [ProviderSigned], [ClerkID],
[ReimbursementAmount], [DispropShareAmount], [StmntPeriodDays], [PotentialElecBiller],
[CertCode], [OverheadAmount], [ProvScopeCode], [ProvTypeCode], [ProvCountyCode],
[AdmissionCode], [MedRecNum], [ClaimFrequency], [OtherInsuranceDenied],
[DaysNotCovered], [FinalizeDateFirstRun], [FinalizeDateLastRun],
[EncounterInterchangeStatus], [IsEncounter], [ProviderBillingOverride],
[ProviderFacilityOverride], [OtherProviderOverride], [ProviderMedicaidID], [FacilityMedicaidID],
[OtherProvMedicaidID], [DetailCount], [OccurrenceCount], [INPCount], [DiagXCount],
[ValueCount], [ConditionCount], [HeaderCount], [ICD9Count], [PayerCount],
[CrossOverCount], [NHCount], [TreatmentCount], [JobKey], [InsertDateTime]) WITH
PAD_INDEX, FILLFACTOR = 90
Here is what's on that table now:
CREATE CLUSTERED INDEX [IX_InstMAIN_NumICN] ON [dbo].[InstMAIN]
(
[ClaimID] ASC
)
CREATE NONCLUSTERED INDEX [nIdx_Composite] ON [dbo].[InstMAIN]
(
[ClaimID] ASC,
FormerClaimID Asc,
[FirstServiceDate] ASC,
[LastServiceDate] ASC,
[ProviderMedicaidID] ASC,
ProviderID Asc
)
CREATE NONCLUSTERED INDEX [nIdx_JobKey] ON [dbo].[InstMAIN]
(
ClaimID Asc, [JobKey] ASC
)
CREATE NONCLUSTERED INDEX [nIdx_RECORDSTATUSID] ON [dbo].[InstMAIN]
(
[RecordStatusID] ASC
)
Create Unique Index idxMainID on InstMain (MainID Asc)
Create Index idxMainInsertDate On InstMain (InsertDateTime Asc)
CREATE NONCLUSTERED INDEX [idxInstMain1stServDate]
ON [dbo].[InstMAIN] ([FirstServiceDate])
INCLUDE ([ClaimID])
SQL Server does suggest other indexes on this table, but those are of less concern because they won't double the size of the space used by this data with one index.
Thanks,
Chris
Learning something new on every visit to SSC. Hoping to pass it on to someone else.
February 22, 2011 at 8:55 am
And in regards to the model being used, we don't have much choice here. This is used for an auditing application and management needs all data to remain in the tables, even duplicates, as a record of how we received the data. All we are doing is splitting the massive flat files we receive into a relational model to make it easier to mine data from it and so it takes up less space.
Chris
Learning something new on every visit to SSC. Hoping to pass it on to someone else.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply