October 9, 2008 at 6:56 am
Hello Everyone,
I was reading an article, and one of its rules for database architects was 'Every table has a Primary Key'.
My question: would you create a primary key even when a table holds only 4 rows, say, for example, the Region table in the Northwind database?
Would creating a primary key on such a small table help in any way, performance-wise?
I would really appreciate any and all thoughts on this, to clarify my understanding.
Thanks all.
October 9, 2008 at 7:08 am
A primary key has nothing to do with performance; it is about uniquely identifying a row in your table.
If you don't create a primary key on your regions table, what is there to stop someone from adding the same region more than once?
SQL Server automatically creates an index for the primary key, and it is this index which helps performance.
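For example (a minimal sketch, loosely modeled on Northwind's Region table; the constraint name and values are made up):
[code]
-- A minimal sketch, loosely modeled on Northwind's Region table.
-- The PK both enforces uniqueness and gives SQL Server an index to use.
CREATE TABLE dbo.Region
(
    RegionID          INT          NOT NULL CONSTRAINT PK_Region PRIMARY KEY,
    RegionDescription NVARCHAR(50) NOT NULL
);

INSERT INTO dbo.Region (RegionID, RegionDescription) VALUES (1, N'Eastern');

-- This second insert fails with a PK violation instead of silently
-- creating a duplicate region:
INSERT INTO dbo.Region (RegionID, RegionDescription) VALUES (1, N'Eastern');
[/code]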
October 9, 2008 at 7:10 am
Yes, even on such a small table, you should have a primary key.
Even there, it lets the query optimizer know things about the table that can make queries run faster. For example, it knows that the primary key column (or columns) is unique, so it knows it can handle sub-queries in certain ways.
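A quick illustration (just a sketch, assuming the Region and Territories tables from the Northwind sample database):
[code]
-- Because RegionID is the primary key of dbo.Region, the optimizer
-- knows the subquery returns unique values, so it can match rows
-- without first de-duplicating the subquery's results.
SELECT t.TerritoryID, t.TerritoryDescription
FROM   dbo.Territories AS t
WHERE  t.RegionID IN (SELECT r.RegionID FROM dbo.Region AS r);
[/code]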
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
October 9, 2008 at 7:11 am
The primary key in this case will not help performance, but it will not make it worse either 🙂 And it will add the benefit of enforcing that rows are unique 🙂
Regards,
Andras
October 9, 2008 at 8:53 pm
chintan kotwal (10/9/2008)
Hello Everyone,
I was reading an article, and one of its rules for database architects was 'Every table has a Primary Key'.
My question: would you create a primary key even when a table holds only 4 rows, say, for example, the Region table in the Northwind database?
Would creating a primary key on such a small table help in any way, performance-wise?
I would really appreciate any and all thoughts on this, to clarify my understanding.
Thanks all.
Heh... sounds like one of my articles. 🙂
Along with what the others said, try this... create a simple table with no primary key... insert two identical rows. Now, try to delete just one of them. 😛
Sure, it can be done using some trickery with TOP or ROWCOUNT, but you shouldn't have to go through that.
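Here's what that looks like (a throwaway sketch, just to show the problem; the table is made up):
[code]
-- A throwaway table with no PK and two identical rows:
CREATE TABLE dbo.NoPK (SomeValue INT NOT NULL);
INSERT INTO dbo.NoPK (SomeValue) VALUES (1);
INSERT INTO dbo.NoPK (SomeValue) VALUES (1);

-- A plain DELETE removes BOTH rows, because nothing tells them apart:
-- DELETE FROM dbo.NoPK WHERE SomeValue = 1;

-- The "trickery" needed to remove just one of them:
SET ROWCOUNT 1;
DELETE FROM dbo.NoPK WHERE SomeValue = 1;
SET ROWCOUNT 0;
[/code]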
There absolutely must be something on each row that uniquely identifies that one row, or it's not really a table. So, this is probably what you saw in the article, and it's true...
[font="Courier New"]--===== A table is not properly formed unless a Primary Key has been assigned[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
October 11, 2008 at 2:56 am
I love you guys too, but not every table needs to have a PK.
How about staging tables in a Data Warehouse environment?
As you should know, staging tables are nothing but work space to support ETL: you know for a fact that uniqueness is enforced on the source dataset, and you also know you are enforcing uniqueness on both FACT and DIM tables on the Data Warehouse side.
You also know - by design - you are truncating staging tables at the beginning of each ETL process, and on top of that you must do a full table scan because you are processing 100% of the staged rows.
So?... not every table has to have a PK, right? 😀
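Something like this, to give you an idea (a sketch; table and columns are made up):
[code]
-- A hypothetical staging table: a plain heap, no PK, no indexes.
-- It is truncated and fully reloaded on every ETL run, and every
-- staged row gets read anyway, so an index buys you nothing here.
CREATE TABLE dbo.stg_Sales
(
    SourceKey INT           NULL,
    SaleDate  DATETIME      NULL,
    Amount    DECIMAL(12,2) NULL
);

-- Start of each ETL run:
TRUNCATE TABLE dbo.stg_Sales;
[/code]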
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
October 11, 2008 at 8:21 am
PaulB (10/11/2008)
you know for a fact uniqueness is enforced on the source dataset
I wish all the source data that I had to work with was as clean as that...
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 11, 2008 at 10:32 am
GilaMonster (10/11/2008)
PaulB (10/11/2008)
you know for a fact uniqueness is enforced on the source dataset
I wish all the source data that I had to work with was as clean as that...
and in the same sentence I said - please let me quote myself: "...you also know you are enforcing uniqueness on both FACT and DIM tables on the Data Warehouse side" 😉
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
October 11, 2008 at 11:58 am
PaulB (10/11/2008)
I love you guys too, but not every table needs to have a PK.
How about staging tables in a Data Warehouse environment?
As you should know, staging tables are nothing but work space to support ETL: you know for a fact that uniqueness is enforced on the source dataset, and you also know you are enforcing uniqueness on both FACT and DIM tables on the Data Warehouse side.
You also know - by design - you are truncating staging tables at the beginning of each ETL process, and on top of that you must do a full table scan because you are processing 100% of the staged rows.
So?... not every table has to have a PK, right? 😀
I'll add an IDENTITY column as the PK even on staging tables. Being stuck in the world of SQL Server 2000 means I don't have slick tools like ROW_NUMBER to help me delete duplicates and do other "checks". And, yes, I've found that it's a bit faster to delete or mark the dupes than to avoid them on the import.
However... I agree that it doesn't need to be a PK to do this. I'm just lazy as hell and I find it easier to mark an IDENTITY column as a PK than to create an index. 😀
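For example, something like this (a sketch; the table and columns are made up):
[code]
-- A sketch of the SQL 2000-style dupe removal: the IDENTITY column
-- gives otherwise-identical rows a unique handle to delete by.
CREATE TABLE dbo.Staging
(
    RowID      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ClientKey  INT          NOT NULL,
    ClientName VARCHAR(100) NULL
);

-- Keep the first copy of each ClientKey, delete the rest:
DELETE s
FROM   dbo.Staging AS s
WHERE  EXISTS (SELECT 1
               FROM   dbo.Staging AS d
               WHERE  d.ClientKey = s.ClientKey
               AND    d.RowID     < s.RowID);
[/code]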
--Jeff Moden
Change is inevitable... Change for the better is not.
October 11, 2008 at 12:16 pm
Jeff Moden (10/11/2008)
I'll add an IDENTITY column as the PK even on staging tables. Being stuck in the world of SQL Server 2000 means I don't have slick tools like ROW_NUMBER to help me delete duplicates and do other "checks".
Jeff, by definition you do not delete on staging tables:
E - you truncate the table and load it from the source...
T - you do whatever transformation or massaging is needed...
L - you attempt to populate FACT and DIM tables, reporting failures.
A delete on a staging table is like saying "I don't care about this row coming from the source application"; if that actually happens, either the Data Warehouse is not a reflection of the business or the design is flawed.
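In skeleton form, the cycle looks something like this (a rough sketch; all the table and server names are hypothetical):
[code]
-- E: truncate staging and reload it from the source
TRUNCATE TABLE dbo.stg_Sales;
INSERT INTO dbo.stg_Sales (SourceKey, SaleDate, Amount)
SELECT SourceKey, SaleDate, Amount
FROM   SourceServer.SourceDB.dbo.Sales;   -- hypothetical linked server

-- T: whatever transformation or massaging is needed
UPDATE dbo.stg_Sales
SET    Amount = 0
WHERE  Amount IS NULL;

-- L: attempt to populate the FACT table; rows that fail the lookup
-- against the dimension are what you report back
INSERT INTO dbo.FactSales (DateKey, Amount)
SELECT d.DateKey, s.Amount
FROM   dbo.stg_Sales AS s
       INNER JOIN dbo.DimDate AS d
           ON d.CalendarDate = s.SaleDate;
[/code]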
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
October 11, 2008 at 12:33 pm
PaulB (10/11/2008)
Jeff, by definition you do not delete on staging tables;
By what definition?
A delete on a staging table is like saying "I don't care about this row coming from the source application"; if that actually happens, either the Data Warehouse is not a reflection of the business or the design is flawed.
So what do you do if you have duplicate rows coming from your source system? Insert straight into the warehouse, where you're enforcing uniqueness?
What happens if you know (because business has told you) that some of the data in the source is garbage?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 11, 2008 at 12:37 pm
Gail -- you are not reading my posts; let me quote myself once again 🙂
L - you attempt to populate FACT and DIM tables, reporting failures.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
October 11, 2008 at 12:55 pm
I'm reading your posts just fine, thank you.
I don't see why I would want to fail the entire load if I know there are dupes in the source data that, for whatever reason, cannot be removed from the source. Or are you loading 1 row at a time and just failing the single row?
Either I will load a distinct set into the staging tables, or I will delete dupes from staging, or I will load a distinct set into the warehouse.
I've had a situation where my source system had 1 client's info repeated, identically, 76 times, and the people who ran that system would not clean up their data. If I just loaded and reported failures, then every single load would fail.
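The first option, for instance, is as simple as this (a sketch with made-up names; the source is hypothetical):
[code]
-- Load only a distinct set into staging, so the 76 identical
-- copies collapse into a single row on the way in.
INSERT INTO dbo.Staging (ClientKey, ClientName)
SELECT DISTINCT ClientKey, ClientName
FROM   SourceServer.SourceDB.dbo.Clients;   -- hypothetical source
[/code]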
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 11, 2008 at 1:06 pm
GilaMonster (10/11/2008)
I don't see why I would want to fail the entire load if I know there are dupes in the source data that, for whatever reason, cannot be removed from the source.
1- You do not "fail the entire load"; that's just crazy. You reject the failed row(s) and report them back to whoever you have to report them to.
2- You do not keep loading the same data every single day; after the initial load you load what happened since your previous ETL process, so you wouldn't attempt to load your 76 instances of the same client on a daily basis.
3- Your system knows about duplicates in the "L" phase of your ETL, when you attempt to move data from STG to FACT/DIM.
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
October 11, 2008 at 1:14 pm
PaulB (10/11/2008)
1- You do not "fail the entire load"; that's just crazy. You reject the failed row(s) and report them back to whoever you have to report them to.
Then we have different ways of doing things. I prefer to validate fully (including recording errors) in staging and then not even attempt to load rows that I know will fail. My load from staging to destination can then be a single insert/update, and I don't need to process a row at a time to avoid a single failure rolling back the entire load.
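Roughly like this (a sketch; the ErrorCode column and table names are my own illustration):
[code]
-- Validate in staging first: flag rows that would fail the load
-- (duplicates here; RowID is an IDENTITY column on the staging table)
UPDATE s
SET    ErrorCode = 'DUP'
FROM   dbo.Staging AS s
WHERE  EXISTS (SELECT 1
               FROM   dbo.Staging AS d
               WHERE  d.ClientKey = s.ClientKey
               AND    d.RowID     < s.RowID);

-- Record the errors for reporting...
INSERT INTO dbo.LoadErrors (RowID, ErrorCode)
SELECT RowID, ErrorCode
FROM   dbo.Staging
WHERE  ErrorCode IS NOT NULL;

-- ...then load the destination in one set-based statement: no
-- row-at-a-time processing, no failure to roll the whole load back.
INSERT INTO dbo.DimClient (ClientKey, ClientName)
SELECT ClientKey, ClientName
FROM   dbo.Staging
WHERE  ErrorCode IS NULL;
[/code]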
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability