No PKs on fact tables

  • jcelko212 32090 - Thursday, April 13, 2017 10:42 AM

    Jeff Moden - Thursday, April 13, 2017 6:47 AM

    jcelko212 32090 - Monday, April 3, 2017 12:55 PM

     what would you recommend the key for a "Personnel" table be for something like a National Bank?

    having done something like this in the financial industry years ago, I found that going into the accounting department and getting the license numbers for the personnel was the best approach.

    License numbers? Expand on that thought, please.

  • Lynn Pettis - Thursday, April 13, 2017 12:01 PM

    jcelko212 32090 - Thursday, April 13, 2017 10:42 AM

    Jeff Moden - Thursday, April 13, 2017 6:47 AM

    jcelko212 32090 - Monday, April 3, 2017 12:55 PM

     what would you recommend the key for a "Personnel" table be for something like a National Bank?

    having done something like this in the financial industry years ago, I found that going into the accounting department and getting the license numbers for the personnel was the best approach.

    License numbers? Expand on that thought, please.

    Some financial institutions only use licensed personnel; those licenses are controlled externally by a licensing agency such as the SEC or, in some cases, state agencies. It's illegal to hire them if they don't have one.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Thursday, April 13, 2017 12:36 PM

    Lynn Pettis - Thursday, April 13, 2017 12:01 PM

    jcelko212 32090 - Thursday, April 13, 2017 10:42 AM

    Jeff Moden - Thursday, April 13, 2017 6:47 AM

    jcelko212 32090 - Monday, April 3, 2017 12:55 PM

     what would you recommend the key for a "Personnel" table be for something like a National Bank?

    having done something like this in the financial industry years ago, I found that going into the accounting department and getting the license numbers for the personnel was the best approach.

    License numbers? Expand on that thought, please.

    Some financial institutions only use licensed personnel; those licenses are controlled externally by a licensing agency such as the SEC or, in some cases, state agencies. It's illegal to hire them if they don't have one.

    Okay, that will work for those required to have a license, but what about those who don't need a license to work for those organizations?

  • I can give you a scenario where I have no PK. I like being the black sheep.

    I have had large tables where 15+ columns are needed to ensure row-level uniqueness and updates rarely happen. The ETL fully controls the row-level uniqueness of the table and has yet to fail. Could primary keys still help? Sure, but that would also be saying the ETL could fail too. I have yet to see that happen in my case, especially when the only way to get data into the table is through the ETL.

    If you like being the black sheep, any comment I may make may not even matter.  I also have large tables that can't be uniquely described by any combination of columns other than the single one that forms the operational database key.  I have seen, too many times, tables, both OLTP and OLAP, without the proper constraints eventually develop errors.  The former could also be changed only through ETL.  If you have truly developed a system that is error free, congratulations.  But it may be that there is an issue you just don't know about.  I am certain there are no issues with my tables because they have been rigorously constructed with the constraints necessary to preserve referential integrity.  I worked with an architect who didn't use keys for his tables.  My first clue that things were wrong was that the production cube couldn't be processed without disabling the error checks.  He said I'd get used to it.  I never did.  Eventually I found issues with the data.  He told me something was wrong with the ETL.  I told him something was wrong with his design.

    The rules exist for a reason.  I find that most people who don't follow them didn't first consider them and then, for a good reason, decide not to follow them.  They just didn't want to have to go through the rigor of following them.

  • jcelko212 32090 - Thursday, April 13, 2017 12:36 PM

    Lynn Pettis - Thursday, April 13, 2017 12:01 PM

    jcelko212 32090 - Thursday, April 13, 2017 10:42 AM

    Jeff Moden - Thursday, April 13, 2017 6:47 AM

    jcelko212 32090 - Monday, April 3, 2017 12:55 PM

     what would you recommend the key for a "Personnel" table be for something like a National Bank?

    having done something like this in the financial industry years ago, I found that going into the accounting department and getting the license numbers for the personnel was the best approach.

    License numbers? Expand on that thought, please.

    Some financial institutions only use licensed personnel; those licenses are controlled externally by a licensing agency such as the SEC or, in some cases, state agencies. It's illegal to hire them if they don't have one.

    Good to know.  How about in a McDonald's franchise or a Walmart store where no such license nor any external unique identifier is required?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • RonKyle - Thursday, April 13, 2017 6:53 PM

    I can give you a scenario where I have no PK. I like being the black sheep.

    I have had large tables where 15+ columns are needed to ensure row-level uniqueness and updates rarely happen. The ETL fully controls the row-level uniqueness of the table and has yet to fail. Could primary keys still help? Sure, but that would also be saying the ETL could fail too. I have yet to see that happen in my case, especially when the only way to get data into the table is through the ETL.

    If you like being the black sheep, any comment I may make may not even matter.  I also have large tables that can't be uniquely described by any combination of columns other than the single one that forms the operational database key.  I have seen, too many times, tables, both OLTP and OLAP, without the proper constraints eventually develop errors.  The former could also be changed only through ETL.  If you have truly developed a system that is error free, congratulations.  But it may be that there is an issue you just don't know about.  I am certain there are no issues with my tables because they have been rigorously constructed with the constraints necessary to preserve referential integrity.  I worked with an architect who didn't use keys for his tables.  My first clue that things were wrong was that the production cube couldn't be processed without disabling the error checks.  He said I'd get used to it.  I never did.  Eventually I found issues with the data.  He told me something was wrong with the ETL.  I told him something was wrong with his design.

    The rules exist for a reason.  I find that most people who don't follow them didn't first consider them and then, for a good reason, decide not to follow them.  They just didn't want to have to go through the rigor of following them.

    Feel free to explain why a simple script, something like a SELECT ... WHERE NOT EXISTS or even a MERGE, would totally fail its checks and still insert records WHERE they DO EXIST, in batch mode only, especially when it is preceded by code that actually removes duplicates in staging before the data goes into the primary table. I personally have yet to see it fail, but that does not mean I'm opposed to avoiding it and trying something different. (A rough sketch of the pattern I mean is at the end of this post.)

    And I would disagree that people who build systems to ensure row-level uniqueness by some means other than defining a traditional PK are people unwilling to go through the rigor of following the rules. If anything, they are going through more rigor than someone simply creating a PK. It's just as you said: they are subject to errors because they are relying on a human to think through the system rather than just using the machine to do it for you.

    I wouldn't say that every system, or even my system, is error free, but something as simple as ensuring row-level uniqueness is not exactly rocket science for an ETL system. I view PK constraints as a way to ensure row-level uniqueness when you have more than one system touching the table. For example, multiple applications written by multiple coders, with lots of hands, concurrency, and an OLTP environment.
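    For reference, here is a rough sketch of the staging-to-primary load pattern I'm describing. The table and column names are made up purely for illustration, not taken from the actual system:


    -- Hypothetical tables for illustration; the natural key is (StoreID, ProductID, SaleDate).
    -- Note: deliberately no PK on the fact table, matching the scenario being discussed;
    -- uniqueness is enforced by the load pattern below.
    CREATE TABLE dbo.FactSales
    (
        StoreID   INT  NOT NULL,
        ProductID INT  NOT NULL,
        SaleDate  DATE NOT NULL,
        Quantity  INT  NOT NULL
    );

    CREATE TABLE dbo.StageSales
    (
        StoreID   INT       NOT NULL,
        ProductID INT       NOT NULL,
        SaleDate  DATE      NOT NULL,
        Quantity  INT       NOT NULL,
        LoadTime  DATETIME2 NOT NULL
    );

    -- Step 1: de-duplicate staging on the natural key, keeping the latest row per key.
    WITH Dedup AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY StoreID, ProductID, SaleDate
                                  ORDER BY LoadTime DESC) AS rn
        FROM dbo.StageSales
    )
    DELETE FROM Dedup WHERE rn > 1;

    -- Step 2: insert only rows whose natural key is not already in the fact table.
    INSERT INTO dbo.FactSales (StoreID, ProductID, SaleDate, Quantity)
    SELECT s.StoreID, s.ProductID, s.SaleDate, s.Quantity
    FROM dbo.StageSales AS s
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.FactSales AS f
        WHERE f.StoreID   = s.StoreID
          AND f.ProductID = s.ProductID
          AND f.SaleDate  = s.SaleDate
    );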

  • Regarding this scenario, where a natural primary key would be composed of more than a handful of columns: in the past I've approached this by implementing a computed column on the HASHBYTES function and then creating a non-clustered primary key or unique index on that. I don't do this often; actually, it was the only occasion (a reporting data mart, not an OLTP database). I remember it required a bit of tweaking. We were receiving data feeds from about 100 independent sources, so it was a struggle to discover a column combination that was consistently unique.

    It was several years and a couple of jobs back, so I don't have the actual code available for reference, but for what it's worth, this quick mock-up is a close enough approximation:


    CREATE TABLE MyFatDemographicTable
    (
        PointOfSaleID INT NOT NULL,
        VisitDate DATE NOT NULL,
        FirstName VARCHAR(20) NOT NULL,
        LastName VARCHAR(20) NOT NULL,
        BirthDate DATE NOT NULL,
        ZipCode INT NOT NULL,
        col6 INT NOT NULL,
        col7 INT NOT NULL,
        adnauseam99 INT NOT NULL,
        adnauseam100 INT NOT NULL,

        -- MD5 hash (128 bits / 16 bytes) over the candidate-key columns.
        -- Note: concatenating variable-length values with no delimiter can, in rare
        -- cases, produce the same bytes for two different rows ('ab'+'c' vs. 'a'+'bc').
        HashID AS HASHBYTES('MD5'
         , CAST(PointOfSaleID AS VARBINARY(99))
         + CAST(VisitDate AS VARBINARY(99))
         + CAST(FirstName AS VARBINARY(99))
         + CAST(LastName AS VARBINARY(99))
         + CAST(BirthDate AS VARBINARY(99))
         + CAST(ZipCode AS VARBINARY(99)) ) PERSISTED NOT NULL,

        CONSTRAINT PK_MyFatDemographicTable
            PRIMARY KEY NONCLUSTERED ( HashID )
            WITH (FILLFACTOR = 80)
    );

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Just my thoughts on this.  You can, indeed, do whatever you want in your own shops.

    "Rigorous" doesn't mean "Difficult". 😉  If you want to guarantee uniqueness in a table, a unique index in one form or another is the only way to go. Although Microsoft has figured a way to make each row unique without you declaring it, it's a bit expensive behind the scenes.  Triggers and supposedly carefully crafted rigorous front end code can't guarantee uniqueness in the table because someone else can write code not so carefully crafted or can disable an enforcement trigger to overcome the uniqueness violation errors they're getting on the table directly.

    Some folks state that having such indexes to enforce uniqueness and referential integrity slows all code down.  I've found that doing such checks in code usually takes longer and still doesn't offer any guarantee.
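    Just to illustrate the point, here's a quick sketch with a made-up fact table (not anyone's actual schema). The unique constraint, not the load code, is what rejects the duplicate, no matter which code path does the insert:


    -- Hypothetical fact table with a surrogate clustered PK plus a unique
    -- constraint on the natural key.
    CREATE TABLE dbo.SalesFact
    (
        SalesSK   BIGINT IDENTITY(1,1) NOT NULL,
        StoreID   INT    NOT NULL,
        ProductID INT    NOT NULL,
        SaleDate  DATE   NOT NULL,
        Quantity  INT    NOT NULL,

        CONSTRAINT PK_SalesFact PRIMARY KEY CLUSTERED (SalesSK),
        -- This is what actually guarantees row-level uniqueness on the natural key;
        -- disabling a trigger or writing a careless script doesn't get around it.
        CONSTRAINT UQ_SalesFact_NaturalKey UNIQUE (StoreID, ProductID, SaleDate)
    );

    -- The second row duplicates the natural key, so the statement fails with a
    -- duplicate key error (2627) before any bad data lands in the table.
    INSERT INTO dbo.SalesFact (StoreID, ProductID, SaleDate, Quantity)
    VALUES (1, 100, '20170413', 5),
           (1, 100, '20170413', 7);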

    Yep... you could also disable an index, unique or otherwise.  That's a good part of the reason I don't give Developers the privs to promote their own code even in a staging environment, never mind a production environment.  Seemingly contrary to that, I do give the Developers "sysadmin" privs on the Development box and make sure there's plenty of real or obfuscated data to work with so that they can be innovative.  Of course, I've also explained that there are certain things that they must not do and why, like backups, restores, adding or dropping users/logins, and certain other things.  I've also limited what the Dev box and SQL Server logins can see and do externally.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yeah, it's a bit difficult for my scenario though. I'm dealing with billions of rows. Triggers and other checks are removed to improve performance. Pretty similar to the recent article about performance over quality. When you remove those checks, the chances of quality degrading increase, especially if the systems for quality are in human hands. But I do think certain areas can be substituted easily enough as long as they are considered. As I said before, I haven't run into any issues with row-level uniqueness with the ETL fully controlling a single table that no other users or systems have write access to.

    My only worry in the past has been the issues many DBAs have found with the use of MERGE.

  • Feel free to explain why a simple script, something like a SELECT ... WHERE NOT EXISTS or even a MERGE, would totally fail its checks and still insert records WHERE they DO EXIST, in batch mode only, especially when it is preceded by code that actually removes duplicates in staging before the data goes into the primary table. I personally have yet to see it fail ...

    I don't have to explain that.  I have seen the results.  Even in one small Access database that I had designed, somehow the join between the order and order detail tables was removed prior to its release to a remote office.  When I visited that office months later and noticed the join was missing, I tried to add it.  I couldn't, as there were two child records without a parent.  That was a tightly written desktop application.  The users could not have accessed the tables except through the interfaces.  There was not even an ability to delete an order.  So in theory, this should not have been a realistic possibility.  But it happened.  I have never examined a single database without RI constraints and not found multiple issues.  As I will never examine yours, maybe it would be the first exception.

    I'm working on a new ETL system, and it's occurred to me that without RI constraints I would have to be checking during testing more than I am now.  As it is, I don't have to check for duplicates and orphans, because the RI constraints will fire error messages.  Which they have.  I will stick with the proper methods.  I have to answer for the accuracy of the data.
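    For what it's worth, here is a minimal sketch (with made-up order tables, not the actual schema) of the kind of RI constraint in question. The orphaned detail row is rejected by the engine up front instead of having to be hunted down during testing:


    CREATE TABLE dbo.Orders
    (
        OrderID   INT  NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
        OrderDate DATE NOT NULL
    );

    CREATE TABLE dbo.OrderDetails
    (
        OrderDetailID INT NOT NULL CONSTRAINT PK_OrderDetails PRIMARY KEY,
        OrderID       INT NOT NULL
            CONSTRAINT FK_OrderDetails_Orders REFERENCES dbo.Orders (OrderID),
        ProductID     INT NOT NULL,
        Quantity      INT NOT NULL
    );

    INSERT INTO dbo.Orders (OrderID, OrderDate) VALUES (1, '20170417');

    -- Succeeds: the parent order exists.
    INSERT INTO dbo.OrderDetails (OrderDetailID, OrderID, ProductID, Quantity)
    VALUES (10, 1, 500, 2);

    -- Fails with a foreign key violation (error 547): there is no order 2,
    -- so the orphan can never get into the table in the first place.
    INSERT INTO dbo.OrderDetails (OrderDetailID, OrderID, ProductID, Quantity)
    VALUES (11, 2, 500, 2);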

  • RonKyle - Monday, April 17, 2017 10:12 AM

    Feel free to explain why a simple script, something like a SELECT ... WHERE NOT EXISTS or even a MERGE, would totally fail its checks and still insert records WHERE they DO EXIST, in batch mode only, especially when it is preceded by code that actually removes duplicates in staging before the data goes into the primary table. I personally have yet to see it fail ...

    I don't have to explain that.  I have seen the results.  Even in one small Access database that I had designed, somehow the join between the order and order detail tables was removed prior to its release to a remote office.  When I visited that office months later and noticed the join was missing, I tried to add it.  I couldn't, as there were two child records without a parent.  That was a tightly written desktop application.  The users could not have accessed the tables except through the interfaces.  There was not even an ability to delete an order.  So in theory, this should not have been a realistic possibility.  But it happened.  I have never examined a single database without RI constraints and not found multiple issues.  As I will never examine yours, maybe it would be the first exception.

    I'm working on a new ETL system, and it's occurred to me that without RI constraints I would have to be checking during testing more than I am now.  As it is, I don't have to check for duplicates and orphans, because the RI constraints will fire error messages.  Which they have.  I will stick with the proper methods.  I have to answer for the accuracy of the data.

    Sorry, but obviously you don't have good control over what is happening in the source systems. I can't imagine code mysteriously disappearing between environments unless there is not full control of the source. I mean, that's ideally a good use case for those constraints, especially if, as a DBA, you have no control over what code is using your tables. But that's not my case. Code is not mysteriously going to disappear, and I'm surely not going to have tables missing records that I cannot fully explain, bug or not. Good developers can explain bugs. You technically design for them, just not intentionally.

    And using cases where you clearly can't explain what is happening, especially when source code is mysteriously changing, is not a good enough example to prove why you shouldn't do something.

  • As I said earlier, since, as you say, you like being the black sheep, there's likely no explanation I can give.  I suspect if I could examine your system and find something wrong, your answer would be similar to the architect I talked about earlier in the post.  You would just say something is wrong with the ETL, but not the design.  Hopefully your system is as error free as you claim.  Good luck.

  • xsevensinzx - Friday, April 14, 2017 7:43 AM

    My only worry in the past has been the issues many DBAs have found with the use of MERGE.

    Me too.  I still don't use it in SQL Server.  It doesn't seem to buy anything, and considering the problems of the past, I still have a difficult time trusting it.  Maybe when 2020 (got the pun?) gets here. 😉

    Also, just curious... are you using any form of partitioning for your billions on billions tables?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, April 17, 2017 3:44 PM

    xsevensinzx - Friday, April 14, 2017 7:43 AM

    My only worry in the past has been the issues many DBAs have found with the use of MERGE.

    Me too.  I still don't use it in SQL Server.  It doesn't seem to buy anything, and considering the problems of the past, I still have a difficult time trusting it.  Maybe when 2020 (got the pun?) gets here. 😉

    Also, just curious... are you using any form of partitioning for your billions on billions tables?

    Nope. I'm actually still on Standard edition, which as you know does not support this unless you use poor man's partitioned views, but you need PKs for that. Believe it or not, Standard edition works extremely well even with larger datasets if you set it up properly and have proper data warehousing practices. This is specifically why I fell in love with SQL Server: you actually don't always need Enterprise edition. The only issue I'm facing is that you have memory limitations with Standard edition, and when you get into really large datasets, you need loads of memory if you are doing extremely large reads. This is why I am considering migrating to more suitable solutions like Azure Data Warehouse, where we can take advantage of columnstores and more across a proper MPP solution versus the SMP one we are in.
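    For anyone curious, here's a rough sketch of that "poor man's" approach with made-up yearly tables (not the actual schema). Each member table carries a CHECK constraint on the partitioning column, and for the UNION ALL view to be updatable the partitioning column has to be part of each table's primary key, which is why PKs come back into the picture:


    CREATE TABLE dbo.Sales2016
    (
        SaleDate  DATE NOT NULL CHECK (SaleDate >= '20160101' AND SaleDate < '20170101'),
        StoreID   INT  NOT NULL,
        ProductID INT  NOT NULL,
        Quantity  INT  NOT NULL,
        CONSTRAINT PK_Sales2016 PRIMARY KEY (SaleDate, StoreID, ProductID)
    );

    CREATE TABLE dbo.Sales2017
    (
        SaleDate  DATE NOT NULL CHECK (SaleDate >= '20170101' AND SaleDate < '20180101'),
        StoreID   INT  NOT NULL,
        ProductID INT  NOT NULL,
        Quantity  INT  NOT NULL,
        CONSTRAINT PK_Sales2017 PRIMARY KEY (SaleDate, StoreID, ProductID)
    );
    GO

    -- The partitioned view: queries that filter on SaleDate let the optimizer
    -- skip member tables whose CHECK constraints rule them out.
    CREATE VIEW dbo.SalesAllYears
    AS
    SELECT SaleDate, StoreID, ProductID, Quantity FROM dbo.Sales2016
    UNION ALL
    SELECT SaleDate, StoreID, ProductID, Quantity FROM dbo.Sales2017;
    GO

    SELECT SUM(Quantity)
    FROM dbo.SalesAllYears
    WHERE SaleDate >= '20170101' AND SaleDate < '20170201';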

  • RonKyle - Monday, April 17, 2017 11:42 AM

    As I said earlier, since, as you say, you like being the black sheep, there's likely no explanation I can give.  I suspect if I could examine your system and find something wrong, your answer would be similar to the architect I talked about earlier in the post.  You would just say something is wrong with the ETL, but not the design.  Hopefully your system is as error free as you claim.  Good luck.

    Technically speaking, all systems will likely have bugs if you really search for them long enough. However, what you speak of has nothing to do with the problem you face. While yes, adding those constraints helps protect your tables from bad data, it doesn't exactly address the problem of why you have code missing between environments or why certain things you can't explain mysteriously happen. It only allows you to apply a band-aid to the situation to help protect you from said problem; it does not fix the problem itself. For that reason, it's not really comparing apples to apples, because by that same logic, who's to say your constraints don't ALSO get removed between environments? If you have no control over one aspect, you likely don't have control over everything.
