Using Large Integer Or GUID Vs Using Small Integers For Primary Key

  • Hi all,

    Would using UserID column (GUID) or BankAccount column (9 digit integer) as the primary key column be less efficient than using an ID column of type integer which auto increments?

  • SQLUSERMAN (12/25/2016)


    Hi all,

    Would using UserID column (GUID) or BankAccount column (9 digit integer) as the primary key column be less efficient than using an ID column of type integer which auto increments?

    As always, "it depends". Will those primary keys be used as foreign keys in other tables?

    😎

    The GUID is definitely not the best choice for a clustered index as it will not be inserted in any guaranteed order, hence constantly causing fragmentation of the clustered index.

  • Would using UserID column (GUID) or BankAccount column (9 digit integer) as the primary key column be less efficient than using an ID column of type integer which auto increments?

    I would not recommend that a GUID be used unless it was part of a database system that had to routinely be merged with other databases into a master database.

    As for the bank account column, it's generally better to use the actual business keys where possible. The bank account number information might require special protection, however. If that is the case, you would not want it propagating through a series of child tables. An auto incrementing identity or a sequence number would be generally advisable. If the number does not require special protection, that would probably be the better choice.

  • SQLUSERMAN (12/25/2016)


    Hi all,

    Would using UserID column (GUID) or BankAccount column (9 digit integer) as the primary key column be less efficient than using an ID column of type integer which auto increments?

    I agree that using a GUID is the wrong thing to do. On the general question, I recommend that "right sizing" is always the way to go especially for fixed width datatypes with the possible exceptions of the MONEY and SMALLDATETIME datatypes, which I generally avoid. The reason why is because the more narrow the data, the more rows will fit on a page and the less memory, IO, and disk footprint for both the datafiles and backups you'll need.

    Indexes will suffer the same benefits because they're just special tables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just don't forget integer values run out at about 2.1 billion. We're running into that now on some primary key=integer identity columns and having to change to bigInt ( not a minor project )

  • How do you query that table, and how does it join to other tables? If it's by BankAccount, then definitely cluster by BankAccount value. If you need to encode the actual BankAccount value into a more generic int, do so, but don't use a meaningless identity value to cluster unless you have an actual reason to, i.e., never "default" a clustering key, select it for a reason.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (12/27/2016)


    How do you query that table, and how does it join to other tables? If it's by BankAccount, then definitely cluster by BankAccount value. If you need to encode the actual BankAccount value into a more generic int, do so, but don't use a meaningless identity value to cluster unless you have an actual reason to, i.e., never "default" a clustering key, select it for a reason.

    BankAccount numbers are nearly as bad as GUIDs and, I've found for my larger tables, clustering by the most commonly queried column rather than an ever increasing column can be much less effective than a nice, narrow NCI not to mention being quite a bit faster and using a lot fewer resources. Also, contrary to popular belief, BankAccount numbers can and to change more frequently than one might expect especially with the mergers we've had to suffer through on behalf of our bank customers.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've found that account numbers (of various types, not just bank accounts) have meaningful leading zeros. So they must be stored as char/varchar.

    This could lend itself to using a int/bigint identity as a key. The trick of course is to ensure the account numbers remain unique in a 1-to-1 relationship with the key. Without some constraints, there's nothing prohibiting multiple records with the same account number, but different identity keys. Ugg.

  • I've found that account numbers (of various types, not just bank accounts) have meaningful leading zeros. So they must be stored as char/varchar.

    Excellent point. However, why can't that column still be a key provided there's no particular security issue with it?

  • RonKyle (12/28/2016)


    I've found that account numbers (of various types, not just bank accounts) have meaningful leading zeros. So they must be stored as char/varchar.

    Excellent point. However, why can't that column still be a key provided there's no particular security issue with it?

    It can be used. But there could be fragmentation if new accounts are added in some patterns, rather than in (generally) increasing values.

    For example, suppose the first 3 characters of the account number reflected the branch number where the account was opened. 001, 002, 003, ... etc. As each new account was opened, a new entry would need to be wedged after the existing entries for that branch, and before the next branch.

  • It can be used. But there could be fragmentation if new accounts are added in some patterns, rather than in (generally) increasing values.

    For example, suppose the first 3 characters of the account number reflected the branch number where the account was opened. 001, 002, 003, ... etc. As each new account was opened, a new entry would need to be wedged after the existing entries for that branch, and before the next branch.

    But then as long as you understood that, you might decide having that problem is better that trying to keep IDs and account numbers in a 1:1 relationship, right? I've always resisted not using the business key unless there was a good reason not to. And while performance is important, it seems to me that this method puts performance as an absolute first. Not trying to read too much into it, but given your explanation it seems that's where you'd head in a longer explanation.

  • Jeff Moden (12/27/2016)


    ScottPletcher (12/27/2016)


    How do you query that table, and how does it join to other tables? If it's by BankAccount, then definitely cluster by BankAccount value. If you need to encode the actual BankAccount value into a more generic int, do so, but don't use a meaningless identity value to cluster unless you have an actual reason to, i.e., never "default" a clustering key, select it for a reason.

    BankAccount numbers are nearly as bad as GUIDs and, I've found for my larger tables, clustering by the most commonly queried column rather than an ever increasing column can be much less effective than a nice, narrow NCI not to mention being quite a bit faster and using a lot fewer resources. Also, contrary to popular belief, BankAccount numbers can and to change more frequently than one might expect especially with the mergers we've had to suffer through on behalf of our bank customers.

    Then you encode the actual BankAccount value into a related int, as I stated above. That value initially could be gotten from an identity column in another table or a SEQUENCE value, but it wouldn't change once assigned, even if the BankAccount value did. But it would always represent that one account.

    I'm talking about overall performance, not just the INSERT. After all, there's typically Ks, 100Ks, Ms, etc., of SELECTs per INSERT. If a given approach speeds up every SELECT -- even if it did mildly degrade the INSERT in a particular case, which often it doesn't that much anyway -- it could still easily be worth it. On very large tables, such an approach often allows merge joins rather than hash or, yikes, loop joins.

    Edit: Adjusted wording.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • RonKyle (12/28/2016)


    It can be used. But there could be fragmentation if new accounts are added in some patterns, rather than in (generally) increasing values.

    For example, suppose the first 3 characters of the account number reflected the branch number where the account was opened. 001, 002, 003, ... etc. As each new account was opened, a new entry would need to be wedged after the existing entries for that branch, and before the next branch.

    But then as long as you understood that, you might decide having that problem is better that trying to keep IDs and account numbers in a 1:1 relationship, right? I've always resisted not using the business key unless there was a good reason not to. And while performance is important, it seems to me that this method puts performance as an absolute first. Not trying to read too much into it, but given your explanation it seems that's where you'd head in a longer explanation.

    Granted, yes it's a trade-off. It depends on what needs to be accessed and how fast, and given those requirements, what do you do about the architecture to make that happen.

    You don't blindly say "Never use identities" any more than you would say "Always use identities" for keys.

  • ScottPletcher (12/28/2016)


    Then you encode the actual BankAccount value into a related int, as I stated above. That value initially could be gotten from an identity column in another table or a SEQUENCE value, but it wouldn't change once assigned, even if the BankAccount value did. But it would always represent that one account.

    Man, do I ever agree with THAT! Like I said, I have to go through hell and back because, although they did such a mapping where I work, they didn't bloody well use it.

    They also built (can't say "designed" because they put no thought into it, whatsoever) most of the big tables in a highly denormalized fashion and the CI doesn't actually perform well because it's between 100 and 140 columns wide on the big tables. As a result, both the CI (the mapped IDENTITY column) and the AK (LoanNumber, ClientID) are fairly well useless by themselves for anything having to do with performance including SELECTs and so they've necessarily got a fair number of indexes and FKs (sometimes in the dozens) on many of the big tables.

    I'm talking about overall performance, not just the INSERT. After all, there's typically Ks, 100Ks, Ms, etc., of SELECTs per INSERT. If a given approach speeds up every SELECT -- even if it did mildly degrade the INSERT in a particular case, which often it doesn't that much anyway -- it could still easily be worth it. On very large tables, such an approach often allows merge joins rather than hash or, yikes, loop joins.

    Yep, absolutely understood and agree. I'm talking about both, though. For the Expedia site, INSERTs where as high as SELECTs and sometimes higher because they logged everything (similar to what DoubleClick.Net does with their bloody "spotlight" pixels on graphics). For the company I currently work for, there's a constant flux of new data for large batch jobs and GUI inputs alike. The performance of INSERTs, UPDATEs, and DELETEs is just as critical as the SELECTs. Heh... and it's damned painful at times especially with the logging that we have to do to be compliant (banking, loans, and lender placed insurance industry). It used to take up to a minute to do a single row delete from the LOAN table because they didn't understand the concept of how to make the dozens of FK checks quick (because the table IS highly denormalized and has way too many other tables that depend on it).

    Even when the PK/CI column is the leading column in a query, it's dog slow for multi-row returns simply because of the low row count (usually just 1 or 2) per page and so it's become necessary to "duplicate" some of the data using an NCI with the same leading column as the CI as well as some includes. When it comes to the heavy batch processing (which a lot of people forget to consider, they usually just thing GUI performance), it's a vicious circle. The indexes are necessary to make the SELECTs for the batches fast but has the exact opposite effect on the INSERTs/DELETEs because those affect ALL the indexes and FKs.

    It's all a huge waste of disk space and, in many cases, memory.

    Sorry... that sounds like a rant towards you and I don't mean it that way. I'm just trying to explain that things aren't as simple as "selecting the right CI".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You don't blindly say "Never use identities" any more than you would say "Always use identities" for keys.

    I would never say either of those statements. But as some general rules, I would say to use a business key where possible, and that performance is second to data integrity unless the data integrity requirement creates unacceptable impacts (rare). There are other factors, but that's not the purpose of this thread.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply