Identity columns and primary keys in a many to many example

  • I have a question on database design.

    I have 3 tables, a Vendor table and a Product table, and a many to many mapping table.

    each with 2 columns (id is int and identity and primary key)

    id, vendor name

    and

    id, product description

    I need to make a table with the combination of the 2 , i.e. which vendor sells which products.

    I created the table with 2 columns ,

    vendor_id, product_id

    Both fields are part of the primary key, as the combination will always be unique.

    I ran a database analysis tool on this database, and it reported that I should not have a segmented primary key, and I should have an identity primary key for performance reasons, and define a unique constraint on the 2 fields if that is what i meant to do.

    I wondered why this would benefit performance ?

    Any comments ? idea's ?

  • Can't wait to see what others have to say about this one.

    I think you're alright in this case. 

    I've used the identity option in a few cases.  IIRC, it was when the NN mapping actually became a new entity that had refferences to it in one or a few more tables.  That way it made sens to have a new id so that you don't repeat the 2 ids in 5-6 other tables. but maybe my database design skills were just much weaker back then...

     

    Also one draw back is that you now have to join 2-3 tables now to be able to filter the data, instead of having it right then and there in the table.

     

    That's why I can't wait to see what others have to say about this one!

  • I disagree!   YOU DEFINATELY ARE RIGHT! 

    The tool is full of it. 

    Scroll through this post.  The example you state is actually posted as an example of something NOT TO DO!

    http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=361&messageid=385407

     

  • If the intersection table is being used in other relationships besides the M:M relationship between Vendor and Product (as in Ninja's example), then it may make sense to create an identity key field.

    One question is what is meant by "segmented primary key." Does that mean it is a composite key or that it is unclustered? If it means composite, that's not a problem and the tool is just making a stock recommendation for any composite key it sees -- which may or may not be good advice for a particular table. If it means unclustered, that points out a genuine concern but is easily fixed in your DDL.

    The trouble with automated analysis tools is that they, by necessity, must make many assumptions. They only see tables. They cannot see tables in use. If you post the DDL of the three tables and an example query or two, we can see both the tables and their use.

    However, if the intersection table only appears like this:

      select <whatever>
      from   Vender V
      (left) join Vend2Prod I
          on V.PK = I.VendorPK
      join Product P
          on P.PK = I.ProductPK
      ...

    then having a surrogate key field on the intersection table would be useless. Explain how having a surrogate key would make the query more efficient. Try writing a query that would ever use that key. I would be interested in the result.

    Ninja, you made one statement:

      Also one draw back is that you now have to join 2-3 tables
      now to be able to filter the data, instead of having it
      right then and there in the table.

    I'm sorry, but I have read this several times and can't quite get the exact nature of this drawback. Are you saying there is a way to resolve a M:M relationship without using a join?

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • I have no idea where a surrogate key would useful on a "glue" table like this.  The only way I would see this being useful is if you have an orders table where you would store the surrogate key instead of the vendor id and the product id, but then you have to use the "glue" table when you generate any reports based on product and/or vendor.

  • In my case that NN table had a key about an access object.  From that table, I was drilling down all the way from Form to the form's object's properties.  The whole drill down was like 3-4 levels deep (IIRC).

     

    My point was that if I created a new key for the NN table, I now had to join from that table all the way down to the properties table to be able to apply the filter for X object.

    If I had not replaced the PK with an identity column, I would have been able to apply the filter in any table I needed to without using joins but I would now have duplicate information (or more exactly, composite PKs of 3-4-5 columns which I was not very fond of).

    In that case I chose simplicity over "data duplication" even if it meant a few more joins.  That system was for my exclusive use and performance loss caused by the join was an absolute non-issue.  Especialy considering that there was very little data in the DB (0.25 M rows, 90 MB) and that the indexing was impeccable.

  • Stephen,

    I agree... you did it correctly and the tool is wrong.  Obviously, the tool doesn't know what the heck a cross-reference table is. 

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I tend to agree also.  Like Tomm Carr said, automated tools can only do so much.  In this case I would probably have a Primary Clustered index on Vendor_Id, Product_Id and a second index on just Product_Id just in case you are going to need to join just on the Product_Id.  For example: How many Vendors use this Product.

    Ninja, If I understand correctly you are doing a M:M:M or even a M:M:M:M.  Where you have a Vendor, Product cross.  Then you cross the result of that with say an Order table.  Correct? 

    -- Ken Fisher

  • Here's the whole tree (very roughly presented) :

    --> Sql servers

    --> Databases

    --> Access data projects

    --> Top level objects (Forms, reports, macros...)

    --> FormsObjects

    --> FormsObjectsProperties

    --> FormsObjectsPropertiesEvents

    --> FormsObjectsPropertiesCodes (I was insterting the code in 4000 char blocks here like in syscomments)

     

    They are not all MM, but you can see that if you want to get ObjectsEventsCode for Form X, you have quite a few joins to perform to be able to do it if you create a new id column for each level... which is what I had done in that case.  The other part of the tree that you are not seeing is roughly like this (servers / databases / Tables / Columns).  I was then cross-referencing the columns' usage per objectsEventsCodes.  That's why I had some objects with like 4-5 FK constraints and that led me to the choice of creating a new ID column for almost each new level.  That way I was not carrying 3-4-5 keys at the leaf levels of the tree.

     

    To tell you the truth, I'm still not too sure how I'd do it this time around.  But I can tell you that I'd probably try it the other way around, just to see the other side of the coin.

  • I made the mistake, early on in my DBA work, of having an identity primary key and a unique index on the two join columns, for several many-to-many join tables.  Caused a serious performance problem after the tables got filled with a few hundred thousand rows of data.  Altering the tables to dump the identity column and make the two join IDs the primary key, with the primary key clustered, and adding an index on the reverse lookup (primary key: ID1, ID2; index: ID2, ID1), made the joins through the table much faster.  Inserts were slightly slower, because of the clustered index (with the identity as the primary, the data was always added at the end of the file, not so with the other solution), but the overall performance increase was significant.

    Many of the selects in that database involved joins through as many as four or five many-to-many relations.  Fixing this made a noticeable difference to the users.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • This is all very good advice, but something has been left out.  Auditing and meta-data.  If this is not a concern, then so be it.  However, trying to come up with a 'good' or 'easy' auditing solution with M:M relationships that have meta-data (user changed, datetime changed, user added, datetime added, or any other specific information about the relationship) without having a surrogate key is messy.  What if you don't want to duplicate all the tables in your database for history?

    One may argue that this is not just a M:M glue table if it has meta-data attatched, it is it's own 'object'.  In that case I would say a surrogate key is unnecessary.  I don't use straight 'glue' tables.  I want to know information about the relationship.  This is a trade-off -> Information complicates my 'M:M' table -> However, now I have more information.

    Everything is a trade-off.  "If I want this - I have to give up that".  You must look at the situation you are in and design until you meet your goal (performance - elegance - extensibility - functionality - etc.).  Don't strictly follow a 'tools' advice and don't take the word of every poster you read.  Most primary key discussions end up being so fatalistic it's a wonder that anybody's database is remotely functional (if you ask the 'other' camp)!

  • Rule #1 in any rulebook: Exceptions exist!

    Primary keys: It is critically important to establish the natural key field(s) of a table. This is the foundation on which is built all the subsequent data integrity checks. This is important even if, for whatever reason, you decide to (or are ordered to by fiat) create an identity column to serve as primary key. The identity column is a surrogate key, not the real or natural key. There can be many sound reasons for the creation of a surrogate key but it does not relieve us of the effort of correctly identifying the natural key.

    The characteristics of a primary key is that it must be unique and it cannot be null. If we define the field as a primary key, this is automatically taken care of by the underlying system. When we create a surrogate key, however, we have to manually take care of this. We have to create a unique index on the natural key field and...and...AND specifically designate the field as NOT NULL. I am constantly surprised at how often one or the other but not both are done.

    Foreign keys: Define the foreign key constraint! Again, I see this far too often. It is obvious from the schema and subsequent queries that a field is a foreign key to some table, but the the constraint has not been defined. And when you define the constraint, specify what should happen should the referenced record be deleted. It may make sense to delete the child records or to set the foreign key field of the child record to NULL or a previously defined default value. It is almost never a good idea to allow the default action: ON DELETE NO ACTION.

    Intersection tables: Used to implement a many-to-many relationship between entities. An example would be Subassemblies and Parts--a subassembly may be made up of many parts and a part may be used in many subassemblies. So the intersection table contains a foreign key referring to the Subassemblies table and a foreign key referring to the Parts table. Both together make up the natural key of the intersection table.

    Does it make sense to create an identity surrogate key field for this table? Not really. What purpose would it serve? How could the field even be used in any dml? If we need to answer questions such as "What parts are needed to make subassembly S?" or "Is part P used to make subassembly S?" then we must obtain the primary key value for subassembly S (and part P for the second question) to use in joining the three table to find the answer. Any surrogate key in the intersection table plays no role in this.

    However, suppose we add a field to the intersection table. Subassembly S may consist of 10 P1 parts, 17 P2 parts and 1 P3 part. So we add the field Quantity to the intersection table. Now we know not only that P1, P2 and P3 are needed for subassembly S, but also how many. While we're at it, we add DateCreated and DateChanged field for tracking purposes.

    Now does it make sense to create a surrogate key on the intersection table? No, not really. Any particular row of the intersection table identifies a relationship between one subassembly and one part. Any information we desire of this table will be in the context of one or more subassemblies and/or one or more parts. The surrogate key is superfluous.

    Look at it this way. In order to find the surrogate key value for a particular subassembly and a particular part, we must know the primary key values for both the subassembly and the part. If we already have those values anyway, we already have all the information we need to use the intersection table to give us any information we want.

    But suppose we go ahead and query the intersection table for subassembly S, part P and get the surrogate key value for that particular relationship. Now what? We use it to go back in to get the number of parts needed and/or the date the relationship was created or any other information we keep there? But we could have gotten all that information the first time we queried that table! Moreover, if we need to perform a join to get the information we need, the surrogate key value is useless to us.

    Under what condition would we ever use the surrogate key value of an intersection table? I know of none. I am willing to admit that such conditions exist, however, and there have been several vague references to their use -- but so far no actual samples.

    Rule #2 in any rulebook: Even to Rule #1!

    <Addendum>: Ok, I can think of one way for a surrogate key field on an intersection table to have a use -- if there is yet another table that has a foreign key reference to the intersection table. I have never seen such a beast -- and can't, right off the top of my head, think of any use for this. I would be interested in knowing if someone has done this and for what purpose.</Addendum>

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply