How To Approach Adding A Primary Key To An Existing Table

  • In our vendor-created/managed DB we have a table that is like an audit table the vendor uses for tracking down issues. It's something you can turn on or off via the application that uses the DB.  The table has NO Primary Key. It does have a date-stamp-like column, so we've used it as the way to determine order of operations. The problem is it's not consistent. When I make a change it stamps it with my time, my time zone.  We know that it uses other times because the last X rows in the table, based on this time stamp, have times that have yet to come, that are 1 or more hours in the future.  Trying to get the software vendor to explain this has proven difficult because they see it as something they use and not really something for clients. We can use it because we can access the DB to query it. We can even modify some things like adding our own custom tables, views, and UDFs, and in some cases, depending on the table, add triggers.

    What I'd like to try to do is add a PK to this thing so we can get the data from it in the actual order it occurred and not the order the time stamp implies. My concern is that it's an existing table with a LOT-O-ROWS, and because it captures every Update, Insert and Edit command it's highly volatile.

    Thoughts? Ideas? Suggestions? Maybe even a prophecy or 2?

     

    Kindest Regards,

    Just say No to Facebook!
  • If the table is just a heap (you said there's no PK, but that doesn't necessarily mean there's no clustered index), as you know, you'll have to rewrite the whole table to add any kind of clustered index (PK or not).  In that case, you'd likely want to use the create-a-new-table-and-switch-over-to-it approach.  For that, you'd need the disk space and some way to tell new rows in that table from old ones, for rows added while you're copying the bulk of the old rows to the new table.
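
    As a first check -- just a sketch, and dbo.YourAuditTable is a placeholder since we don't have the real table name yet -- sys.indexes will tell you whether it's a heap or already has a clustered index:

    SELECT i.index_id, i.type_desc, i.name
    FROM sys.indexes AS i
    WHERE i.object_id = OBJECT_ID(N'dbo.YourAuditTable');
    /* index_id 0 / type_desc 'HEAP' = heap; index_id 1 = a clustered index already exists */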

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Well, there are actually numerous ways this can be approached, but most of them are going to depend greatly upon the layout of the current table as well as the data contained within it.

    Still, addressing this specific case without taking everything else into account, I think I hear the following:

    1)  We use a date-time stamp that changes based on the last insert or update as the key for the record -- bad practice.

    2) We want to add a PK (sequential number) to the records to track the order of execution -- good practice

    3) The date-time stamp sometimes has unexplainable future dates -- this needs to be resolved as to how you plan to sequence these. Personally I would sequence them as they come in regardless of the date-time stamp since you have already determined it is not a reliable piece of data.

    4) Now, assuming you want to track the executions -- aka the series of date-time stamps for inserted and updated records -- you have to have a means to uniquely identify a single series of records.  For instance, if Record 1 comes in at 1:00pm and then is updated at 2:00pm, the date-time stamp changes but the sequence number does not -- that is, unless you assign it a new sequence number that catalogs that update (which seems like what one would do), but then you need a way to reference the parent record.

    A secondary solution might be to create a secondary table that tracks the changes with date-time stamps and denotes what was changed -- aka a change log. Then only on inserts does a new record get a sequential identifier and any updates to that record are recorded within the change log table.
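
    Very roughly, the shape of that would be something like the sketch below -- all names are placeholders, and the actual columns would depend on what the vendor's audit table stores:

    CREATE TABLE dbo.AuditRecord (
        iRecordId  bigint IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_AuditRecord PRIMARY KEY,
        dFirstSeen datetime2(3) NOT NULL
            CONSTRAINT DF_AuditRecord_dFirstSeen DEFAULT SYSDATETIME()
        /* ...columns that identify the logical record... */
    );

    CREATE TABLE dbo.AuditRecordChange (
        iChangeId  bigint IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_AuditRecordChange PRIMARY KEY,
        iRecordId  bigint NOT NULL
            CONSTRAINT FK_AuditRecordChange_AuditRecord REFERENCES dbo.AuditRecord (iRecordId),
        dChanged   datetime2(3) NOT NULL
            CONSTRAINT DF_AuditRecordChange_dChanged DEFAULT SYSDATETIME()
        /* ...columns that describe what changed... */
    );

    Inserts get a row in AuditRecord (the sequential identifier); every subsequent update only adds a row to AuditRecordChange that references the parent.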

    If you have any more information to better outline the issue, then perhaps that might be helpful in helping us help you. Otherwise I am currently available to take on assignments 😉 if you would like me to dive into this more deeply.

  • Scott/Dennis - Thanks for such quick replies. I didn't provide the DDL for the table originally, so it's listed below.

    CREATE TABLE [dbo].[TRACK](
    [hRecord] [NUMERIC](18, 0) NULL,
    [dDateIn] [DATETIME] NULL,
    [sUserName] [VARCHAR](120) NULL,
    [sTableName] [VARCHAR](60) NULL,
    [sActionType] [VARCHAR](6) NULL,
    [sSqlLine] [VARCHAR](2000) NULL
    ) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [I_Track_1] ON [dbo].[TRACK]
    (
    [sUserName] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [I_Track_2] ON [dbo].[TRACK]
    (
    [dDateIn] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
    GO

    Kindest Regards,

    Just say No to Facebook!
  • Okay, you did not answer the question I posed about whether or not the date-time stamp gets changed based on an Insert versus an Update -- that was an assumption I was making. Knowing that helps greatly in understanding the problem.

    So yes, you will definitely need to add a column to the current table as Scott outlines, although I have done this without creating a new table; it greatly depends on the size of the table and the nature of the data -- so again, on parameters that we are not aware of.

    Also, what is hRecord? This seems like it should be a unique identifier for each record, perhaps combined with the date-time stamp. Everything else within it looks like a typical change-log table.
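
    A quick check along these lines (just a sketch) would show whether hRecord is ever populated and whether it repeats:

    SELECT COUNT_BIG(*)                AS total_rows,
           COUNT_BIG(hRecord)          AS non_null_hRecord,
           COUNT_BIG(DISTINCT hRecord) AS distinct_hRecord
    FROM dbo.TRACK;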

  • I don't know if the time changes based on the type of change.  I have no idea why some items are future dated.  I do know from limited testing that when I do something within the application, regardless of the type of change, it gets recorded with my actual time, my time zone.  Since there are future-dated items in it, we can only speculate that this is true for all DML commands.  If, however, I make a change via SSMS, that doesn't get recorded.  The capturing of changes is being managed by the software app itself, so any changes done outside the app aren't captured.  I know this is a BAD design, but I didn't make it; I'm just stuck with it.  When we asked the vendor how the time stamp works, they said it is done based on the client's time zone, but clearly that's not completely true since there are future-dated items.

    As for the hRecord, I too initially thought that would be a key until I looked at it. This column is NULL in every row. My guess is it was something planned to serve as a key but for whatever reason it didn't get implemented.

    In terms of the amount of data, it varies. As of right now there are 94,986,439 rows, and the row with the oldest time stamp is 45 days old.  I can tell you it's not always 45 days. At one point there was 3 months' worth. We have asked the vendor how this works and what determines how much data is retained, and we were told 3 months, but this is clearly not the case.  One time when I checked there was close to 6 months' worth, but that was in the first few months that this thing was enabled.  So we can't make any assumptions about how much data it will always retain.
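
    For anyone wanting to check the volume and retention themselves, something along these lines works (sketch only):

    SELECT MIN(dDateIn) AS oldest_row, MAX(dDateIn) AS newest_row, COUNT_BIG(*) AS total_rows
    FROM dbo.TRACK;

    SELECT CAST(dDateIn AS date) AS track_date, COUNT_BIG(*) AS rows_that_day
    FROM dbo.TRACK
    GROUP BY CAST(dDateIn AS date)
    ORDER BY track_date;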

    Apologies if I missed something, but I'm trying to respond to the posts while also in a virtual class.

    Kindest Regards,

    Just say No to Facebook!
  • It's nice that there's already an index on dDateIn. I would suggest something like this:

    --Setup

    (A1) Create a new table, with a usable clustered index. Add data compression to the table (clus index) if it:

    (1) is available, (2) is allowed, and (3) would save significant space (it usually does!).

    Use the proc sys.sp_estimate_data_compression_savings to see how much space would be saved, e.g. EXEC sys.sp_estimate_data_compression_savings 'dbo', 'TRACK', NULL, NULL, 'PAGE';

    --Process

    (B1) Capture the process start datetime (@max_date_to_copy).

    <Loop>

    (B2) Copy rows from the existing table into the new table, based on dDateIn:

    (1) Copy TOP (some number), where (some number) is chosen so that it is as high as reasonably possible while still allowing SQL to:

    (A) Use the dDateIn index to look up the rows (check the query plan); and

    (B) Run the INSERTs reasonably quickly.

    For example, something roughly like below (uncomment the sequence code if you want a unique key value):

    /*CREATE SEQUENCE dbo.TRACK_sequence AS bigint START WITH 1 INCREMENT BY 1 NO CYCLE;*/
    GO
    CREATE TABLE [dbo].[TRACK_NEW](
    [hRecord] [NUMERIC](18, 0) NULL,
    [dDateIn] [DATETIME] NULL,
    /*[iSequence] bigint NOT NULL CONSTRAINT TRACK__DF_iSequence DEFAULT NEXT VALUE FOR dbo.TRACK_sequence,*/
    [sUserName] [VARCHAR](120) NULL,
    [sTableName] [VARCHAR](60) NULL,
    [sActionType] [VARCHAR](6) NULL,
    [sSqlLine] [VARCHAR](2000) NULL
    ) ON [PRIMARY];

    CREATE /*UNIQUE*/ CLUSTERED INDEX [TRACK__CL] ON dbo.TRACK_NEW
    ( dDateIn /*, iSequence */) WITH ( DATA_COMPRESSION=PAGE, DROP_EXISTING=OFF, ONLINE=OFF, SORT_IN_TEMPDB=ON, FILLFACTOR=98 ) ON [PRIMARY];

    GO

    DECLARE @dDateIn_start datetime;
    DECLARE @dDateIn_end datetime;
    DECLARE @max_date_to_copy datetime;
    DECLARE @rows_at_a_time int;

    /* change any/all of these values as needed (for example, to restart for a given datetime) */
    /* !!For best performance, check the query plan to verify that SQL does use the nonclus index to */
    /* do the INSERT INTO dbo.TRACK_NEW based on this value of @rows_at_a_time!! */
    SET @rows_at_a_time = 100000;
    SET @max_date_to_copy = DATEADD(MS, -3, CAST(CAST(GETDATE() AS date) AS datetime));
    SET @dDateIn_start = '19000101';

    WHILE 1 = 1
    BEGIN
    IF @dDateIn_start >= @max_date_to_copy
    BREAK;

    /* based on dDateIn order, find the dDateIn value @rows_at_a_time values ahead of the current value; */
    /* SQL can use the nonclus index to do this, so it should be very fast */
    SELECT TOP (@rows_at_a_time) @dDateIn_end = dDateIn
    FROM dbo.TRACK
    WHERE dDateIn >= @dDateIn_start
    ORDER BY dDateIn;
    IF @@ROWCOUNT = 0
    BREAK;

    IF @dDateIn_end > @max_date_to_copy
    SET @dDateIn_end = @max_date_to_copy;

    /* Since there is no unique key, the number of rows copied may be somewhat more than the specified @rows_at_a_time */
    /* !!For best performance, check the query plan to verify that SQL does use the nonclus index to */
    /* do the INSERT INTO dbo.TRACK_NEW based on this value of @rows_at_a_time!! */
    INSERT INTO dbo.TRACK_NEW /*(...)*/
    SELECT * /*..., ...,*/FROM dbo.TRACK
    WHERE dDateIn >= @dDateIn_start AND dDateIn <= @dDateIn_end
    ORDER BY dDateIn;

    SET @dDateIn_start = DATEADD(MS, 3, @dDateIn_end);

    /*WAITFOR DELAY '00:00:01'*/
    END /*WHILE*/
    /* at this point, all rows before today should be copied */
    /* now, copy rows from today from time 00:00:00.003 to, say, five minutes before current time */
    /* then, create the other nonclus index: */
    CREATE NONCLUSTERED INDEX [TRACK__IX_sUserName] ON [dbo].[TRACK_NEW]
    (
    [sUserName]
    ) WITH ( DATA_COMPRESSION=PAGE, DROP_EXISTING=OFF, ONLINE=OFF, SORT_IN_TEMPDB=ON, FILLFACTOR=90 ) ON [PRIMARY];
    GO

    /* finally, copy all rows from five minutes ago to end of table, then rename the tables */
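
    /* a rough sketch of that final catch-up copy; since there is no unique key, rows that  */
    /* share the boundary dDateIn value need care (>= can duplicate them, > can miss late   */
    /* arrivals stamped with that same value), so verify the boundary rows afterwards:      */
    DECLARE @last_copied_dDateIn datetime;
    SELECT @last_copied_dDateIn = MAX(dDateIn) FROM dbo.TRACK_NEW;

    INSERT INTO dbo.TRACK_NEW /* list columns explicitly if the optional iSequence column was added */
    SELECT *
    FROM dbo.TRACK
    WHERE dDateIn > @last_copied_dDateIn
    ORDER BY dDateIn;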

    EXEC sys.sp_rename 'dbo.TRACK', 'TRACK_ORIGINAL'

    EXEC sys.sp_rename 'dbo.TRACK_NEW', 'TRACK'

    /*Verify no rows are missing from the last few minutes, then, whenever you are comfortable with the change,

    drop table dbo.TRACK_ORIGINAL.*/

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • My advice is actually going to be the opposite of a lot of others here - I wouldn't add any columns to the table or change the structure. The reason being that if the application that reads and writes to that table is expecting things to exist in a specific way, adding a column may result in the app misbehaving. For example, if the application is doing a SELECT * on the table and expected 10 columns back and you add a column, the app may not handle that scenario.

    If you really want to have a table for this, what I would do is make a new table and have a trigger on the current tracking table that copies the data from INSERTED into your new table. It doubles the disk space used by that auditing table, but it reduces the risk that you are going to break the app.
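
    A minimal sketch of that trigger idea, using the DDL posted above and hypothetical names (dbo.TRACK_COPY, dbo.trg_TRACK_copy) -- whether the vendor tolerates a trigger on their table is a separate question:

    CREATE TABLE dbo.TRACK_COPY(
        iSeq        bigint IDENTITY(1,1) NOT NULL CONSTRAINT PK_TRACK_COPY PRIMARY KEY CLUSTERED,
        hRecord     numeric(18, 0) NULL,
        dDateIn     datetime NULL,
        sUserName   varchar(120) NULL,
        sTableName  varchar(60) NULL,
        sActionType varchar(6) NULL,
        sSqlLine    varchar(2000) NULL
    );
    GO
    CREATE TRIGGER dbo.trg_TRACK_copy
    ON dbo.TRACK
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        /* copy every new audit row into our own table, which has a real sequential key */
        INSERT INTO dbo.TRACK_COPY (hRecord, dDateIn, sUserName, sTableName, sActionType, sSqlLine)
        SELECT hRecord, dDateIn, sUserName, sTableName, sActionType, sSqlLine
        FROM inserted;
    END;
    GO

    The IDENTITY column then gives you the actual arrival order, independent of whatever dDateIn says.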

    My advice with ANY 3rd party vendor tool is to not touch any of their base tables. If you need to add/remove anything, do it in a new table and have a trigger on it. If you want to keep the disk space usage lower (ie not duplicate the whole table), calculate a hash on the row and use that hash as a map between the base table and your custom one which stores additional details.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • My main advice would also be not to modify the table structure unless you really need to (I deliberately put the code that changes the structure in comments, in case you wanted to use it).

    However, I would change the date index to clustered and I would compress the table.  Those should be transparent to their app: the compression certainly will be.
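
    Something roughly like this is what I mean -- a sketch only; it rebuilds the entire heap, so check that your edition supports page compression and pick a quiet window first:

    CREATE CLUSTERED INDEX [I_Track_2] ON dbo.TRACK
    ( dDateIn ASC )
    WITH ( DROP_EXISTING = ON, DATA_COMPRESSION = PAGE, SORT_IN_TEMPDB = ON, FILLFACTOR = 98 )
    ON [PRIMARY];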

     

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I agree with Brian... check with the 3rd party before you mess with one of their tables.  While this is just a simple "audit" table that won't even stand up in a court of law, you don't want to void a warranty or any possible future help or prevent future updates, etc.

    Once you find out if you can use and populate the column, and considering what the table is used for and its current size, I'd set it up for automatic monthly table partitioning, possibly with yearly (or monthly) filegroups so that you can set the older months to READ ONLY to help relieve backups and a couple of other things.
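
    A bare-bones sketch of the partitioning objects (the names and boundary dates are examples only; the real setup would map filegroups and add new monthly boundaries automatically):

    CREATE PARTITION FUNCTION pf_TrackMonthly (datetime)
    AS RANGE RIGHT FOR VALUES ('20240101', '20240201', '20240301');

    CREATE PARTITION SCHEME ps_TrackMonthly
    AS PARTITION pf_TrackMonthly ALL TO ([PRIMARY]);

    /* the clustered index (or the new copy of the table) would then be created */
    /* ON ps_TrackMonthly (dDateIn) instead of ON [PRIMARY] */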

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • How to Approach Adding a Primary Key to an Existing Table

    In relational databases, a primary key is a crucial element that uniquely identifies each record in a table. It ensures data integrity by preventing duplicate records and facilitating efficient data retrieval. However, there may be situations where a primary key was not initially defined for a table, or the need for one arises due to changing data requirements. Adding a primary key to an existing table is a task that requires careful planning and consideration. This article outlines a step-by-step approach to safely and effectively implement a primary key in an existing table.

    1. Assess the Current Table Structure and Data

    Before adding a primary key, it is essential to thoroughly understand the current structure and contents of the table. This includes:

    Reviewing the Data: Examine the table's data to ensure there are no duplicate rows or NULL values, as these can cause issues when creating a primary key.

    Identifying Potential Key Columns: Determine which columns can serve as a unique identifier. The ideal primary key column should uniquely identify each row and be as short and stable as possible.

    Data Analysis: Check for consistency and cleanliness in the potential key column(s). If the table lacks a clear candidate for a primary key, consider creating a new column specifically for this purpose.
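
    For example, generic checks along these lines (table_name and column_name are placeholders) will surface duplicate and NULL values in a candidate key column:

    -- candidate key values that repeat
    SELECT column_name, COUNT(*) AS duplicate_count
    FROM table_name
    GROUP BY column_name
    HAVING COUNT(*) > 1;

    -- rows where the candidate key is NULL
    SELECT COUNT(*) AS null_count
    FROM table_name
    WHERE column_name IS NULL;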

    2. Prepare the Table for the Primary Key

    Once you have identified the potential primary key column(s), prepare the table to ensure a smooth transition:

    Clean the Data: Remove any duplicate rows or NULL values in the columns intended to be the primary key. This step is crucial because primary keys must contain unique, non-null values.

    Create a New Column if Necessary: If the table lacks a suitable column for a primary key, add a new column. This column can be populated with unique values, such as an auto-incrementing integer, to serve as the primary key.
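
    For instance, in SQL Server the new column could be added like this (table_name and row_id are placeholders):

    -- note: on a large table this is a size-of-data operation, since every existing row is populated
    ALTER TABLE table_name ADD row_id INT IDENTITY(1,1) NOT NULL;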

    3. Backup the Database

    Before making any structural changes, it's vital to back up the database. This backup provides a recovery point in case anything goes wrong during the primary key addition process.

    4. Test the Changes in a Development Environment

    Implement the changes in a non-production environment to identify potential issues. This step allows you to test the impact of the new primary key on existing queries, indexes, and application logic. It's also an opportunity to optimize the new key structure if needed.

    5. Implement the Primary Key in the Production Database

    After successfully testing the changes, proceed to implement the primary key in the production environment:

    Add the Primary Key Constraint: Use an appropriate SQL command to add the primary key constraint to the table. For instance, in SQL Server, you might use:

    ALTER TABLE table_name ADD PRIMARY KEY (column_name);

    Ensure that the command is tailored to your specific database management system (DBMS).

    Monitor the Change: After implementing the primary key, closely monitor the database for any unusual behavior or performance issues. This includes checking for any application errors that may arise from the new constraint.

    6. Update Documentation and Application Logic

    Once the primary key is in place, update all relevant documentation and application logic. This includes:

    Database Schema Documentation: Reflect the new primary key in your database schema documentation.

    Application Code: Update any code that interacts with the database to recognize and leverage the new primary key. This step is crucial to avoid runtime errors or logic issues in your applications.

    7. Plan for Regular Maintenance and Monitoring

    Finally, establish a plan for regular maintenance and monitoring of the table:

    Index Optimization: Review and optimize indexes to ensure they complement the new primary key and do not degrade performance.

    Data Integrity Checks: Implement regular checks to ensure data integrity, including monitoring for any changes that could introduce duplicates or NULL values into the primary key column.

    Conclusion

    Adding a primary key to an existing table is a critical task that enhances data integrity and query performance. By carefully planning the process, thoroughly testing changes, and ensuring robust documentation and monitoring, you can successfully implement a primary key without disrupting existing systems. This proactive approach not only safeguards the data but also optimizes the overall database performance.

  • This was removed by the editor as SPAM

  • Jalaal Akber wrote:

    Index Optimization: Review and optimize indexes to ensure they complement the new primary key and do not degrade performance. Data Integrity Checks: Implement regular checks to ensure data integrity, including monitoring for any changes that could introduce duplicates or NULL values into the primary key column.

    The above bit of advice is useless. The primary key, by definition, must be unique, so you will never have duplicates. NULLS are also impossible in a primary key column by definition of the primary key. The primary key is there to ensure that each row is unique. Microsoft docs say that for a primary key - "All columns defined within a primary key constraint must be defined as not null" and "each combination of values from all the columns in the primary key constraint definition must be unique". Link - https://learn.microsoft.com/en-us/sql/relational-databases/tables/primary-and-foreign-key-constraints?view=sql-server-ver16

    The quoted post sounds like it was written by an AI and nobody read and understood the reply before passing the information along...

    I do agree with ScottPletcher about some changes having no impact on the application, but there is also always risk with making changes that go against what the vendor set up and recommends. Like Jeff said, it could void your warranty and support. I've seen some vendors where adding an index on one of their tables means you lose support, even though a new index shouldn't cause any issues except a small performance hit on CRUD operations. Mind you, even adding a trigger can ruin support, so my advice of basically cloning the data with a trigger can be bad too.

    Using an SSIS job to migrate the data on a schedule (unfortunately it wouldn't be real-time that way) could be an option, but I'm pretty sure that is not a good option for your specific scenario... CDC (Change Data Capture) may be an option (see the sketch at the end of this post), though again it may result in loss of support/warranty.

    I have one system where, if my database is in FULL recovery mode, I lose support for "running in an unsupported configuration". The database must be in simple recovery mode and I have no clue why. I argued with the vendor on that until they stopped replying, and I decided support was more important than a good RPO. The app owner agreed.
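
    For reference, enabling CDC on that table would look roughly like this -- a sketch only, assuming your edition supports it, SQL Agent is running for the capture/cleanup jobs, and the vendor doesn't object; @role_name = NULL just means no gating role:

    EXEC sys.sp_cdc_enable_db;
    GO
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'TRACK',
         @role_name     = NULL;
    GO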

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
