Best way to handle incremental duplicate data in SQL Server?

  • I am processing .csv files daily that contain person data. Every person has a unique ID.

    My table has FileID and PersonUniqueID columns; the import works fine and the data is saved to a parsed table, i.e. with duplicate rows.

    Now I would like to copy the unique data to new tables, which are used by other applications.

    I can copy the data using a simple SELECT with NOT EXISTS and a DISTINCT clause (something like the query at the end of this post), but I would need to track the history of each existing PersonUniqueID.

    That is, I would need to keep the latest row for each PersonUniqueID in the main table, and copy the previous rows belonging to that person to a HISTORY table.

    Please let me know what the best approach is to handle this case.

    I was thinking about using a LOOP to traverse each PersonUniqueID one by one, but that would take too much time for millions of rows of data.
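
    Something like this is what I mean (table and column names are just illustrative):

      -- Copy rows from the parsed table that are not yet in the main table.
      INSERT INTO dbo.PersonMain (PersonUniqueID, FileID)
      SELECT DISTINCT p.PersonUniqueID, p.FileID
      FROM dbo.PersonParsed p
      WHERE NOT EXISTS
      (
          SELECT 1
          FROM dbo.PersonMain m
          WHERE m.PersonUniqueID = p.PersonUniqueID
      );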

  • In SQL, I would avoid a loop at all costs. I would estimate that 90% (likely a lot higher) of the time you do not need a loop.

    Now, my approach would be to sanitize the data before the insert if possible; if that's not possible, then insert the CSV file into a staging table and only move NEW records into the new table. One way to do it would be, like you said, with NOT EXISTS, but you would want good indexes on the join columns to prevent performance problems. Another approach would be to use EXISTS to delete the records from staging that already exist in the target tables, then insert whatever is left.
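
    Both options, as rough sketches (all table and column names here are invented for illustration):

      -- Option 1: insert only the rows that don't already exist in the target.
      INSERT INTO dbo.Person (PersonUniqueID, FileID)
      SELECT s.PersonUniqueID, s.FileID
      FROM dbo.Staging s
      WHERE NOT EXISTS
      (
          SELECT 1 FROM dbo.Person p
          WHERE p.PersonUniqueID = s.PersonUniqueID
      );

      -- Option 2: delete staged rows that already exist in the target,
      -- then insert whatever remains in staging.
      DELETE s
      FROM dbo.Staging s
      WHERE EXISTS
      (
          SELECT 1 FROM dbo.Person p
          WHERE p.PersonUniqueID = s.PersonUniqueID
      );

      INSERT INTO dbo.Person (PersonUniqueID, FileID)
      SELECT PersonUniqueID, FileID
      FROM dbo.Staging;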

    What I don't understand is why your PersonUniqueID value for a person would change. Once an ID is created for the person, when would their ID change? What causes it to change? Does it need to change?

    When I design tables like a person table, the ID is used to identify a single individual and will never change once created. Any time I need to do mapping of that user to anything else (files in your case), I can reliably say that user 1234 is me (for example). If the ID of a person changes every time you import the CSV, that feels like a lot of extra overhead.

    BUT an approach that could handle that is to drop the ID column on the person table and, in the mapping table, just store the person name and the file ID, then map between your person table and the mapping table by the person name. Create some good indexes around the person name and you are good to go... until you get 2 people hired with the same name. But you already have that problem by changing IDs.

    What I mean is if Brian Gale is hired and then a second Brian Gale gets hired, how will you know which Brian accessed the files for history purposes IF the ID is going to change with each CSV import?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Use ROW_NUMBER() rather than a loop.  You would set up your ROW_NUMBER() so that the latest record is row number 1 and historical records are 2+.  If you want more details, you'll need to give us more information, preferably in the form of sample data (as scripts to create and insert data into a temp table).
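
    For example (a minimal sketch; I'm guessing at the table name and at FileID being a usable "latest" ordering - substitute whatever column actually identifies the newest row):

      WITH Ranked AS
      (
          SELECT PersonUniqueID, FileID,
                 ROW_NUMBER() OVER (PARTITION BY PersonUniqueID
                                    ORDER BY FileID DESC) AS rn
          FROM dbo.PersonParsed
      )
      SELECT PersonUniqueID, FileID
      FROM Ranked
      WHERE rn = 1;   -- rn = 1 is the latest row per person; rn >= 2 are history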

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Similar to Brian, I like to stage and sanitize things like this. Without DDL or more details, I don't want to guess, but my general process is something like the following (a rough T-SQL sketch of these steps follows the list):

    • truncate staging table
    • load to staging table
    • clean data if needed
    • get matches with real table, possibly archive changes
    • use inner join to update real table from staging
    • use inner join with real table to delete staging table matches
    • insert the rest of staging into real
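
    Here's that process sketched out; every object name, column, and the file path below is a placeholder, not your actual schema:

      -- 1) truncate staging table
      TRUNCATE TABLE dbo.Staging;

      -- 2) load to staging table
      BULK INSERT dbo.Staging
      FROM 'C:\import\persons.csv'
      WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

      -- 3) clean data if needed (example: trim names)
      UPDATE dbo.Staging SET PersonName = LTRIM(RTRIM(PersonName));

      -- 4) get matches with the real table and archive the rows about to change
      INSERT INTO dbo.PersonHistory (PersonUniqueID, FileID)
      SELECT p.PersonUniqueID, p.FileID
      FROM dbo.Person p
      INNER JOIN dbo.Staging s ON s.PersonUniqueID = p.PersonUniqueID;

      -- 5) use an inner join to update the real table from staging
      UPDATE p
      SET p.FileID = s.FileID
      FROM dbo.Person p
      INNER JOIN dbo.Staging s ON s.PersonUniqueID = p.PersonUniqueID;

      -- 6) use an inner join with the real table to delete staging matches
      DELETE s
      FROM dbo.Staging s
      INNER JOIN dbo.Person p ON p.PersonUniqueID = s.PersonUniqueID;

      -- 7) insert the rest of staging into the real table
      INSERT INTO dbo.Person (PersonUniqueID, FileID)
      SELECT PersonUniqueID, FileID
      FROM dbo.Staging;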

    HTH

  • You could add an IsCurrent bit column, set it to 1 for the most recent row, and update the older rows to 0. Then there is no need to copy old data to other tables.
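
    For example (assuming the incoming rows land in a staging table first; all names are illustrative):

      -- Flip the existing row(s) for each incoming person to "not current"...
      UPDATE p
      SET p.IsCurrent = 0
      FROM dbo.Person p
      INNER JOIN dbo.Staging s ON s.PersonUniqueID = p.PersonUniqueID
      WHERE p.IsCurrent = 1;

      -- ...then insert the new rows as the current ones.
      INSERT INTO dbo.Person (PersonUniqueID, FileID, IsCurrent)
      SELECT PersonUniqueID, FileID, 1
      FROM dbo.Staging;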

  • SCDs (specifically, type 6 Slowly Changing Dimensions) would work the best.  One of the better ways to implement an SCD is with system-versioned temporal tables.

    https://en.wikipedia.org/wiki/Slowly_changing_dimension

    I would avoid the use of a simple "IsCurrent" flag at all costs.  Use a date/time column instead.  If you use SCDs properly (they have a start and end date for the period where a given row is valid), then you can very easily do point-in-time queries for all entities in the table.  System-versioned temporal tables support all that and much more.

    https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16
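
    For example, a minimal sketch with illustrative names (requires SQL Server 2016 or later):

      -- SQL Server maintains dbo.PersonHistory automatically on every
      -- UPDATE/DELETE against dbo.Person.
      CREATE TABLE dbo.Person
      (
          PersonUniqueID int NOT NULL PRIMARY KEY,
          FileID         int NOT NULL,
          ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
          ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
          PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
      )
      WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PersonHistory));

      -- Point-in-time query: what did the table look like a day ago?
      -- (AS OF takes a literal or a variable, not a function call.)
      DECLARE @AsOf datetime2 = DATEADD(DAY, -1, SYSUTCDATETIME());
      SELECT *
      FROM dbo.Person
      FOR SYSTEM_TIME AS OF @AsOf;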


    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
