August 2, 2023 at 7:20 am
I am processing .csv files daily that contain person data. Every person has a unique ID.
My table has FileID and PersonUniqueID columns; the import works fine and everything is saved to a parsed table, i.e. with duplicate rows.
Now I would like to copy the unique data to new tables that are used by other applications.
I can copy the data using a simple SELECT with NOT EXISTS and a DISTINCT clause, but I would need to track the history of existing PersonUniqueID values.
That is, I need to keep the latest row for each PersonUniqueID in the main table, and the previous rows belonging to that person need to be copied to a HISTORY table.
Please let me know what the best approach to handle this case is.
I was thinking about using a LOOP to traverse PersonUniqueID values one by one, but that would take too much time for millions of rows of data.
August 2, 2023 at 2:10 pm
In SQL, I would avoid a loop at all costs. I would estimate that 90% (likely a lot higher) of the time you do not need a loop.
Now, my approach to this would be to sanitize the data before the insert if possible. If that's not possible, then insert the CSV file into a staging table and only move NEW records into the new table. One way to do it would be, like you said, with NOT EXISTS, but you would want good indexes on those columns to prevent performance problems. Another approach would be with EXISTS: delete the records from staging that already exist in the target tables, then insert whatever remains. A sketch of both is below.
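A minimal sketch of both options, assuming placeholder names (dbo.Staging, dbo.Person) and a PersonUniqueID column on each, since no DDL was posted:

-- Option 1: insert only the rows that don't already exist in the target.
INSERT INTO dbo.Person (PersonUniqueID, FileID)
SELECT DISTINCT s.PersonUniqueID, s.FileID
FROM dbo.Staging s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Person p
    WHERE p.PersonUniqueID = s.PersonUniqueID
);

-- Option 2: delete staged rows that already exist, then insert the rest.
DELETE s
FROM dbo.Staging s
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Person p
    WHERE p.PersonUniqueID = s.PersonUniqueID
);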
What I don't understand is why your PersonUniqueID value for a person would change. Once an ID is created for the person, when would their ID change? What causes it to change? Does it need to change?
When I design tables like a person table, the ID is used to identify a single individual and will never change once created. Any time I need to do mapping of that user to anything else (files in your case), I can reliably say that user 1234 is me (for example). If the ID of a person changes every time you import the CSV, that feels like a lot of extra overhead.
BUT an approach that could handle that is to drop the ID column on the person and, in the mapping table, just have the person name and the file ID, mapping between your person table and the mapping table by the person name. Then create some good indexes around the person name and you are good to go... until you get 2 people hired with the same name. But you already have that problem by changing IDs.
What I mean is if Brian Gale is hired and then a second Brian Gale gets hired, how will you know which Brian accessed the files for history purposes IF the ID is going to change with each CSV import?
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
August 2, 2023 at 8:30 pm
Use ROW_NUMBER() rather than a loop. You would set up your ROW_NUMBER() so that the latest record is row number 1 and historical records are 2+. If you want more details, you'll need to give us more information, preferably in the form of sample data (as scripts to create and insert data into a temp table).
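A quick sketch of that idea; dbo.Staging and the LoadDate ordering column are assumptions, since no sample data was posted:

WITH Numbered AS
(
    SELECT PersonUniqueID, FileID, LoadDate,
           ROW_NUMBER() OVER (PARTITION BY PersonUniqueID
                              ORDER BY LoadDate DESC) AS rn
    FROM dbo.Staging
)
SELECT PersonUniqueID, FileID, LoadDate
FROM Numbered
WHERE rn = 1;  -- rn = 1 is the latest row per person; rn >= 2 are the history rows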
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
August 2, 2023 at 9:03 pm
Similar to Brian, I like to stage and sanitize things like this. Without DDL or more details, I don't want to guess, but my general process is something like:
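(Sketch only; every object name below is a placeholder, since no DDL was posted in this thread.)

-- 1) Bulk load the daily CSV into dbo.Staging (BULK INSERT, bcp, SSIS, etc.).
-- 2) De-duplicate within staging (ROW_NUMBER() partitioned by PersonUniqueID works well).
-- 3) Archive the main-table rows that staging is about to supersede:
INSERT INTO dbo.PersonHistory (PersonUniqueID, FileID, LoadDate)
SELECT p.PersonUniqueID, p.FileID, p.LoadDate
FROM dbo.Person p
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Staging s
    WHERE s.PersonUniqueID = p.PersonUniqueID
);
-- 4) Upsert the cleaned staging rows into dbo.Person.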
HTH
August 2, 2023 at 10:17 pm
You could add an IsCurrent bit column, set it to 1 for the most recent row, and update the older rows to 0. Then there is no need to copy old data to other tables.
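A sketch of that, assuming a LoadDate column to order by (names are placeholders):

WITH Ranked AS
(
    SELECT IsCurrent,
           ROW_NUMBER() OVER (PARTITION BY PersonUniqueID
                              ORDER BY LoadDate DESC) AS rn
    FROM dbo.Person
)
UPDATE Ranked
SET IsCurrent = CASE WHEN rn = 1 THEN 1 ELSE 0 END;
-- The newest row per person gets IsCurrent = 1; everything older gets 0.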
August 2, 2023 at 11:37 pm
SCDs (specifically, Type 6 Slowly Changing Dimensions) would work the best. One of the better ways to implement an SCD is with System-Versioned Temporal Tables.
https://en.wikipedia.org/wiki/Slowly_changing_dimension
I would avoid the use of a simple "IsCurrent" flag at all costs. Use date/time columns instead. If you use SCDs properly (they have a start and end date for the period where a given row is valid), then you can very easily do point-in-time queries for all entities in the table. System-Versioned Temporal Tables support all that and much more.
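For example, a minimal system-versioned temporal table (SQL Server 2016 and later; table and column names are placeholders):

CREATE TABLE dbo.Person
(
    PersonUniqueID int NOT NULL PRIMARY KEY,
    FileID         int NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PersonHistory));

-- Point-in-time query: the state of every person as of a given moment.
SELECT PersonUniqueID, FileID
FROM dbo.Person
FOR SYSTEM_TIME AS OF '2023-08-01T00:00:00';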
--Jeff Moden
Change is inevitable... Change for the better is not.