March 12, 2024 at 4:43 am
Hi everyone
I am in the process of updating my current SSIS package so that it handles incremental loading/processing of new files; the current version does not do this.
Background... I have an SSIS package that downloads data from an SFTP server and then runs a number of stored procedures (SPs). The problem is that the SPs run against records that have already been processed, which means historical records are recalculated each day in addition to the new ones. Only new records need to be processed by the SPs; the old ones should not be touched again.
I need some guidance on how best to handle the rewrite of the SSIS package and the SPs. I am fairly new to SQL Server, so I want to make sure I don't repeat past mistakes. Here are my initial thoughts and concerns...
Proposal 1 - Have a processed flag added to the main table
When new files are loaded, the flag is set to 'N'. The SPs will only process records where processed_flag = 'N'. Once the SPs are done, update processed_flag to 'Y' with this type of query:
update dbo.MainTable
set processed_flag = 'Y'
The challenge with this approach is that it isn't remotely efficient (per my initial testing/analysis). The table currently holds test data, about 35 million records; the full dataset is close to 4.5 billion records. The query above takes about 20 minutes to update the 35 million records. If the run time scales linearly with input size (a big assumption), the same update on 4.5 billion records would take close to 43 hours (4.5 billion / 35 million is roughly 129 runs of 20 minutes). Nowhere close to acceptable. Is my approach the right one to take? Maybe the idea is a good one but my implementation isn't the most efficient way to do it?
The advantage of this approach is that updating my SPs would be really straightforward. I would just add "...where MainTable.processed_flag = 'N'" and that is it.
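For illustration, the change inside each SP would look something like this (the column names here are made up, not my real schema):

SELECT mt.trade_date, mt.symbol, mt.price
FROM dbo.MainTable AS mt
WHERE mt.processed_flag = 'N';  -- new filter; today the SPs read every row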
Proposal 2 - Create a new SQL Server table that tracks which dates have been processed
The table will have two fields: one for the date and another for a processed_flag ('Y' or 'N'). Each SP will join to this table on the date field, and I would add "where table.processed_flag = 'N'" to ensure that only new records are processed. Once the SPs are done, the processed_flag is updated to 'Y'. Given the importance of this table, I would wrap everything in a BEGIN TRANSACTION block so that if one SP fails, a rollback reverses all changes. (There is a sketch of this below.)
The data in question is financial data. This table will have around 3,000 records (one per trading day), so updating and referencing it will be far cheaper than the table in Proposal 1. It grows by one record per day, so it will never be a big table.
I have not tested Proposal 2, but I am assuming it will perform better. I am a bit concerned that the extra join may negatively impact the run time of the SPs.
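Roughly what I am picturing (table and column names are placeholders, not my real schema):

CREATE TABLE dbo.ProcessedDates
(
    trade_date     date    NOT NULL PRIMARY KEY,
    processed_flag char(1) NOT NULL DEFAULT 'N'  -- 'N' = not yet processed, 'Y' = done
);

-- inside each SP: join to the control table so only unprocessed dates flow through
SELECT mt.trade_date, mt.symbol, mt.price
FROM dbo.MainTable AS mt
INNER JOIN dbo.ProcessedDates AS pd
    ON pd.trade_date = mt.trade_date
WHERE pd.processed_flag = 'N';

-- wrapper: if any SP fails, the rollback also reverses the flag update
BEGIN TRANSACTION;
    -- EXEC each SP here
    UPDATE dbo.ProcessedDates
    SET processed_flag = 'Y'
    WHERE processed_flag = 'N';
COMMIT TRANSACTION;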
What does the community think? Am I on the right track? Is there a better way to do this? Is there anything else I should be aware of?
I just want to make sure I am on the right track before I spend the time on the rewrite.
Any guidance you can offer is really appreciated.
Thank you
March 12, 2024 at 8:40 am
How many tables are updated during this process?
Does each table have date created / date modified columns?
Is the incoming data purely inserts, or inserts + updates, or inserts + updates + deletes?
Do the tables use incrementing primary keys?
Could you import the data into a staging table, run the updates on that and then push the updated data into the target tables?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
March 12, 2024 at 3:15 pm
Thanks for the reply.
March 12, 2024 at 3:37 pm
I prefer the 2nd approach, but for the 1st approach:
The processed_flag should be a bit (0/1) rather than a full char ('Y'/'N'). If the table already has one or more bit columns, adding this one may not even require an additional byte per row.
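For example (using dbo.MainTable from your post; the constraint name is just a suggestion):

ALTER TABLE dbo.MainTable
    ADD processed_flag bit NOT NULL
        CONSTRAINT DF_MainTable_processed_flag DEFAULT (0);

Depending on your version and edition (Enterprise, 2012+), adding a NOT NULL column with a constant default like this can be a metadata-only change, so it would not have to rewrite all 4.5 billion rows.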
When you update the processed flag, you would only update ones that are 0, not all rows (over again):
update dbo.MainTable
set processed_flag = 1
where processed_flag = 0
That might make it better.
Also, this would seem to me to be a case where you'd use a staging table. That is, new loads go into a staging table, are processed there, and only when complete are transferred to the main table.
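A rough outline of that flow, with placeholder object names:

-- 1) SSIS loads each new file into an empty staging table
TRUNCATE TABLE dbo.MainTable_Staging;
-- (the SSIS data flow inserts the file's rows here)

-- 2) the SPs run against the staging table only; it contains nothing but
--    new rows, so there is no flag to check and no history to recalculate

-- 3) once all SPs succeed, move the finished rows into the main table
INSERT INTO dbo.MainTable (trade_date, symbol, price, calc_value)
SELECT trade_date, symbol, price, calc_value
FROM dbo.MainTable_Staging;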
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".