November 16, 2017 at 10:02 am
Hi, I'm a DBA who does a small amount of SSIS/T-SQL development, and I've been given a project to read in files on a daily basis and add only changed or new records to a target table. I've come up with an inelegant method to do it, but I'm wondering whether there's an optimal way to achieve this. At the moment I'm only working with a few thousand records per file, but that will scale to several million pretty imminently. I've searched for best-practice docs but haven't found anything that struck me as a whole lot better than what I'm doing now.
Basic process:
1. Read the CSV file into a temporary staging table.
2. Identify records which aren't identical to records already in the target table.
3. Add the new/changed records to the target table (not updating old records with new values).
If someone has a good blog post on this, I'd be happy to check it out. I'm not looking for the solution at a T-SQL level, just guidance on the optimal way to get there.
Thanks, folks
November 16, 2017 at 10:07 am
DonkeyKing - Thursday, November 16, 2017 10:02 AM
Hi, I'm a DBA who does a small amount of SSIS/T-SQL development, and I've been given a project to read in files on a daily basis and add only changed or new records to a target table. I've come up with an inelegant method to do it, but I'm wondering whether there's an optimal way to achieve this. At the moment I'm only working with a few thousand records per file, but that will scale to several million pretty imminently. I've searched for best-practice docs but haven't found anything that struck me as a whole lot better than what I'm doing now.
Basic process:
1. Read the CSV file into a temporary staging table.
2. Identify records which aren't identical to records already in the target table.
3. Add the new/changed records to the target table (not updating old records with new values).
If someone has a good blog post on this, I'd be happy to check it out. I'm not looking for the solution at a T-SQL level, just guidance on the optimal way to get there. Thanks, folks
You can do steps 2 and 3 at the same time, using either EXISTS or EXCEPT in a SQL task.
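For example, a minimal sketch along these lines (the names used here, dbo.Staging, dbo.Target, a two-part key KeyCol1/KeyCol2 and a data column Col1, are placeholders rather than your real schema):

-- Insert any staged row that does not already exist verbatim in the target.
-- EXCEPT compares whole rows and treats NULLs as equal, which helps when
-- many of the columns being compared are nullable.
INSERT INTO dbo.Target (KeyCol1, KeyCol2, Col1)
SELECT s.KeyCol1, s.KeyCol2, s.Col1
FROM dbo.Staging AS s
EXCEPT
SELECT t.KeyCol1, t.KeyCol2, t.Col1
FROM dbo.Target AS t;

A NOT EXISTS version works just as well, but then you have to handle NULL comparisons on each column yourself.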
November 16, 2017 at 10:41 am
DonkeyKing - Thursday, November 16, 2017 10:02 AM
Hi, I'm a DBA who does a small amount of SSIS/T-SQL development, and I've been given a project to read in files on a daily basis and add only changed or new records to a target table. I've come up with an inelegant method to do it, but I'm wondering whether there's an optimal way to achieve this. At the moment I'm only working with a few thousand records per file, but that will scale to several million pretty imminently. I've searched for best-practice docs but haven't found anything that struck me as a whole lot better than what I'm doing now.
Basic process:
1. Read the CSV file into a temporary staging table.
2. Identify records which aren't identical to records already in the target table.
3. Add the new/changed records to the target table (not updating old records with new values).
If someone has a good blog post on this, I'd be happy to check it out. I'm not looking for the solution at a T-SQL level, just guidance on the optimal way to get there. Thanks, folks
Does (or can) your source data contain DateCreated & DateLastModified columns?
What does the primary key of your source data look like?
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
November 16, 2017 at 1:10 pm
No date created or modified on the data. The primary key is a combination of two columns, but any of up to 80 columns could change, driving the need for a new record to be added.
November 16, 2017 at 1:16 pm
DonkeyKing - Thursday, November 16, 2017 1:10 PM
No date created or modified on the data. The primary key is a combination of two columns, but any of up to 80 columns could change, driving the need for a new record to be added.
Wouldn't this scenario (i.e. PK exists, but other columns have changed) result in an UPDATE, rather than an INSERT?
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
November 16, 2017 at 1:31 pm
Another question in addition to Phil's... will your target table be updated by anything outside of this process? For example, will other processes or users be able to update this table, or is it strictly managed by the process you are developing?
November 17, 2017 at 6:22 am
Thanks for all the replies, gents.
The full flow of information is as follows:
1. Read the file into an initial load table and assign each file a batch ID.
2. Push the new/different records to the staging table, bringing the batch ID along. Records are sorted and assigned a row ID starting at 1 during the move from the ETL table to the staging table (each batch starts at 1). Existing older (different) versions of the record are ignored, but need to be retained in case of queries relating to older ETL processes.
3. All records with the batch ID are written to a file which goes to another reporting system. An error report is generated indicating which row the error occurred on (hence the unique row ID starting at 1 per batch).
4. Records in the staging table which were flagged as errors by the external process are updated to indicate they were not accepted; remaining records from the same batch are flagged as accepted.
I'm looking at using a MERGE statement or an SSIS Lookup, and will take a look at "EXISTS or EXCEPT" as suggested above. I'd prefer to do it in T-SQL rather than a package-component (Lookup) based approach, because it's easier to release patches to T-SQL code than to a package.
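For illustration, I'm picturing step 2 as a single statement, something like this (all object and column names below are placeholders, not the real schema):

DECLARE @BatchID int = 1;  -- batch currently being processed (example value)

-- Move rows from the load table that don't already exist verbatim in staging,
-- assigning a per-batch row ID starting at 1.
INSERT INTO dbo.Staging (BatchID, RowID, KeyCol1, KeyCol2, Col1)
SELECT  @BatchID,
        ROW_NUMBER() OVER (ORDER BY x.KeyCol1, x.KeyCol2) AS RowID,
        x.KeyCol1, x.KeyCol2, x.Col1
FROM
(
    SELECT l.KeyCol1, l.KeyCol2, l.Col1
    FROM dbo.InitialLoad AS l
    WHERE l.BatchID = @BatchID
    EXCEPT
    SELECT s.KeyCol1, s.KeyCol2, s.Col1
    FROM dbo.Staging AS s
) AS x;

The ROW_NUMBER() is calculated only over the rows being inserted for the current batch, so it restarts at 1 for every file.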
November 17, 2017 at 6:50 am
DonkeyKing - Friday, November 17, 2017 6:22 AM
Thanks for all the replies, gents.
The full flow of information is as follows:
1. Read the file into an initial load table and assign each file a batch ID.
2. Push the new/different records to the staging table, bringing the batch ID along. Records are sorted and assigned a row ID starting at 1 during the move from the ETL table to the staging table (each batch starts at 1). Existing older (different) versions of the record are ignored, but need to be retained in case of queries relating to older ETL processes.
3. All records with the batch ID are written to a file which goes to another reporting system. An error report is generated indicating which row the error occurred on (hence the unique row ID starting at 1 per batch).
4. Records in the staging table which were flagged as errors by the external process are updated to indicate they were not accepted; remaining records from the same batch are flagged as accepted.
I'm looking at using a MERGE statement or an SSIS Lookup, and will take a look at "EXISTS or EXCEPT" as suggested above. I'd prefer to do it in T-SQL rather than a package-component (Lookup) based approach, because it's easier to release patches to T-SQL code than to a package.
While this is useful information, it does not answer either of the questions posed by John and me.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
November 17, 2017 at 8:50 am
Hi Phil, apologies, I thought I did. I guess you can't hear the voices in my head 😉
Question from you: "Wouldn't this scenario (i.e. PK exists, but other columns have changed) result in an UPDATE, rather than an INSERT?"
Existing older (different) versions of the record need to be retained in case of queries relating to older ETL processes. To expand on the answer above, not all customer records are written out in each file, only updated or new records, but if I need to reproduce last March's file I need to retain that version of the record, along with other records from that batch.
Question from John: "Will other processes or users be able to update this table?" Answered in step 4 above: the separate process flagging records as accepted or not is the only thing updating the records once they're written. This process, however, will not update any of the columns brought in by the ETL process, and it's not amended by anything else.
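For what it's worth, step 4 is really just a pair of updates, roughly like this (assuming the error report has been loaded back into a table; all names here are invented for the example):

DECLARE @BatchID int = 1;  -- batch returned by the external process (example value)

-- Mark the rows the external process rejected...
UPDATE s
SET    s.Accepted = 0
FROM   dbo.Staging AS s
JOIN   dbo.ErrorReport AS e
  ON   e.BatchID = s.BatchID
 AND   e.RowID   = s.RowID
WHERE  s.BatchID = @BatchID;

-- ...then flag the rest of the batch as accepted (Accepted assumed NULL until set).
UPDATE s
SET    s.Accepted = 1
FROM   dbo.Staging AS s
WHERE  s.BatchID = @BatchID
  AND  s.Accepted IS NULL;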
November 20, 2017 at 5:34 am
You said you have 2 columns which combine to make a primary key.
And from the newly staged data, any (or none) of the 80 columns could change. (The changes could be found using EXISTS or EXCEPT, as noted above.) If any changed columns are found, you want to create a new record, not update the existing record.
This would create a duplicate primary key, would it not?