November 16, 2017 at 10:02 am
Hi, I'm a DBA who does a small bit of SSIS / T-SQL development, and I've been given a project to read in files on a daily basis and only add changed or new records to a target table. I've come up with an inelegant method to do it, but I was wondering whether there is an optimum way to achieve this. At the moment I'm only working with a few thousand records per file, but that will scale to several million pretty imminently. I've searched for best-practice docs but haven't found anything that struck me as being a whole lot better than what I am doing now.
Basic process
1. read CSV file into a temporary staging table
2. identify records which aren't identical to records already in the target table
3. add the new / changed records to the target table (not updating old records with new values).
If someone has a good blog post on this I'd be happy to check it out. I'm not looking for the solution at a T-SQL level, just guidance on the optimum way to get there.
thanks folks
November 16, 2017 at 10:07 am
DonkeyKing - Thursday, November 16, 2017 10:02 AM
You can do steps 2 and 3 at the same time, using either EXISTS or EXCEPT in a SQL task.
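For illustration, a minimal sketch of the EXCEPT approach. The table and column names (dbo.Staging, dbo.Target, KeyCol1, KeyCol2, Col01, Col02) are placeholders rather than the actual schema, and it assumes the target is allowed to hold more than one version of a business key:

-- Sketch only: object and column names are placeholders for the real schema.
-- EXCEPT compares every selected column (and treats NULLs as equal), so only
-- rows that are brand new, or that differ in any column from what is already
-- in the target, get inserted.
INSERT INTO dbo.Target (KeyCol1, KeyCol2, Col01, Col02 /* ...remaining columns... */)
SELECT s.KeyCol1, s.KeyCol2, s.Col01, s.Col02 /* ...remaining columns... */
FROM   dbo.Staging AS s
EXCEPT
SELECT t.KeyCol1, t.KeyCol2, t.Col01, t.Col02 /* ...remaining columns... */
FROM   dbo.Target AS t;

An EXISTS-based version works as well, but comparing ~80 possibly-NULL columns one by one gets verbose, which is why EXCEPT tends to be the shorter option here.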
November 16, 2017 at 10:41 am
DonkeyKing - Thursday, November 16, 2017 10:02 AM
Does (or can) your source data contain DateCreated & DateLastModified columns?
What does the primary key of your source data look like?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 16, 2017 at 1:10 pm
No date created or modified on the data. The primary key is combined from 2 columns, but any of up to 80 columns could change, driving the need for a new record to be added.
November 16, 2017 at 1:16 pm
DonkeyKing - Thursday, November 16, 2017 1:10 PM
Wouldn't this scenario (i.e., PK exists, but other columns have changed) result in an UPDATE, rather than an INSERT?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 16, 2017 at 1:31 pm
Another question in addition to Phil's... will your target table be updated by anything outside of this process? For example, will other processes or users be able to update this table, or is it strictly managed by the process you are developing?
November 17, 2017 at 6:22 am
Thanks for all the replies, gents.
The full flow of information is as follows:
1. Read the file into the initial load table and assign each file a batch ID.
2. Push the new / different records to the staging table, bringing the batch ID along. Records are sorted and assigned a row ID starting at 1 during the move from the ETL table to the staging table (each batch starts at 1) - see the sketch after this list. Existing older (different) versions of the record are ignored, but need to be retained in case of queries relating to older ETL processes.
3. All records with the batch ID are written to a file which goes to another reporting system. An error report is generated indicating which row each error occurred on (hence the unique row ID starting at 1 per batch).
4. Records in the staging table which were flagged as errors by the external process are updated to indicate they were not accepted; the remaining records from the same batch are flagged as accepted.
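For what it's worth, a rough sketch of how step 2 might look in T-SQL. All object, column, and variable names here are assumptions, and the ORDER BY should be whatever ordering the error report expects:

DECLARE @BatchID int = 101;   -- hypothetical: the batch currently being moved

-- Sketch only: dbo.LoadTable, dbo.Staging and the column names are placeholders.
INSERT INTO dbo.Staging (BatchID, RowID, KeyCol1, KeyCol2 /* ...other columns... */)
SELECT  @BatchID,
        ROW_NUMBER() OVER (ORDER BY d.KeyCol1, d.KeyCol2) AS RowID,  -- restarts at 1 for each batch
        d.KeyCol1, d.KeyCol2 /* ...other columns... */
FROM (
        -- keep only rows that are new or differ from anything already staged
        SELECT l.KeyCol1, l.KeyCol2 /* ...other columns... */
        FROM   dbo.LoadTable AS l
        WHERE  l.BatchID = @BatchID
        EXCEPT
        SELECT s.KeyCol1, s.KeyCol2 /* ...other columns... */
        FROM   dbo.Staging AS s
     ) AS d;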
I'm looking at using a MERGE statement or an SSIS Lookup, and I will take a look at "EXISTS or EXCEPT" as suggested above. I'd prefer to do it in T-SQL rather than a package-component (Lookup) based approach, because it's easier to release patches to T-SQL code than to a package.
November 17, 2017 at 6:50 am
DonkeyKing - Friday, November 17, 2017 6:22 AM
While this is useful information, it does not answer either of the questions posed by John and me.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 17, 2017 at 8:50 am
Hi Phil, apologies, I thought I did. I guess you can't hear the voices in my head 😉 Question from you: "Wouldn't this scenario (i.e., PK exists, but other columns have changed) result in an UPDATE, rather than an INSERT?"
Existing older (different) versions of the record need to be retained in case of queries relating to older ETL processes. To expand on the answer above, not all customer records are written out in each file, only updated or new records, but if I need to reproduce last March's file I need to retain that version of the record, along with other records from that batch.
Question from John: "will other processes or users be able to update this table" - answered in step 4 above, where the separate process flagging records as accepted or not is the only thing updating the records once they're written. This process, however, will not update any of the columns brought in by the ETL process, and it's not amended by anything else.
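As a side note on the retention point: because superseded versions stay in the staging table with their batch, reproducing an older extract is just a matter of filtering on that batch. A minimal sketch, where the object names and batch number are hypothetical:

DECLARE @OldBatchID int = 42;   -- hypothetical: the batch behind last March's file

SELECT   s.RowID, s.KeyCol1, s.KeyCol2 /* ...other columns... */
FROM     dbo.Staging AS s
WHERE    s.BatchID = @OldBatchID
ORDER BY s.RowID;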
November 20, 2017 at 5:34 am
You said you have 2 columns which combine to make a primary key.
And from the newly staged data, any (or none) of 80 columns could change. (The changes could be found using EXISTS or EXCEPT as noted above.) If any changed columns are found, you want to create a new record, not update the existing record.
This would create a duplicate primary key, would it not?