April 11, 2011 at 9:38 am
I'm doing some ETL work and am getting flat files from a Progress system which need to be loaded into SQL Server 2008. Obviously I'm using SSIS, but I'm wondering about the best/fastest/most reliable way to check whether the record I'm attempting to insert already exists.
If it does, should I delete the record and reinsert the new one? Is there another way?
Any help ANYONE could provide would be greatly appreciated!
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 11, 2011 at 10:28 am
Depending on your version of SQL Server (2008 R2, and I think also 2008), you could use the MERGE statement, which is the SQL equivalent of an Oracle UPSERT (i.e. insert or update depending on existence).
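In rough outline it looks something like this (just a sketch; the table and column names here are hypothetical):

MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.KeyColumn = s.KeyColumn
WHEN MATCHED THEN
    UPDATE SET t.SomeColumn = s.SomeColumn   -- row exists: update it
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyColumn, SomeColumn)           -- row is new: insert it
    VALUES (s.KeyColumn, s.SomeColumn);      -- MERGE must end with a semicolon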
Steve.
April 11, 2011 at 2:11 pm
This is basically the situation I have in my Data Flow:
Flat File
|
Derived Column
|
Data Conversion
|
SQL Server Destination
Are you saying something like:
Flat File
|
Derived Column
|
Data Conversion
|
Lookup
|-------------------------|
| Execute SQL task
| (Merge)
(New record)
SQL Server Destination
Any thoughts on DELETEing the record if it exists, rather than updating it?
It seems like more overhead to check each column for a change (especially if there are hundreds of them) than to check whether the UID value exists, then delete and reinsert?
Thanks for your patience and help!
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 11, 2011 at 3:01 pm
Possibly? :hehe:
I guess, based on what you've supplied, I would say something more like:
Flat File
|
Derived Column
|
Data Conversion
|
Temp/Perm "work" table [SQL Server Destination]
|-------------------------|
| Execute SQL task
| Merge Statement (covers insert, update, delete)
i.e., get your data into a table that you can then use as input to the MERGE statement.
The way I read the help on MERGE, it definitely appears that on the MATCHED condition you have the choice of running an UPDATE or a DELETE.
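As a sketch of that choice (the names and the IsDeleted flag are hypothetical; T-SQL allows two WHEN MATCHED branches as long as the first one carries an extra AND condition, with one updating and one deleting):

MERGE dbo.TargetTable AS t
USING dbo.WorkTable AS s
    ON t.KeyColumn = s.KeyColumn
WHEN MATCHED AND s.IsDeleted = 1 THEN
    DELETE                                   -- matched row flagged for removal
WHEN MATCHED THEN
    UPDATE SET t.SomeColumn = s.SomeColumn   -- matched row gets the new values
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyColumn, SomeColumn)
    VALUES (s.KeyColumn, s.SomeColumn);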
Steve.
April 12, 2011 at 9:03 am
MyDoggieJessie (4/11/2011)
This is basically the situation I have in my Data Flow: Flat File → Derived Column → Data Conversion → SQL Server Destination. [...] Any thoughts on DELETEing the record if it exists, rather than updating it? It seems like more overhead to check each column for a change (especially if there are hundreds of them) than to check whether the UID value exists, then delete and reinsert?
Have you considered the Slowly Changing Dimension component?
Regards
Brian Ellul
----------------------------------------------------------------------------------
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein -
April 12, 2011 at 9:44 am
Working only in the data flow, I would probably just do a lookup against my destination table to determine whether a record already exists. What to do about it depends on whether you are working in 2005 or 2008.
April 12, 2011 at 3:53 pm
Could the Slowly Changing Dimension transformation help in this situation?
April 13, 2011 at 3:32 am
If it is just a matter of checking whether a record exists so you don't load duplicates, I would use a Lookup component. Only keep the rows with no match, so you get only the new records.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2011 at 11:05 am
I largely agree with Koen; however, for that method to work without issue you have to know that there will be no duplicates coming out of the source. If you have a duplicate, this method will fail intermittently. The reason is that the lookup runs on the first record and doesn't find it, so it's marked as an insert; but before that insert actually happens, the second record is also looked up, isn't found either, and is marked as an insert too, and then boom. This is very common when both records are in the same buffer, but it can happen at other times. I've gotten burned on this before.
CEWII
April 13, 2011 at 1:15 pm
Elliott Whitlow (4/13/2011)
I largely agree with Koen; however, for that method to work without issue you have to know that there will be no duplicates coming out of the source. If you have a duplicate, this method will fail intermittently. [...]
Ah, you are right, I forgot about that pitfall 🙂
@OP: you can also just use an "ordinary" SQL SELECT statement combined with a NOT EXISTS clause to fetch the data.
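For example, something like this sketch (dbo.StagingCustomer, dbo.Customer, and CustomerID are hypothetical names for your staging table, target table, and key column):

-- Fetch only the rows that do not already exist in the target
SELECT s.CustomerID, s.SomeColumn
FROM   dbo.StagingCustomer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.Customer AS c
                   WHERE  c.CustomerID = s.CustomerID);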
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2011 at 1:51 pm
Seems like there would be a lot of overhead with that, as I'd be comparing a single record from the flat file against a table that has 50+ million rows?
I'll give it a try and see though!
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 13, 2011 at 2:15 pm
MyDoggieJessie (4/13/2011)
Seems like there would be a lot of overhead with that, as I'd be comparing a single record from the flat file against a table that has 50+ million rows?
I'll give it a try and see though!
That's quite a bit of important information that you didn't give at the beginning... 🙂
I would still use the EXISTS clause: it stops searching as soon as it finds a match, while all the other options search through the whole recordset.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2011 at 3:32 pm
Sorry not to have mentioned that before.
So, forgive my ignorance here... how would I implement it?
I've got a Flat File Source connected directly to my SQL Server Destination.
How do I place an EXISTS check between these two components, and how does it know which column I am referring to?
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 13, 2011 at 3:55 pm
Hey Jessie.
Regarding deletion vs. updating, it will depend a bit on your data. When a record comes in on your primary key from your data source, does it contain all the necessary information, or does your existing database have additional information added to it? The reason I ask is that the answer will help you determine whether a delete/full insert is your best bet or whether you want to do a two-stage insert/update.
Updates from SSIS can't be done in bulk; they're one-at-a-time updates. So you'd have to push your failures over to an update staging table.
What I have found works best for me in these situations is a two-stage process. First, I do a lookup on just the existing key (identity fields, or whatever you're using to determine that a record previously existed). Split at this point on success/fail. A fail means the record didn't previously exist, so it continues on towards the insert process. Successes get shunted to a staging table for later processing. On the insert process, any failure THERE also gets shunted to the staging table (i.e. repeated identities in the same file, things like that).
Finally, a proc will do a bulk update for me on the remaining records.
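That final bulk update might look something like this (a sketch; the table and column names are hypothetical):

-- Bulk-update the target from the rows SSIS shunted to the staging table
UPDATE t
SET    t.SomeColumn = s.SomeColumn
FROM   dbo.Customer AS t
JOIN   dbo.Customer_UpdateStaging AS s
       ON s.CustomerID = t.CustomerID;

-- Clear the staging table for the next run
TRUNCATE TABLE dbo.Customer_UpdateStaging;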
A 50 million row cache for a lookup is going to eat a lot of memory no matter what you do. Even if you're just using an identity field from the original source, that's an int. 50 million rows × 4 bytes = 200 million bytes, or around 200 MB. For one int. Get into a couple of varchar business keys and this might go off the charts.
The other option is to use SSIS to simply dump the entire file to a staging table, and use T-SQL and index mechanics to help you determine the proper operation on a per-row basis. You can even batch this to 10,000 rows from the source file at a time if you choose, as sketched below. Once you've determined the proper operation, you can then do your two passes on the target table with updates and inserts.
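A sketch of that batched variant, using hypothetical table names and a DELETE ... OUTPUT to carve off 10,000 staged rows at a time:

DECLARE @batch TABLE (CustomerID int, SomeColumn varchar(100));

WHILE 1 = 1
BEGIN
    -- Pull the next 10,000 rows out of staging into the batch
    DELETE TOP (10000) FROM dbo.StagingCustomer
    OUTPUT deleted.CustomerID, deleted.SomeColumn INTO @batch;

    IF @@ROWCOUNT = 0 BREAK;   -- staging table is empty: done

    -- Pass 1: update rows that already exist in the target
    UPDATE t
    SET    t.SomeColumn = b.SomeColumn
    FROM   dbo.Customer AS t
    JOIN   @batch AS b ON b.CustomerID = t.CustomerID;

    -- Pass 2: insert rows that do not exist yet
    INSERT INTO dbo.Customer (CustomerID, SomeColumn)
    SELECT b.CustomerID, b.SomeColumn
    FROM   @batch AS b
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customer AS t
                       WHERE t.CustomerID = b.CustomerID);

    DELETE FROM @batch;        -- reset for the next batch
END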
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
April 13, 2011 at 11:39 pm
MyDoggieJessie (4/13/2011)
Sorry not to have mentioned that before.
So, forgive my ignorance here... how would I implement it?
I've got a Flat File Source connected directly to my SQL Server Destination.
How do I place an EXISTS check between these two components, and how does it know which column I am referring to?
Right, I forgot the source is a flat file.
I would follow Craig's advice: import the flat file to a staging table and then solve the problem using T-SQL and indexes.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP