April 11, 2011 at 9:38 am
I'm doing some ETL work and am getting flat files from a Progress system which need to be loaded into SQL Server 2008. Obviously I'm using SSIS, but I'm wondering about the best/fastest/most reliable way to check whether the record I'm attempting to insert already exists.
If it does, should I delete the record and reinsert the new one? Is there another way?
Any help ANYONE could provide would be greatly appreciated!
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 11, 2011 at 10:28 am
Depending on your version of SQL Server (2008 R2, and I think also 2008), you could use the MERGE statement, which is the SQL equivalent of an Oracle UPSERT (i.e. insert or update depending on existence).
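In rough outline it looks something like this (just a sketch; the table and column names here are hypothetical):

MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.KeyColumn = s.KeyColumn
WHEN MATCHED THEN
    UPDATE SET t.SomeColumn = s.SomeColumn   -- row exists: update it
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyColumn, SomeColumn)           -- row is new: insert it
    VALUES (s.KeyColumn, s.SomeColumn);      -- MERGE must end with a semicolon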
Steve.
April 11, 2011 at 2:11 pm
This is basically the situation I have in my Data Flow:
Flat File
|
Derived Column
|
Data Conversion
|
SQL Server Destination
Are you saying something like:
Flat File
|
Derived Column
|
Data Conversion
|
Lookup
|-------------------------|
| Execute SQL task
| (Merge)
(New record)
SQL Server Destination
Any thoughts on DELETEing the record if it exists, rather than updating it?
It seems like more overhead to check each column for a change (especially if there are hundreds of them) than to check whether the UID value exists, then delete and reinsert?
Thanks for your patience and help!
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 11, 2011 at 3:01 pm
Possibly? :hehe:
I guess, based on what you've supplied, I would say something more like:
Flat File
|
Derived Column
|
Data Conversion
|
Temp/Perm "work" table [SQL Server Destination]
|-------------------------|
| Execute SQL task
| Merge Statement (covers insert, update, delete)
i.e., get your data into a table that you can then use as input to the MERGE statement.
The way I read the help on MERGE, it definitely appears that on the MATCHED condition you have the choice of running an UPDATE or a DELETE.
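As a sketch of that choice (the names and the IsDeleted flag are hypothetical; T-SQL allows two WHEN MATCHED branches as long as the first one carries an extra AND condition, with one updating and one deleting):

MERGE dbo.TargetTable AS t
USING dbo.WorkTable AS s
    ON t.KeyColumn = s.KeyColumn
WHEN MATCHED AND s.IsDeleted = 1 THEN
    DELETE                                   -- matched row flagged for removal
WHEN MATCHED THEN
    UPDATE SET t.SomeColumn = s.SomeColumn   -- matched row gets the new values
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyColumn, SomeColumn)
    VALUES (s.KeyColumn, s.SomeColumn);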
Steve.
April 12, 2011 at 9:03 am
MyDoggieJessie (4/11/2011)
This is basically the situation I have in my Data Flow: Flat File → Derived Column → Data Conversion → SQL Server Destination. [...] Any thoughts on DELETEing the record if it exists, rather than updating it? It seems like more overhead to check each column for a change (especially if there are hundreds of them) than to check whether the UID value exists, then delete and reinsert?
Have you considered the Slowly Changing Dimension component?
Regards
Brian Ellul
----------------------------------------------------------------------------------
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein -
April 12, 2011 at 9:44 am
Working only in the data flow, I would probably just do a lookup against my destination table to determine whether a record already exists. What to do about it depends on whether you are working in 2005 or 2008.
April 12, 2011 at 3:53 pm
Could the Slowly Changing Dimension transformation help in this situation?
April 13, 2011 at 3:32 am
If it is just a matter of checking whether a record exists so you don't load duplicates, I would use a Lookup component. Only keep the rows with no match, so you get only the new records.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2011 at 11:05 am
I largely agree with Koen; however, for that method to work without issue you have to know that there will be no duplicates coming out of the source. If you have a duplicate, this method will fail intermittently. The reason is that the lookup runs on the first record and doesn't find it, so it's marked as an insert; but before that insert actually happens, the second record is also looked up, isn't found either, and is marked as an insert too, and then boom. This is very common when both records are in the same buffer, but it can happen at other times. I've gotten burned on this before.
CEWII
April 13, 2011 at 1:15 pm
Elliott Whitlow (4/13/2011)
I largely agree with Koen; however, for that method to work without issue you have to know that there will be no duplicates coming out of the source. If you have a duplicate, this method will fail intermittently. [...]
Ah, you are right, I forgot about that pitfall 🙂
@OP: you can also just use an "ordinary" SQL SELECT statement combined with a NOT EXISTS clause to fetch the data.
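For example, something like this sketch (dbo.StagingCustomer, dbo.Customer, and CustomerID are hypothetical names for your staging table, target table, and key column):

-- Fetch only the rows that do not already exist in the target
SELECT s.CustomerID, s.SomeColumn
FROM   dbo.StagingCustomer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.Customer AS c
                   WHERE  c.CustomerID = s.CustomerID);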
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2011 at 1:51 pm
Seems like there would be a lot of overhead with that, as I'd be comparing a single record from the flat file against a table that has 50+ million rows?
I'll give it a try and see though!
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 13, 2011 at 2:15 pm
MyDoggieJessie (4/13/2011)
Seems like there would be a lot of overhead with that, as I'd be comparing a single record from the flat file against a table that has 50+ million rows?
I'll give it a try and see though!
That's quite a bit of important information that you didn't give at the beginning... 🙂
I would still use the EXISTS clause: it stops searching as soon as it finds a match, while all the other options search through the whole recordset.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2011 at 3:32 pm
Sorry not to have mentioned that before.
So, forgive my ignorance here... how would I implement it?
I've got a Flat File Source connected directly to my SQL Server Destination.
How do I place an EXISTS check between these two components, and how does it know which column I am referring to?
______________________________________________________________________________
Never argue with an idiot; they'll drag you down to their level and beat you with experience
April 13, 2011 at 3:55 pm
Hey Jessie.
Regarding deletion vs. updating, it will depend a bit on your data. When a record comes in on your primary key from your data source, does it contain all the necessary information, or does your existing database have additional information added to it? The reason I ask is that the answer will help you determine whether a delete/full insert is your best bet or whether you want to do a two-stage insert/update.
Updates from SSIS can't be done in bulk; they're one-at-a-time updates. So you'd have to push your failures over to an update staging table.
What I have found works best for me in these situations is a two-stage process. First, I do a lookup on just the existing key (identity fields, or whatever you're using to determine that a record previously existed). Split at this point on success/fail. A fail means the record didn't previously exist, so it continues on towards the insert process. Successes get shunted to a staging table for later processing. On the insert process, any failure THERE also gets shunted to the staging table (i.e. repeated identities in the same file, things like that).
Finally, a proc will do a bulk update for me on the remaining records.
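That final bulk update might look something like this (a sketch; the table and column names are hypothetical):

-- Bulk-update the target from the rows SSIS shunted to the staging table
UPDATE t
SET    t.SomeColumn = s.SomeColumn
FROM   dbo.Customer AS t
JOIN   dbo.Customer_UpdateStaging AS s
       ON s.CustomerID = t.CustomerID;

-- Clear the staging table for the next run
TRUNCATE TABLE dbo.Customer_UpdateStaging;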
A 50 million row cache for a lookup is going to eat a lot of memory no matter what you do. Even if you're just using an identity field from the original source, that's an int. 50 million rows × 4 bytes = 200 million bytes, or around 200 MB. For one int. Get into a couple of varchar business keys and this might go off the charts.
The other option is to use SSIS to simply dump the entire file to a staging table, and use T-SQL and index mechanics to help you determine the proper operation on a per-row basis. You can even batch this to 10,000 rows from the source file at a time if you choose, as sketched below. Once you've determined the proper operation, you can then do your two passes on the target table with updates and inserts.
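A sketch of that batched variant, using hypothetical table names and a DELETE ... OUTPUT to carve off 10,000 staged rows at a time:

DECLARE @batch TABLE (CustomerID int, SomeColumn varchar(100));

WHILE 1 = 1
BEGIN
    -- Pull the next 10,000 rows out of staging into the batch
    DELETE TOP (10000) FROM dbo.StagingCustomer
    OUTPUT deleted.CustomerID, deleted.SomeColumn INTO @batch;

    IF @@ROWCOUNT = 0 BREAK;   -- staging table is empty: done

    -- Pass 1: update rows that already exist in the target
    UPDATE t
    SET    t.SomeColumn = b.SomeColumn
    FROM   dbo.Customer AS t
    JOIN   @batch AS b ON b.CustomerID = t.CustomerID;

    -- Pass 2: insert rows that do not exist yet
    INSERT INTO dbo.Customer (CustomerID, SomeColumn)
    SELECT b.CustomerID, b.SomeColumn
    FROM   @batch AS b
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customer AS t
                       WHERE t.CustomerID = b.CustomerID);

    DELETE FROM @batch;        -- reset for the next batch
END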
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
April 13, 2011 at 11:39 pm
MyDoggieJessie (4/13/2011)
Sorry not to have mentioned that before.
So, forgive my ignorance here... how would I implement it?
I've got a Flat File Source connected directly to my SQL Server Destination.
How do I place an EXISTS check between these two components, and how does it know which column I am referring to?
Right, I forgot the source is a flat file.
I would follow Craig's advice: import the flat file to a staging table and then solve the problem using T-SQL and indexes.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP