August 7, 2015 at 11:43 pm
I have a data flow that brings the data from a flat file (.txt) into a table, but after processing a few rows it fails with this error:
Violation of PRIMARY KEY constraint 'PK_tableprimarykey'. Cannot insert duplicate key in object 'dbo.table'.
I want to send all the duplicate data to another table or flat file. How should I do it? What tool do I need to use to get the duplicates? Please let me know. Thanks for your time.
August 8, 2015 at 12:19 am
studySQL (8/7/2015)
I have a data flow that brings the data from a flat file (.txt) into a table, but after processing a few rows it fails with this error: Violation of PRIMARY KEY constraint 'PK_tableprimarykey'. Cannot insert duplicate key in object 'dbo.table'.
I want to send all the duplicate data to another table or flat file. How should I do it? What tool do I need to use to get the duplicates? Please let me know. Thanks for your time.
You have some options; here are the four most common:
😎
a) Keep the constraint on the target table and clean up the data before loading.
The simplest solution but not always practical if one does not control the file creation process.
b) Add an intermediate import table without the constraint and run the deduplication off that table.
Recommend this solution; it is simple to implement and very robust (see the first sketch after this list).
c) Introduce deduplication logic in the SSIS data flow.
Avoid this if possible; I would only recommend it if you cannot create objects or alter the destination in any way. Even if it seems simple, it can become quite complicated and resource intensive, e.g. duplicates across batches mean a lookup against the whole destination set, etc.
d) Use an INSTEAD OF INSERT trigger on the destination table to move the duplicates to another table.
This is relatively straightforward but may affect performance; only recommended if you have no control over other parts of the process (see the second sketch after this list).
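To make option b) a bit more concrete, here is a rough T-SQL sketch, assuming a hypothetical target dbo.MyTable keyed on MyKey, an unconstrained import table dbo.MyTable_Import loaded straight from the flat file, and a dbo.MyTable_Duplicates table for the rejects (all object and column names are placeholders, not from your actual schema):

-- 1. Park the duplicates first: extra copies of a key within the file,
--    plus any key that already exists in the target table.
INSERT INTO dbo.MyTable_Duplicates (MyKey, Col1, Col2)
SELECT s.MyKey, s.Col1, s.Col2
FROM (
    SELECT MyKey, Col1, Col2,
           ROW_NUMBER() OVER (PARTITION BY MyKey ORDER BY (SELECT NULL)) AS rn
    FROM dbo.MyTable_Import
) AS s
WHERE s.rn > 1
   OR EXISTS (SELECT 1 FROM dbo.MyTable AS t WHERE t.MyKey = s.MyKey);

-- 2. Then load exactly one copy of each remaining key into the real table.
INSERT INTO dbo.MyTable (MyKey, Col1, Col2)
SELECT s.MyKey, s.Col1, s.Col2
FROM (
    SELECT MyKey, Col1, Col2,
           ROW_NUMBER() OVER (PARTITION BY MyKey ORDER BY (SELECT NULL)) AS rn
    FROM dbo.MyTable_Import
) AS s
WHERE s.rn = 1
  AND NOT EXISTS (SELECT 1 FROM dbo.MyTable AS t WHERE t.MyKey = s.MyKey);

The data flow then only has to target dbo.MyTable_Import, and the two statements above can run in an Execute SQL Task afterwards.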
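For option d), a hypothetical sketch of the trigger (same placeholder names as above; an INSTEAD OF trigger is not fired recursively by its own insert into the same table, so the second insert below hits the table directly):

CREATE TRIGGER dbo.trg_MyTable_RouteDuplicates
ON dbo.MyTable
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Divert duplicates: extra copies of a key within this insert batch,
    -- plus any key already present in the table.
    INSERT INTO dbo.MyTable_Duplicates (MyKey, Col1, Col2)
    SELECT i.MyKey, i.Col1, i.Col2
    FROM (
        SELECT MyKey, Col1, Col2,
               ROW_NUMBER() OVER (PARTITION BY MyKey ORDER BY (SELECT NULL)) AS rn
        FROM inserted
    ) AS i
    WHERE i.rn > 1
       OR EXISTS (SELECT 1 FROM dbo.MyTable AS t WHERE t.MyKey = i.MyKey);

    -- Load the rest.
    INSERT INTO dbo.MyTable (MyKey, Col1, Col2)
    SELECT i.MyKey, i.Col1, i.Col2
    FROM (
        SELECT MyKey, Col1, Col2,
               ROW_NUMBER() OVER (PARTITION BY MyKey ORDER BY (SELECT NULL)) AS rn
        FROM inserted
    ) AS i
    WHERE i.rn = 1
      AND NOT EXISTS (SELECT 1 FROM dbo.MyTable AS t WHERE t.MyKey = i.MyKey);
END;

One caveat: if the SSIS destination uses fast load (bulk insert), triggers are skipped unless the destination is told to fire them (the FIRE_TRIGGERS fast load option), so check that before relying on this approach.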
August 10, 2015 at 11:30 am
Thank you.
August 10, 2015 at 11:54 am
studySQL (8/7/2015)
I have a data flow that brings the data from a flat file (.txt) into a table, but after processing a few rows it fails with this error: Violation of PRIMARY KEY constraint 'PK_tableprimarykey'. Cannot insert duplicate key in object 'dbo.table'.
I want to send all the duplicate data to another table or flat file. How should I do it? What tool do I need to use to get the duplicates? Please let me know. Thanks for your time.
One of the cardinal rules of ETL processing is to leave the imported data intact in a "staging table". Generally, that data is then copied to a "CLEAN" table (also without constraints), where it can have records marked as "error records". One then typically selects only NON-error records from the CLEAN table for insert into production. Both the staging and clean tables usually also get a "batch number" column to uniquely identify the import batch, and are free of any constraints or defaults, save for a possible IDENTITY field as a primary key. SSIS is definitely in the ETL category (Extract, Transform, and Load). FYI...
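A minimal sketch of that shape, with purely illustrative names and an in-batch duplicate check standing in for whatever error rules apply (none of this is taken from an actual schema):

-- Staging: the file exactly as imported, plus a batch number and an IDENTITY key.
CREATE TABLE dbo.Stage_MyFile (
    StageID int IDENTITY(1,1) PRIMARY KEY,
    BatchNo int NULL,
    MyKey   varchar(50) NULL,     -- loose types, no business constraints here
    Col1    varchar(100) NULL
);

-- Clean: same shape, plus columns to flag error records instead of rejecting them.
CREATE TABLE dbo.Clean_MyFile (
    CleanID   int IDENTITY(1,1) PRIMARY KEY,
    BatchNo   int NULL,
    MyKey     varchar(50) NULL,
    Col1      varchar(100) NULL,
    IsError   bit NULL,
    ErrorText varchar(200) NULL
);

DECLARE @BatchNo int = 1;   -- identifies the current import batch

-- Copy the batch from staging to clean, flagging in-batch duplicate keys as errors.
INSERT INTO dbo.Clean_MyFile (BatchNo, MyKey, Col1, IsError, ErrorText)
SELECT s.BatchNo, s.MyKey, s.Col1,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY s.MyKey ORDER BY s.StageID) > 1
            THEN 1 ELSE 0 END,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY s.MyKey ORDER BY s.StageID) > 1
            THEN 'Duplicate key within batch' END
FROM dbo.Stage_MyFile AS s
WHERE s.BatchNo = @BatchNo;

-- Also flag keys that already exist in production.
UPDATE c
SET    IsError = 1,
       ErrorText = 'Key already exists in production'
FROM dbo.Clean_MyFile AS c
WHERE c.BatchNo = @BatchNo
  AND c.IsError = 0
  AND EXISTS (SELECT 1 FROM dbo.MyTable AS t WHERE t.MyKey = c.MyKey);

-- Only non-error rows move on to the production table.
INSERT INTO dbo.MyTable (MyKey, Col1)
SELECT c.MyKey, c.Col1
FROM dbo.Clean_MyFile AS c
WHERE c.BatchNo = @BatchNo
  AND c.IsError = 0;

The SSIS data flow then only needs to load dbo.Stage_MyFile; everything after that is set-based SQL that can be audited and rerun per batch.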
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
August 10, 2015 at 12:08 pm
sgmunson (8/10/2015)
studySQL (8/7/2015)
I have a data flow that brings the data from a flat file (.txt) into a table, but after processing a few rows it fails with this error: Violation of PRIMARY KEY constraint 'PK_tableprimarykey'. Cannot insert duplicate key in object 'dbo.table'.
I want to send all the duplicate data to another table or flat file. How should I do it? What tool do I need to use to get the duplicates? Please let me know. Thanks for your time.
One of the cardinal rules of ETL processing is to leave the imported data intact in a "staging table". Generally, that data is then copied to a "CLEAN" table (also without constraints), where it can have records marked as "error records". One then typically selects only NON-error records from the CLEAN table for insert into production. Both the staging and clean tables usually also get a "batch number" column to uniquely identify the import batch, and are free of any constraints or defaults, save for a possible IDENTITY field as a primary key. SSIS is definitely in the ETL category (Extract, Transform, and Load). FYI...
I don't agree with this being a cardinal rule. If you do this, then it's ELT, not ETL.
I agree it's an option, but not that it is a cardinal rule.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
August 10, 2015 at 1:20 pm
Alvin Ramard (8/10/2015)
sgmunson (8/10/2015)
studySQL (8/7/2015)
I have a data flow that brings the data from a flat file (.txt) into a table, but after processing a few rows it fails with this error: Violation of PRIMARY KEY constraint 'PK_tableprimarykey'. Cannot insert duplicate key in object 'dbo.table'.
I want to send all the duplicate data to another table or flat file. How should I do it? What tool do I need to use to get the duplicates? Please let me know. Thanks for your time.
One of the cardinal rules of ETL processing is to leave the imported data intact in a "staging table". Generally, that data is then copied to a "CLEAN" table (also without constraints), where it can have records marked as "error records". One then typically selects only NON-error records from the CLEAN table for insert into production. Both the staging and clean tables usually also get a "batch number" column to uniquely identify the import batch, and are free of any constraints or defaults, save for a possible IDENTITY field as a primary key. SSIS is definitely in the ETL category (Extract, Transform, and Load). FYI...
I don't agree with this being a cardinal rule. If you do this, then it's ELT, not ETL.
I agree it's an option, but not that it is a cardinal rule.
I guess you could call it ELT, but then you are redefining what constitutes "load". Traditionally, the load portion is where you actually load the records into a production table, and loading the staging and clean tables is really the extract side of the process.
I'm sure we could start a flame war over the semantics here, but with an entire industry full of ETL coders and processors, I doubt this would suddenly cause them all to start using the term "ELT". Besides, ETL is largely based on the concept of taking something from one place to another, and because the two places are often not exactly alike, you have to "fix" some of the data to get it to go; thus the "transform" part is almost always in the middle. Given the broad acceptance of the term, it's probably not going to go away any time soon.
Heck, I've even run into folks who suggest that because so many staging tables have no data types other than varchar or nvarchar, the transform comes first, but again, that's semantics. I always just ask them how they got the data to load into the staging tables, and that usually settles the question.
The reason I consider it a "cardinal rule" is that for traceability, audit, and troubleshooting, any sizable environment is better served by never destroying the inputs. It gives you a table-based data source you can always go back to and see where the data came from. If it's properly designed and implemented, you can usually trace any given piece of data all the way back to its source.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
August 10, 2015 at 3:00 pm
I have a colleague who seems to be a big fan of ELTLTTLTTTTL processes. It's just not enterprisey enough if there aren't multiple redundant copies of the data.
I prefer a guideline of attempting to make an ETL package smart enough to not have to store the row multiple times, or at least to minimize the number of inserts. I won't say "cardinal rule", because there will always be exceptions or conflicting requirements.
August 10, 2015 at 6:42 pm
Scott Coleman (8/10/2015)
I have a colleague who seems to be a big fan of ELTLTTLTTTTL processes. It's just not enterprisey enough if there aren't multiple redundant copies of the data. I prefer a guideline of attempting to make an ETL package smart enough to not have to store the row multiple times, or at least to minimize the number of inserts. I won't say "cardinal rule", because there will always be exceptions or conflicting requirements.
Keep in mind that many companies in the financial industry have little choice in the matter, given existing regulations. Besides, not so many years ago, NOT destroying your inputs was considered prudent programming, and there was as much good reason for it then as there is now. Of course, having more than one original copy of the data in a table isn't necessary, and more than one copy of "cleaned" data is overkill. Having one staging table and one clean table is NOT overkill; it's just a rather good solution to auditing, troubleshooting, and data management requirements / regs.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)