June 6, 2003 at 5:15 am
I'm loading data into a table via DTS from a text file. The table in question has a primary key. If one of the rows in the text file is a duplicate, or already exists in the table, then all the rows fail to load.
One way round this is to load the data into a staging table with no constraints.
I then have an insert trigger on the main table. The trigger fetches the rows from inserted into a cursor and checks each one for a duplicate in the main table. If there is a duplicate it loads the row into a violation table; if not, it loads it correctly into the main table.
Problem with this is that performance takes a big hit if you are loading, say, any more than 50,000 rows.
Anything I've read seems to suggest you should avoid cursors wherever possible.
Informix Dynamic Server used to (and may still!) do this for you with violation and diagnostic tables.
Any ideas/best practice suggestions appreciated.
June 6, 2003 at 8:16 pm
-- Assuming, for simplicity sake, that your main table is defined like:
CREATE TABLE MainTable (
PKField INT PRIMARY KEY,
OtherFields SQL_VARIANT /* or whatever.... */ NOT NULL
)
-- I would create a staging table with an additional identity field, thus:
CREATE TABLE StagingData (
IdentityField INT IDENTITY,
PKField INT NOT NULL,
OtherFields SQL_VARIANT /* or whatever.... */ NOT NULL
)
-- Then a view that eliminates duplicates within the staging table and
-- also excludes staging data that duplicates rows already in the
-- main table
CREATE VIEW LoadableIDs
AS
SELECT LoadableID = MIN(IdentityField)
FROM StagingData
WHERE NOT EXISTS (SELECT PKField From MainTable WHERE MainTable.PKField = StagingData.PKField)
GROUP BY PKField
-- Then...
--
-- Load all text data to the staging table
--
-- copy all rows from [StagingData] to [ViolationTable]
-- WHERE IdentityField NOT IN (SELECT LoadableID FROM LoadableIDs)
--
-- copy all rows from [StagingData] to [MainTable]
-- WHERE IdentityField IN (SELECT LoadableID FROM LoadableIDs)
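-- For reference, a minimal sketch of those last two steps (it assumes a
-- ViolationTable with the same PKField/OtherFields layout as the staging data):
INSERT INTO ViolationTable (PKField, OtherFields)
SELECT PKField, OtherFields
FROM StagingData
WHERE IdentityField NOT IN (SELECT LoadableID FROM LoadableIDs)

INSERT INTO MainTable (PKField, OtherFields)
SELECT PKField, OtherFields
FROM StagingData
WHERE IdentityField IN (SELECT LoadableID FROM LoadableIDs)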
Cheers,
- Mark
June 8, 2003 at 9:47 pm
I would add just this:
in DTS (at the end), run a stored proc that takes the unique rows from the StagingData table and puts them - using that "exists" statement - into your final table, "MainTable".
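A minimal sketch of such a proc, reusing Mark's table names (the procedure name is made up, and it assumes the duplicates are only against MainTable rather than within the staging data itself - if the file can repeat keys, filter through the LoadableIDs view instead):
CREATE PROCEDURE usp_LoadMainTableFromStaging
AS
    -- Move only the rows whose keys are not already in the main table
    INSERT INTO MainTable (PKField, OtherFields)
    SELECT s.PKField, s.OtherFields
    FROM StagingData s
    WHERE NOT EXISTS (SELECT 1 FROM MainTable m WHERE m.PKField = s.PKField)
GO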
Regards
June 9, 2003 at 6:59 am
Also, you might look into the Transform Data Task Properties - Options tab. It has a couple of features that deal with errors during a transform. Click the help button for that tab. Specifically I'm referring to "Max Error Count" and "Insert Batch Size", which in combination allow a certain number of errors to occur while still committing the remainder of the records. It may not be the best way to go for you, but it is another option.
June 9, 2003 at 6:59 am
Use exception handling in DTS. SQL Server 2000 provides much better exception handling than SQL Server 7. I've run into this problem several times where a mainframe extract writes the same record more than once, because something changed while the mainframe extract was running.
June 9, 2003 at 3:48 pm
This is the way I handle knowing whether or not the key value for the data in my text file is a duplicate of what is in the database, though I'm guessing I'll get dinged on the inefficiency of this approach.
I use a DTS data driven query (DDQ), and I use the ActiveX script transformation. Our ETL strategy is to send the data from the production system to the warehouse whenever there's a change, which means there's always a chance of getting duplicates. If a duplicate key value is found in one of the ETL text files, it is handled as an update of the existing row rather than an exception, and if the key value in the data row is not found in the table, it is handled as an insert. The DDQ task allows you to process a text file one row at a time and conditionally insert, update, or delete records. Here's an example:
CREATE TABLE Project (
proj_id varchar(10) PRIMARY KEY,
proj_desc varchar(50)
)
In the DDQ, I created a lookup named Project, which contains the following SQL:
SELECT COUNT(*) AS Expr1
FROM Project
WHERE (proj_id = ?)
As long as the database server maintains the integrity of the primary key, this lookup returns either 1 or 0.
Then there is the following ActiveX script transform:
'**********************************************************************
' Visual Basic Transformation Script
'************************************************************************
' Copy each source column to the destination column
Dim exists
Function Main()
    ' Look up the incoming key in the Project table (returns 1 or 0)
    exists = DTSLookups("Project").Execute(DTSSource("PROJ_ID"))
    If (exists = 1) Then
        ' Key already exists - run the update query
        DTSDestination("proj_id") = DTSSource("PROJ_ID")
        DTSDestination("proj_desc") = DTSSource("PROJ_DESC")
        Main = DTSTransformStat_UpdateQuery
    Else
        ' Key not found - run the insert query
        DTSDestination("proj_id") = DTSSource("PROJ_ID")
        DTSDestination("proj_desc") = DTSSource("PROJ_DESC")
        Main = DTSTransformStat_InsertQuery
    End If
End Function
So if the lookup returns 1, the PK exists already, and the code processes the data and updates the existing record. If it returns 0, the PK does not exist, and the code inserts the new record. You can modify this approach to skip rows and write them to your own exception file (DTSTransformStat_SkipRow is the return code constant that handles skipping), or you can use the exception handling routines within DTS (SQL Server 2000) as recommended in an earlier post.
The final step for my example is to create both the Insert and the Update queries on the Query tab of the DDQ task, and set the appropriate parameters. The Wrox-published book "SQL Server 2000 DTS" (a plug for Mark Chaffin, Brian Knight, and Todd Robinson) was instrumental in teaching me what I learned about using the DDQ task.
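For anyone following along, a rough sketch of what those two parameterized queries might look like for the Project table above (the parameter order is simply whatever you map on the Query tab):
-- Insert query
INSERT INTO Project (proj_id, proj_desc)
VALUES (?, ?)

-- Update query
UPDATE Project
SET proj_desc = ?
WHERE proj_id = ?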
I would be interested in feedback from the initial poster and others about the efficiency (or lack thereof) of this approach. I have used this technique for my relatively small data warehousing application, and while I'm not setting any land speed records, the DTS processing time is a small percentage of the time spent in the overall ETL process -- most of it is spent in the cube processing that takes place after I'm done with DTS. A SELECT on the primary key (which is already indexed by virtue of being a primary key) per row is the cost of this approach, but I always know whether the row of data is to be inserted or updated.
Feel free to email me if you have any questions or need clarification on the specifics of this use of the DDQ task (drmccue@uiuc.edu).
June 12, 2003 at 3:37 am
Thanks to all for your suggestions.
The view methodology from mccorks has worked best here, as the performance hit is very small on large loads (200,000 - 2m rows).
The DDQ from drmccue works fine on anything under 1,000 rows. Anything above that and the individual processing of each row slows things right down.
March 8, 2007 at 11:11 am
Wow, almost 4 years later...
I was reading this thread because it was a topic I searched for, and thought I would add my $.02:
It's always MUCH better to use LEFT JOINs than NOT EXISTS, especially in a situation like this where people are talking about 200,000-2M rows!
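Something like this, taking the LoadableIDs view from the earlier post as an example (same tables, just rewritten as a LEFT JOIN / IS NULL check):
SELECT LoadableID = MIN(s.IdentityField)
FROM StagingData s
LEFT JOIN MainTable m ON m.PKField = s.PKField
WHERE m.PKField IS NULL
GROUP BY s.PKField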
Rick Todd
March 12, 2007 at 7:39 pm
I agree... NOT EXISTS implies a correlated sub-query which is another form of RBAR (pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row").
Now, the interesting part is that a WHERE NOT IN has slightly better performance than an outer join/null search but only, of course, if you only need to search a single column.
On the original DTS problem about failing on the Primary Key dupe... define the Primary Key on the table with the IGNORE DUPES option (IGNORE_DUP_KEY) and your problems are over... no staging table, no chance of duplicates in the target table, no batch failures if dupes exist in the source file.
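A rough sketch using Mark's MainTable (the index name is made up, and it assumes the key is enforced by this unique index rather than a declared PK constraint; on SQL Server 2005 the same option can be set on the PK itself with WITH (IGNORE_DUP_KEY = ON)):
-- Duplicate keys in a load are discarded with a warning instead of failing the batch
CREATE UNIQUE CLUSTERED INDEX UX_MainTable_PKField
ON MainTable (PKField)
WITH IGNORE_DUP_KEY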
--Jeff Moden
Change is inevitable... Change for the better is not.
March 12, 2007 at 7:40 pm
Yeah... me too... sorry you had to wait 4 years for the answer...
--Jeff Moden
Change is inevitable... Change for the better is not.
March 12, 2007 at 7:42 pm
p.p.s. Why the hell are you using DTS for such a simple BULK INSERT task, anyway... you like "slow" or something?
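Something like this, as a rough sketch (the file path and terminators are made up for the example):
-- With IGNORE_DUP_KEY in place on MainTable, the text file can go straight in
-- and duplicate keys are simply discarded with a warning
BULK INSERT MainTable
FROM 'C:\loads\datafile.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)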
--Jeff Moden
Change is inevitable... Change for the better is not.