Primary key / DTS load question

  • I'm loading data into a table via DTS from a text file. The table in question has a primary key. If one of the rows in the text file is a duplicate, or any rows already exist in the table, then all of the rows fail to load.

    One way round this is to load the data into a staging table with no constraints.

    I then have an INSERT trigger on the main table. The trigger fetches the rows from the inserted pseudo-table into a cursor and checks each one for a duplicate in the main table. If there is a duplicate, the row is loaded into a violation table; if not, it is loaded into the main table as normal.
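    (For illustration, such a trigger might look roughly like this - names simplified; it would need to be an INSTEAD OF INSERT trigger, since an AFTER trigger would only fire after the primary key check had already rejected the batch.)

    CREATE TRIGGER trMainTable_Insert ON MainTable
    INSTEAD OF INSERT
    AS
    BEGIN
        DECLARE @PKField INT, @OtherFields SQL_VARIANT
        DECLARE curIns CURSOR LOCAL FAST_FORWARD FOR
            SELECT PKField, OtherFields FROM inserted
        OPEN curIns
        FETCH NEXT FROM curIns INTO @PKField, @OtherFields
        WHILE @@FETCH_STATUS = 0
        BEGIN
            IF EXISTS (SELECT 1 FROM MainTable WHERE PKField = @PKField)
                -- Duplicate key: divert the row to the violation table
                INSERT INTO ViolationTable (PKField, OtherFields)
                VALUES (@PKField, @OtherFields)
            ELSE
                -- New key: pass the row through to the main table
                INSERT INTO MainTable (PKField, OtherFields)
                VALUES (@PKField, @OtherFields)
            FETCH NEXT FROM curIns INTO @PKField, @OtherFields
        END
        CLOSE curIns
        DEALLOCATE curIns
    END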

    The problem with this is that performance takes a big hit if you are loading any more than, say, 50,000 rows.

    Anything I've read seems to suggest you should avoid cursors wherever possible.

    Informix Dynamic Server used to (and may still!) do this for you with violation and diagnostic tables.

    Any ideas/best practice suggestions appreciated.

  • -- Assuming, for simplicity's sake, that your main table is defined like:

    CREATE TABLE MainTable (
        PKField INT PRIMARY KEY,
        OtherFields SQL_VARIANT /* or whatever.... */ NOT NULL
    )

    -- I would create a staging table with an additional identity field, thus:

    CREATE TABLE StagingData (
        IdentityField INT IDENTITY,
        PKField INT NOT NULL,
        OtherFields SQL_VARIANT /* or whatever.... */ NOT NULL
    )

    -- Then a view that eliminates duplicates within the staging table and
    -- also excludes staging data that duplicates rows already in the
    -- main table

    CREATE VIEW LoadableIDs
    AS
    SELECT LoadableID = MIN(IdentityField)
    FROM StagingData
    WHERE NOT EXISTS (SELECT PKField FROM MainTable
                      WHERE MainTable.PKField = StagingData.PKField)
    GROUP BY PKField

    -- Then...
    --
    -- Load all text data to the staging table
    --
    -- copy all rows from [StagingData] to [ViolationTable]
    -- WHERE IdentityField NOT IN (SELECT LoadableID FROM LoadableIDs)
    --
    -- copy all rows from [StagingData] to [MainTable]
    -- WHERE IdentityField IN (SELECT LoadableID FROM LoadableIDs)
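    -- Spelled out, those two copy steps might look like this (a sketch,
    -- assuming [ViolationTable] has the same PKField/OtherFields shape
    -- as the staging table):

    INSERT INTO ViolationTable (PKField, OtherFields)
    SELECT PKField, OtherFields
    FROM StagingData
    WHERE IdentityField NOT IN (SELECT LoadableID FROM LoadableIDs)

    INSERT INTO MainTable (PKField, OtherFields)
    SELECT PKField, OtherFields
    FROM StagingData
    WHERE IdentityField IN (SELECT LoadableID FROM LoadableIDs)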

    Cheers,
    - Mark

  • I would add just this:

    In DTS (at the end), run a stored proc that takes the unique rows from the StagingData table and puts them - using that EXISTS check - into your final table, MainTable.
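    A minimal sketch of such a procedure, reusing the tables from the post above (the procedure name here is just illustrative):

    CREATE PROCEDURE usp_LoadMainFromStaging
    AS
    -- Insert one row per distinct key from staging, skipping any key
    -- that already exists in the main table
    INSERT INTO MainTable (PKField, OtherFields)
    SELECT s.PKField, s.OtherFields
    FROM StagingData s
    WHERE s.IdentityField IN (SELECT MIN(IdentityField)
                              FROM StagingData
                              GROUP BY PKField)
      AND NOT EXISTS (SELECT 1 FROM MainTable m
                      WHERE m.PKField = s.PKField)
    GO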

    Regards

  • Also, you might look into the Transform Data Task Properties - Options tab. It has a couple of features that deal with errors during a transform; click the Help button on that tab for details. Specifically I'm referring to "Max error count" and "Insert batch size", which in combination allow a certain number of errors to occur while still committing the remainder of the records. It may not be the best way to go for you, but it is another option.

  • Use exception handling in DTS. SQL Server 2000 provides much better exception handling than SQL Server 7. I've run into this problem several times where a mainframe extract writes the same record more than once, because something changed while the mainframe extract was running.

  • This is the way I handle knowing whether or not the key value for the data in my text file is a duplicate of what is in the database, though I'm guessing I'll get dinged on the inefficiency of this approach.

    I use a DTS data driven query (DDQ), and I use the ActiveX script transformation. Our ETL strategy is to send the data from the production system to the warehouse whenever there's a change, which means there's always a chance of getting duplicates. If a duplicate key value is found in one of the ETL text files, it is handled as an update of the existing row rather than an exception, and if the key value in the data row is not found in the table, it is handled as an insert. The DDQ task allows you to process a text file one row at a time and conditionally insert, update, or delete records. Here's an example:

    CREATE TABLE Project (
        proj_id varchar(10) PRIMARY KEY,
        proj_desc varchar(50)
    )

    In the DDQ, I created a lookup named Project, which contains the following SQL:

    SELECT COUNT(*) AS Expr1
    FROM Project
    WHERE (proj_id = ?)

    As long as the database server maintains the integrity of the primary key, this lookup returns either 1 or 0.

    Then there is the following ActiveX script transform:

    '**********************************************************************
    '  Visual Basic Transformation Script
    '**********************************************************************
    '  Copy each source column to the destination column

    Dim exists

    Function Main()
        ' Returns 1 if the incoming key already exists in Project, else 0
        exists = DTSLookups("Project").Execute(DTSSource("PROJ_ID"))
        If (exists = 1) Then
            ' Key found: route the row to the parameterized UPDATE query
            DTSDestination("proj_id") = DTSSource("PROJ_ID")
            DTSDestination("proj_desc") = DTSSource("PROJ_DESC")
            Main = DTSTransformStat_UpdateQuery
        Else
            ' Key not found: route the row to the parameterized INSERT query
            DTSDestination("proj_id") = DTSSource("PROJ_ID")
            DTSDestination("proj_desc") = DTSSource("PROJ_DESC")
            Main = DTSTransformStat_InsertQuery
        End If
    End Function

    So if the lookup returns 1, the PK exists already, and the code processes the data and updates the existing record. If it returns 0, the PK does not exist, and the code inserts the new record. You can modify this approach to skip rows and write them to your own exception file (DTSTransformStat_SkipRow is the return code constant that handles skipping), or you can use the exception handling routines within DTS (SQL Server 2000) as recommended in an earlier post.

    The final step for my example is to create both the Insert and the Update queries on the Query tab of the DDQ task, and set the appropriate parameters. The Wrox-published book "SQL Server 2000 DTS" (a plug for Mark Chaffin, Brian Knight, and Todd Robinson) was instrumental in teaching me what I learned about using the DDQ task.
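    For completeness, the two parameterized queries on the Query tab would be along these lines (the parameter-to-column bindings are set on that same tab):

    INSERT INTO Project (proj_id, proj_desc)
    VALUES (?, ?)

    UPDATE Project
    SET proj_desc = ?
    WHERE proj_id = ?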

    I would be interested in feedback from the initial poster and others about the efficiency (or lack thereof) of this approach. I have used this technique for my relatively small data warehousing application, and while I'm not setting any land speed records, the DTS processing time is a small percentage of the time spent in the overall ETL process -- most of it is spent in the cube processing that takes place after I'm done with DTS. A SELECT on the primary key (which is already indexed by virtue of being a primary key) per row is the cost of this approach, but I always know whether the row of data is to be inserted or updated.

    Feel free to email me if you have any questions or need clarification on the specifics of this use of the DDQ task (drmccue@uiuc.edu).

  • Thanks to all for your suggestions.

    mccork's view methodology has worked best here, as the performance hit is very small on large loads (200,000 - 2M rows).

    The DDQ of drmccue works fine on anything under 1,000 rows. Anything above that and the individual processing of each row slows things right down.

  • Wow, almost 4 years later...

    I was reading this thread because it was a topic I searched for, and thought I would add my $.02:

    It's always MUCH better to use LEFT JOINs than NOT EXISTS. Especially in a situation like this where people are talking about 200,000-2M rows!
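    For reference, the LEFT JOIN form of the earlier duplicate check would be something like this (a sketch against the StagingData/MainTable definitions from the second post):

    SELECT s.PKField, s.OtherFields
    FROM StagingData s
    LEFT JOIN MainTable m
           ON m.PKField = s.PKField
    WHERE m.PKField IS NULL -- keep only staging rows with no match in the main table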


    Rick Todd

  • I agree... NOT EXISTS implies a correlated sub-query which is another form of RBAR (pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row").

    Now, the interesting part is that a WHERE NOT IN has slightly better performance than an outer join/null search but only, of course, if you only need to search a single column.

    On the original DTS problem about failing on the Primary Key dupe... define the Primary Key (or a unique index) on the table with the IGNORE_DUP_KEY option and your problems are over... no staging table, no chance of duplicates in the target table, no batch failures if dupes exist in the source file.
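    A sketch of that setup, assuming the MainTable shape from earlier in the thread (SQL Server 2000 syntax shown; 2005 and later spell it WITH (IGNORE_DUP_KEY = ON)):

    CREATE TABLE MainTable (
        PKField INT NOT NULL,
        OtherFields SQL_VARIANT NOT NULL
    )

    -- Duplicate keys in a load are discarded with a warning instead of
    -- rolling back the whole batch
    CREATE UNIQUE CLUSTERED INDEX UCX_MainTable_PKField
    ON MainTable (PKField)
    WITH IGNORE_DUP_KEY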

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yeah... me too... sorry you had to wait 4 years for the answer...

    --Jeff Moden

  • p.p.s. Why the hell are you using DTS for such a simple BULK INSERT task, anyway... you like "slow" or something?
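    For the record, the straight load into the staging table would be something like this (the file path and terminators are illustrative; if the staging table keeps the IDENTITY column from earlier, a format file or a view over the other columns is needed to map the file's fields):

    BULK INSERT StagingData
    FROM 'C:\loads\extract.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')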

    --Jeff Moden
