Importing/Updating data to an existing table

  • Oh brilliant minds, young grasshopper seeks your guidance:

     

    Nightly, data is exported from a non-SQL Server database to a text file.  Then, via DTS, the data is imported and appended to an existing table in a SQL Server 2000 database.  Sometimes the data contains new records; sometimes the data is meant to update existing records.  However, the data is merely appended to the existing data, so the result is duplicate records or multiple records with the same primary field.  (The SQL Server database does not have any primary keys.)

     

    How can I set up the import so:

    1) If the primary field exists, then the existing data is updated.

    2) If the primary field does not exist, then the new data is appended (inserted) to the table.

     

    Can this be done directly through the import job?  Or should the data be first imported into a disposable (temp) table and then imported into the existing table?

     

    David

     

     

  • Hi David,

     

    Create a stored procedure that checks whether the primary key already exists and then does the update or insert on the table. Then, in the DTS designer, import the text file into a temp table and call your stored procedure. It would be a different story if you had an ETL tool such as Data Mirror or Informatica.

     

    Hopefully this answers your question; it may not be the only solution.

     

    Cheers

  • You might be able to do this with a data pump, but I'd follow Lanny's advice. It's easy and quick. Import into a staging table and set up a proc for the last step.

    DTS:

    - truncate staging table

    - import data

    - call proc

    The proc should do an update on PK matches and then an insert of what doesn't match. Or delete the matches after the update and insert everything else.
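
    One reading of that second option, sketched here with made-up table and column names (KeyID standing in for the primary field): delete the target rows whose keys appear in the staging table, then insert the whole staging table.

    -- Sketch only: dbo.ExistingTable, dbo.StagingTable and the column names are illustrative.
    -- Remove any target rows whose key also arrives in tonight's staging data...
    DELETE T
    FROM dbo.ExistingTable AS T
        INNER JOIN dbo.StagingTable AS S
            ON T.KeyID = S.KeyID

    -- ...then insert everything from the staging table.
    INSERT INTO dbo.ExistingTable (KeyID, Col1, Col2)
    SELECT KeyID, Col1, Col2
    FROM dbo.StagingTable

    The update-plus-insert pair arguably touches fewer rows; the delete-and-reinsert version is a little simpler to write.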

  • A Data Driven Query task will do what you want, but it's slow because it processes the source data row by row.  I prefer the staging table/stored procedure approach mentioned above and have used it many times.

    Greg


  • If the nightly extract contains all the information you need in the table, then rather than writing procs to apply all the changes (Insert, Update, Delete), just wipe out the table and insert all the new data. That way any changes that need to be applied are done in one step.
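
    In DTS terms that could just be an Execute SQL task that wipes the table before the existing append step. The table name below is a placeholder, and TRUNCATE assumes no foreign keys reference the table (otherwise use DELETE):

    -- Sketch: run this as the first step of the nightly package, before the text-file import.
    TRUNCATE TABLE dbo.ExistingTable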


    Cheers,

    Ben Sullins
    bensullins.com
    Beer is my primary key...

  • Thanks for the help. Can you provide any how-to steps?

  • Add an Identity column and delete the dupes when the import is done... much faster than trying to build an "UPSERT" (merge) in SQL Server 2000.
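
    A rough sketch of that dupe delete, with invented names (RowID being the added identity column and KeyID the primary field): keep only the row with the highest identity value for each key.

    -- Sketch only: table and column names are assumptions, not from the original post.
    -- After the import appends everything, delete all but the newest row per key.
    DELETE A
    FROM dbo.ExistingTable AS A
        INNER JOIN dbo.ExistingTable AS B
            ON A.KeyID = B.KeyID
            AND A.RowID < B.RowID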

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I can delete the duplicate records, but what I need to do is delete the current record in favor of the imported record when the primary field already exists in the table.  The imported record is not necessarily an exact duplicate; the import may just result in a second record with the same primary field but different values in the remaining fields.  Perhaps I could add an IDENTITY column and simply delete a record where the primary field of the current table equals the primary field of the record to be imported and where the IDENTITY value is less than that of the incoming record.

     

    However, the imported data does not have an IDENTITY column and originates from a non-SQL Server database.  So would this have to be a two-part process: first import the data and then delete the older record?  Or can this all be incorporated into DTS or a single script?

  • The best answer was the three-stage DTS procedure. The proc would actually be simple -- an update and an insert statement.

    Let's say your SQL Server table has an identity column and four data columns (ID, C1, C2, C3, C4) and the staging table just has the data columns. Also, that C1 and C2 form the natural key -- these fields identify a duplicate row. So the proc would first update the non-key fields from duplicates:

    update dbo.OriginalTable 
    set C3 = ST.C3, C4 = ST.C4
    from dbo.OriginalTable as OT 
      inner join dbo.StagingTable as ST
        on OT.C1 = ST.C1 and OT.C2 = ST.C2;
    

    Then insert those rows from the staging table that are not duplicates:

    insert into dbo.OriginalTable (C1, C2, C3, C4)
    select C1, C2, C3, C4
    from dbo.StagingTable as ST
    where not exists (
        select *
        from dbo.OriginalTable as OT
        where OT.C1 = ST.C1 and OT.C2 = ST.C2);
    

    That's it. Of course, it would help to index the key field(s) of both tables.
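
    For example (the index names are just placeholders), something along these lines on both tables should let the joins above seek on the key instead of scanning:

    -- Sketch: supporting indexes on the natural-key columns used in the joins above.
    CREATE INDEX IX_OriginalTable_C1_C2 ON dbo.OriginalTable (C1, C2)
    CREATE INDEX IX_StagingTable_C1_C2 ON dbo.StagingTable (C1, C2)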

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • Tomm's code for an upsert is good... think I'd do it in a stored procedure rather than DTS, but it'll do the trick either way.

    I still use BULK INSERT and a dupe delete, though... speed freak, I guess.
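
    A minimal sketch of that BULK INSERT route, assuming a tab-delimited text file and a staging table whose columns match the file layout (the path and options here are placeholders):

    -- Sketch only: file path, terminators and table name are assumptions.
    BULK INSERT dbo.StagingTable
    FROM 'C:\Imports\nightly_extract.txt'
    WITH (
        FIELDTERMINATOR = '\t',  -- tab-delimited source file (assumption)
        ROWTERMINATOR = '\n',    -- one record per line
        FIRSTROW = 2             -- skip a header row, if the file has one
    )

    The dupe delete would then run as a second step, along the lines of the identity-column approach above.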

    --Jeff Moden



  • Thanks for all the input. I think I got it! Thanks again everyone.

  • What did you end up with, David?

    --Jeff Moden



  • I'm just gonna let the nightly import via DTS run and then set up a job to check for and delete the older record. This is the stored procedure that will run after the import.

    CREATE PROCEDURE nwDuplicateCheckerAll_Title
    AS
    SET NOCOUNT ON

    BEGIN TRANSACTION DuplicateChecker

    -- Delete the older of any two rows that share the same KeyID and DateAdd,
    -- keeping the row with the higher identity value (the most recent import).
    DELETE A
    FROM All_Title A
    INNER JOIN All_Title B
        ON A.KeyID = B.KeyID
        AND A.[DateAdd] = B.[DateAdd]
    WHERE A.LastColumn < B.LastColumn

    COMMIT TRANSACTION DuplicateChecker

    SET NOCOUNT OFF

    I'm treating the KeyID and DateAdd as a composite primary key. I added an identity column - "LastColumn." How does it look?

  • So, if I read this correctly, you are inserting all the records into the table, then going through and deleting the older duplicates. This seems straightforward but I have a few questions.

  • You say you are "treating" KeyID and DateAdd as a composite key. Are they the key or not? Does KeyID normally serve as the primary key? If so, why are you keeping records with the same KeyIDs but different DateAdds?
  • What is the primary key for All_Title? Why do you disable it to allow for duplicates, even if for a moment?
  • From the time the duplicate records are entered until the time they are removed, the table is in an inconsistent state -- where the table is outside the scope of the definition of the entity it is supposed to contain. Even if this is only momentary... and you don't foresee anything going wrong between the two operations -- wouldn't it be better if the table was never in that state at all?
  • You have apparently created an identity column just for the purpose of facilitating this operation. Personally, I don't like doing that. I make sure such "utility" data is absolutely necessary and, even when it is, I don't put it in the same table. That makes utility information part of the definition of the entity and it's not. Even Date_last_modified and User_who_modified information should be placed in separate utility tables designed for that purpose.
  • You've had at least a couple of suggestions here that don't raise the concerns I've listed (I'm making some assumptions on the Bulk Insert idea). It's your database, of course, but you asked.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms
