Advice - data cleaning after import

  • I import a couple of text files (CSV) into a temporary table from a .NET application using SqlBulkCopy. This all works fine, but after the import I need to perform data-cleaning operations, and it's this part that I'm trying to develop. The operations are listed below.

    How should I perform them? Should I make separate calls to stored procedures from the app, or run a series of stored procedures one after the other from within SQL Server? We're talking about maybe a few hundred records here, so not vast quantities of data. The reason I'm considering running the processes separately for the two text files is that the WHERE clauses are different in the two cases (one way of packaging the steps is sketched after the list below).

    Thanks for any advice.

    [ed]I'm not asking for advice on how to create the queries, but how they should be run.[/ed]

    1) Delete the null record (for some reason SqlBulkCopy inserts an additional empty row into the table; I think it might have something to do with the text file not having a header row).

    2) Update TempTestType to 'Combined'

    3) DELETE FROM TABLE
       WHERE Result LIKE '+'
          OR Result LIKE 'T'
          OR Result LIKE 'U'

    4) INSERT INTO (Combined)
       WHERE Hospital = '00003006'
          OR tblPersonalDataTemp.Hospital = '00003056'
          OR tblPersonalDataTemp.Hospital = '00003058'
          OR tblPersonalDataTemp.Hospital = '00003059'
          OR tblPersonalDataTemp.Hospital = '00003060'
          OR tblPersonalDataTemp.Hospital = '00003064'
          OR tblPersonalDataTemp.Hospital = '00003065'
          OR tblPersonalDataTemp.Hospital = '00003066'
          OR tblPersonalDataTemp.Hospital = '00003067'
          OR tblPersonalDataTemp.Hospital = '00003068'
          OR tblPersonalDataTemp.Hospital = '00003069'

    5) Delete From Temporary Table

    6) Repeat for other text file
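
    As a sketch of the single-procedure option mentioned above: steps 1 to 5 could be wrapped in one stored procedure so the app makes a single call per file. The procedure name usp_CleanPersonalDataTemp is made up; tblPersonalDataTemp, TempTestType, Result, Hospital, and Combined come from the steps above. The test for the empty row and the assumption that Combined has the same column layout as the temp table are mine:

    CREATE PROCEDURE usp_CleanPersonalDataTemp
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRANSACTION;

        -- 1) Remove the spurious empty row left behind by the bulk copy
        --    (assumes the empty row has NULLs in these columns)
        DELETE FROM tblPersonalDataTemp
        WHERE Result IS NULL AND Hospital IS NULL;

        -- 2) Set the test type in one pass
        UPDATE tblPersonalDataTemp
        SET TempTestType = 'Combined';

        -- 3) Discard unwanted results (LIKE with no wildcard behaves as =)
        DELETE FROM tblPersonalDataTemp
        WHERE Result IN ('+', 'T', 'U');

        -- 4) Copy the rows of interest to the permanent table
        --    (assumes Combined matches the temp table's column layout)
        INSERT INTO Combined
        SELECT *
        FROM tblPersonalDataTemp
        WHERE Hospital IN ('00003006', '00003056', '00003058', '00003059',
                           '00003060', '00003064', '00003065', '00003066',
                           '00003067', '00003068', '00003069');

        -- 5) Empty the temp table ready for the next file
        DELETE FROM tblPersonalDataTemp;

        COMMIT TRANSACTION;
    END;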

  • Instead of moving all the data into the temporary table and then running the processing against it, I'd suggest you put a WHERE clause on the bulk copy so you don't even move data that you're just going to delete anyway. For the UPDATE, modify the load so that it captures the right value during the load instead of processing it a second time. As much as is practicable when using SQL Server, stick to batch operations (it's not always an option). Where you can't, run the additional processing steps using stored procedures. That'll make for more efficient processing.
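
    If the load itself stays in .NET, one way to capture that value during the load on the SQL Server side is a default on the staging column; rows loaded by SqlBulkCopy pick up the default as long as the column isn't included in the column mappings. A sketch, assuming TempTestType is the column being set to 'Combined' (the constraint name is made up):

    -- Give the staging column its value at load time instead of with a
    -- second-pass UPDATE (assumes the column is left out of the
    -- SqlBulkCopy column mappings so the default applies).
    ALTER TABLE tblPersonalDataTemp
        ADD CONSTRAINT DF_TempTestType DEFAULT ('Combined') FOR TempTestType;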

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (5/11/2016)



    Thanks for the advice!

  • M Joomun (5/11/2016)



    I would advise collecting the data into a staging table first; then, whenever you run a process against that table, save those changes to yet another table.

    That gives you stages you can step back through incrementally should some future process fail, and comparison tables to inspect if any doubt arises about whether your data scrubbing is working.
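
    For instance, the OUTPUT clause can save each step's changes as part of the same statement. A sketch against step 3 of the original post (the audit table tblPersonalDataTempDeleted is made up, and is assumed to have the same layout as the temp table):

    -- Keep a copy of whatever this cleaning step removes, so a failed run
    -- can be inspected or the rows put back.
    DELETE FROM tblPersonalDataTemp
    OUTPUT deleted.* INTO tblPersonalDataTempDeleted
    WHERE Result IN ('+', 'T', 'U');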
