Deleting large batches of rows - optimum batch size?

  • Eric M Russell (10/19/2015)


    it's indicative of poor ETL processing or poor data modeling. I mean, why insert something and then delete it?

    We have interfaces with other systems where data is purely transient - once the data has been properly handled by the receiving system, the data in the tables used for the exchanges is no longer useful. Why wouldn't we delete the data when no longer needed?

    Sometimes hours or even days can elapse before the interface packages are confirmed as received and accepted, so we have nightly jobs that do the cleanup.
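
    Since the thread is about batch size: a minimal sketch of how such a nightly cleanup might delete in batches rather than in one large transaction (the table, column, and status value here are hypothetical, and the batch size is something to tune per system):

        DECLARE @BatchSize int = 5000;   -- tune: bigger batches scan less often but hold more log
        DECLARE @Rows int = 1;

        WHILE @Rows > 0
        BEGIN
            DELETE TOP (@BatchSize)
            FROM dbo.InterfacePackage            -- hypothetical exchange table
            WHERE PackageStatus = 'receipt confirmed';

            SET @Rows = @@ROWCOUNT;
            -- In FULL recovery, log backups between batches keep the log from growing;
            -- in SIMPLE recovery, each committed batch frees log space for reuse.
        END;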

  • Kristen-173977 (10/20/2015)


    Lynn Pettis (10/19/2015)


    If deleting 90% of a table, SELECT * INTO to create a temporary table of the data to retain, build the appropriate indexes on the temporary table, truncate the original table, then ALTER TABLE ... SWITCH to move the data back to the original table.

    I'm curious how this compares with New Table and Rename?

    With SWITCH, downtime is from the start of the SELECT * INTO to the end of the SWITCH.

    With rename, downtime is from the start of the SELECT * INTO to the end of the RENAME.

    Both cases presumably have the same issues with dropping/creating indexes, constraints, and FKeys. Would anything else "object" to having, for example, a table DROPped and then re-appear after a rename? I don't think VIEWs need refreshing, do they?

    This is where testing comes into play. For the DB data cut, the only thing I needed on the temporary table was the indexes that existed on the source table. The default constraints didn't need to be created on the temporary table. The database doesn't have any foreign keys (not my design, and I'm sort of stuck with it).

    One other thing about this method: the object_id doesn't change, which also means you still have the original created/modified datetimes in the sys.tables system view.

    The SWITCH and RENAME are both metadata-only changes; a sketch of the SWITCH approach, with the rename variant, follows below.
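
    A minimal T-SQL sketch of the approach described above, with hypothetical table and column names; the staging table must match the original exactly (columns, indexes, filegroup) for the SWITCH to succeed, so treat this as an outline rather than a drop-in script:

        SELECT *
        INTO   dbo.BigTable_Keep          -- hypothetical names throughout
        FROM   dbo.BigTable
        WHERE  KeepRow = 1;               -- the ~10% of rows to retain

        -- Rebuild the original table's indexes on the staging table
        CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_Keep ON dbo.BigTable_Keep (Id);

        TRUNCATE TABLE dbo.BigTable;      -- fast, minimally logged

        -- Metadata-only move of the retained rows back into the original table
        ALTER TABLE dbo.BigTable_Keep SWITCH TO dbo.BigTable;

        DROP TABLE dbo.BigTable_Keep;

        -- The rename variant instead keeps the new table and swaps names:
        -- EXEC sp_rename 'dbo.BigTable', 'BigTable_Old';
        -- EXEC sp_rename 'dbo.BigTable_Keep', 'BigTable';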

  • Kim Crosser (10/20/2015)


    Eric M Russell (10/19/2015)


    it's indicative of poor ETL processing or poor data modeling. I mean, why insert something and then delete it?

    We have interfaces with other systems where data is purely transient - once the data has been properly handled by the receiving system, the data in the tables used for the exchanges is no longer useful. Why wouldn't we delete the data when no longer needed?

    Sometimes hours or even days can elapse before the interface packages are confirmed as received and accepted, so we have nightly jobs that do the cleanup.

    Why import it into SQL Server, rather than staging the records to flat files which can be archived or deleted when no longer needed?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (10/20/2015)


    Why import it into SQL Server, rather than staging the records to flat files which can be archived or deleted when no longer needed?

    The database tables are used to control the generation of the packages and to track their status. Packages can be in various states - "ready to build XML package", "package built, awaiting transmission", "transmitted, awaiting receipt confirmation", "receipt confirmed". (A rough sketch of such a control table follows at the end of this post.)

    Various events in the application can cause it to post a record to the database requesting that a package be built and sent. A separate application service periodically checks the status and builds and posts the XML packages. Another service picks up prepared XML packages and transfers them to the external system through web services on an intermediate server. A separate service receives responses from the external server and marks packages as accepted or rejected. Finally, a nightly job cleans out all the packages that have been successfully accepted by the external system.

    A system administrator can control the interface through simple SQL utilities, requesting that packages be retransmitted, or regenerated and retransmitted, when the receiving system has problems. Sometimes network and/or remote server problems delay receipt confirmation for days, and sometimes changes to the remote system cause mass rejections of packages; then we have to get them to fix their system and retransmit everything that was rejected, which could be tens of thousands of packages.

    Why do you think flat files would be preferred over BLOB data in SQL Server? I used to think that separate flat files were a better approach, but somewhere around SQL 2005 all the BLOB handling issues seem to have been fixed, and keeping the data in the database with the reference information (not necessarily in the same table, but in the same transaction) means that you cannot get caught with database pointers to missing files, or flat files with no database references.

    BTW - we can't use SQL FileTables, as we still have many agencies on SQL 2008, and FileTables didn't appear until SQL Server 2012. And (open question - enquiring minds want to know) would FileTables be preferred over storage in an XML or VARBINARY field?
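
    As a rough illustration of the control-table shape described above (all names and state strings are hypothetical; the actual schema isn't shown in the thread):

        CREATE TABLE dbo.InterfacePackage
        (
            PackageId     bigint IDENTITY(1,1) PRIMARY KEY,
            PackageStatus varchar(50) NOT NULL
                CONSTRAINT CK_InterfacePackage_Status CHECK (PackageStatus IN
                    ('ready to build XML package',
                     'package built, awaiting transmission',
                     'transmitted, awaiting receipt confirmation',
                     'receipt confirmed',
                     'rejected')),
            Payload       xml NULL,       -- or varbinary(max) for opaque content
            CreatedAt     datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
        );

        -- An administrator re-queuing everything the remote system rejected:
        UPDATE dbo.InterfacePackage
        SET    PackageStatus = 'package built, awaiting transmission'
        WHERE  PackageStatus = 'rejected';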
