Incremental Update on Very Large Table

  • Jim -

    I've got to side with Jeff on this one - been there, done that, even against 300-400 million row / 200GB+ databases, and it's possible to write distance calculations that return subsecond results.  Calculating distance between two points on the fly is definitely less expensive than trying to keep a multi-million row table updated, synchronized, etc.

    If your developers can define the algorithm they're using to tie these sites together sufficiently to build the table, it's highly likely that the algorithm can be modified to calculate the same results on the fly.  Probably the hardest thing I've ever tried to explain to developers and junior DBAs is that the query optimizer is right/smarter than you are 99.00% of the time - and when it's wrong it's usually the DBA's fault (e.g. out-of-date statistics) or due to a fundamental misunderstanding of how to work with sets (e.g. cursors, temp tables).

    Any chance you can get the developers to share their secret sauce?  Would it help to have examples of functions/queries that can/will calculate distance between two points on the fly?
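
    For illustration, and without knowing the developers' actual algorithm, a great-circle distance can be computed inline. The sketch below assumes a hypothetical Nodes(NodeID, Lat, Lon) table with coordinates in degrees and uses the spherical law of cosines with a 6371 km earth radius (CROSS APPLY needs SQL Server 2005 or later; on SQL 2000 the expression can be repeated or wrapped in a function):

    -- Hypothetical schema: Nodes(NodeID, Lat, Lon), coordinates in degrees.
    -- Great-circle distance in km; the CASE clamps float rounding so ACOS stays in its domain.
    SELECT a.NodeID AS FromNode, b.NodeID AS ToNode,
           6371.0 * ACOS(CASE WHEN c.CosAngle > 1.0  THEN 1.0
                              WHEN c.CosAngle < -1.0 THEN -1.0
                              ELSE c.CosAngle END) AS DistKm
    FROM   Nodes a
    JOIN   Nodes b ON b.NodeID > a.NodeID
    CROSS APPLY (SELECT SIN(RADIANS(a.Lat)) * SIN(RADIANS(b.Lat))
                      + COS(RADIANS(a.Lat)) * COS(RADIANS(b.Lat))
                      * COS(RADIANS(b.Lon) - RADIANS(a.Lon)) AS CosAngle) c

    In practice you would restrict the join (e.g. to a bounding box around the point of interest) rather than generate every pair of nodes.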

    Joe

     

  • If you do the updates, I would have to assume that there would NOT be any reorganizing of the indexes except for new locations, which is why I think that approach should work better.  You don't have the distance indexed, do you?

    I suspect at this point your issue may be locks/blocking from the reads that are occurring, and not that the update is really that painful.  Consider using snapshots for your reads, or even going with snapshot isolation mode. 
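
    If blocking does turn out to be the problem, row versioning takes the readers out of the equation; a minimal sketch, assuming SQL Server 2005 or later and a placeholder database name:

    ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON
    -- Optionally make versioned reads the default behaviour of READ COMMITTED
    -- (this switch needs the database free of other active connections):
    ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON
    -- Reader sessions can also opt in explicitly:
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT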

    No matter how I look at it, I cannot see how doing the delete then an insert could gain you anything here.

  • We have a table with over 5 million rows (soon to be over 20 million) that stores nodes and the distance between them.

    First, I'm with everyone else: calculate as needed. If they can do the calculation, so can you.

    These values are re-calculated every hour and need to be re-inserted into this table every hour. However, on any given run, only a small percentage of the rows will change.

    Where is the data coming from? If possible, require that they only give you changed values, or at least mark the changed rows; then when you dump the data into your staging table you are only getting what you need.

    1. update T1 set colDist = ST.colDist from ST join T1 on ST.colA = T1.colA and ST.colB = T1.colB and ST.colDist <> T1.colDist

    2. delete from T1 where not exists ( select * from ST where colA = T1.colA and colB = T1.colB)

    3. insert T1 (colA, colB, colDist) select ST.colA, ST.colB, ST.colDist from ST where not exists ( select * from T1 where colA = ST.colA and colB = ST.colB)

    Assuming that T1 is the live table and ST is the staging table:

    1. DELETE from ST anything that hasn't changed:  T1.ColDist = ST.ColDist.  (Since no one is using ST but you, it doesn't matter how long this takes.)
    2. UPDATE Changed Rows:  T1.ColDist = ST.ColDist
    3. DELETE FROM ST those rows just updated:  T1.colDist = ST.ColDist
    4. INSERT Remaining rows from ST into T1

    Depending on your performance, you may want to swap steps 2 and 4 (see the sketch below).
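
    For concreteness, here is one way steps 1-4 might look in T-SQL, reusing the T1/ST names and columns from the statements above (an outline only, not a tuned batch):

    -- 1. Trim unchanged rows out of the staging table
    DELETE ST FROM ST JOIN T1 ON T1.colA = ST.colA AND T1.colB = ST.colB WHERE T1.colDist = ST.colDist
    -- 2. Everything left in ST that matches T1 is a changed row: update it
    UPDATE T1 SET colDist = ST.colDist FROM ST JOIN T1 ON T1.colA = ST.colA AND T1.colB = ST.colB
    -- 3. Remove the just-updated rows from the staging table
    DELETE ST FROM ST JOIN T1 ON T1.colA = ST.colA AND T1.colB = ST.colB WHERE T1.colDist = ST.colDist
    -- 4. Whatever remains in ST is new: insert it
    INSERT T1 (colA, colB, colDist) SELECT colA, colB, colDist FROM ST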

    In order to limit locking you may want to use a loop like:

    SET ROWCOUNT 5000
    UPDATE T1 SET colDist = ST.colDist FROM ST JOIN T1 ON ST.colA = T1.colA AND ST.colB = T1.colB WHERE ST.colDist <> T1.colDist
    WHILE @@ROWCOUNT > 0
        UPDATE T1 SET colDist = ST.colDist FROM ST JOIN T1 ON ST.colA = T1.colA AND ST.colB = T1.colB WHERE ST.colDist <> T1.colDist
    SET ROWCOUNT 0
     
    Good Luck,
    JimFive
  • I didn't notice this originally, but as ColDist is a float (consider Decimal/Numeric instead) it is possible that the above loop could become an infinite loop due to rounding issues.
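
    If the column has to stay a float, one way to guard against that is to compare with a tolerance instead of strict inequality; a sketch, with an illustrative tolerance value:

    UPDATE T1 SET colDist = ST.colDist
    FROM   ST JOIN T1 ON ST.colA = T1.colA AND ST.colB = T1.colB
    WHERE  ABS(ST.colDist - T1.colDist) > 0.000001  -- tolerance value is illustrative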

    Also, your update statement, while appropriate for T-SQL, is not standard SQL.

    Standard SQL would be something like

    UPDATE T1
    SET ColDist = (SELECT ST.ColDist FROM ST WHERE ST.ColA = T1.ColA AND ST.ColB = T1.ColB)
    WHERE EXISTS (SELECT * FROM ST WHERE T1.ColA = ST.ColA AND T1.ColB = ST.ColB AND T1.ColDist <> ST.ColDist)

    I doubt it makes a performance difference, though (if anything, I would expect the standard SQL to be slower).

    --

    JimFive

  • I really don't like any of the responses that suggest switching between an active and a passive table while you reload the passive table.  As stated in the first post, most of the updates do not change the underlying data, so reloading everything is forcing volatility onto stable data.

    In order to be performant, I think you need to reduce the volatility to just those rows that need changing.  A good starting point for the design is the traditional transaction file / master file merge.

    Probably the best solution to this in SQL Server terms is the MERGE statement in SQL Server 2008.  It may be that this facility will drive the business case for an early move to SQL Server 2008.
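
    For reference, a MERGE along those lines might look like the sketch below (SQL Server 2008 syntax, reusing the T1/ST names from earlier in the thread); rows that match but have an unchanged distance fall through untouched, which is the "no action" case:

    MERGE T1 AS tgt
    USING ST AS src
       ON tgt.colA = src.colA AND tgt.colB = src.colB
    WHEN MATCHED AND tgt.colDist <> src.colDist THEN
        UPDATE SET colDist = src.colDist
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (colA, colB, colDist) VALUES (src.colA, src.colB, src.colDist)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;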

    In SQL 2000 or SQL 2005, I would look at: 1) create a staging table with your new data; 2) create a driver table by doing a LOJ between your transaction table and master table, storing an action character of I(nsert), U(pdate), D(elete), or N(o action).  You can then drive the required inserts, updates and deletes from this table.  The I, U and D codes should be obvious.  The N code is where your transaction data exactly matches the master data, to avoid forcing SQL to do an update that does not change anything.
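
    A sketch of the driver-table idea (the DriverTable and ActionCode names are only illustrative, and a full outer join is used so the D case is covered as well as I/U/N):

    SELECT COALESCE(ST.colA, T1.colA) AS colA,
           COALESCE(ST.colB, T1.colB) AS colB,
           ST.colDist,
           CASE WHEN T1.colA IS NULL          THEN 'I'
                WHEN ST.colA IS NULL          THEN 'D'
                WHEN ST.colDist <> T1.colDist THEN 'U'
                ELSE 'N' END AS ActionCode
    INTO   DriverTable
    FROM   ST
    FULL OUTER JOIN T1 ON T1.colA = ST.colA AND T1.colB = ST.colB

    -- Then drive each action from the driver table, e.g. the updates:
    UPDATE T1 SET colDist = d.colDist
    FROM   DriverTable d JOIN T1 ON T1.colA = d.colA AND T1.colB = d.colB
    WHERE  d.ActionCode = 'U'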

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • He's gotta load all the data into a staging table, anyway... why not use it?  End result will be the same except the switch will be much, much faster...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi All,

    I've ended up using the incremental update solution with a staging table. It turns out that fewer than 5% of the rows change at any given update. While this does take longer than building the view to reference one table and then switching the view to reference another, it is a much simpler approach, with fewer dependencies.

    Once this release is out, and I have some more time, I'm going to sit with the developer and get into the details of the distance calculations he's doing and see if I can move that into SQL.

    Jim

    Regards, Jim C

  • Another 'why not use a view' reason:

    You cannot alter a view if someone is using it at that moment!

    Johan

    Learn to play, play to learn!

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I? Sometimes this is me but most of the time this is me

  • Cool... thanks for the feedback, Jim.

    --Jeff Moden


