Massive Inserts

  • Currenlty I have huge amounts of data going into a table.

    I'm sending an xmldoc and using openxml with a cursor to seed them.

    the question I have is whether to let duplicate keyed data rows bounce

    and then check @@error and then do an update on the nokeyed field

    or

    to do a select on the keyed field and then do an insert or update based on the

    selects results.

    Speed is my goal.

  • I expect it depends primarily on the percentage of rows you expect to fail due to a unique key constraint. The second most significant factor I suspect would be the number and type of indexes and the order in which they are processed. I don't know what SQL Server does first on an INSERT. Whether it allocates data space first, checks the primary key, checks other unique constraints, or something else. But, since both the primary key and alternate unique constraints can cause the insert to fail the number of such indexes will make SQL Server take longer to identify the error and increase the number of changes it needs to back out.

    As a guess I would think a failed insert would take twice as long as a successful one. If you figure an update to take about the same amount of time as an insert and a select to take 3/4 the time of an insert then the time would be as follows where "t" = time to perform an insert.

    Successful insert = 1t

    Failed insert with update = 3t

    Select with Insert = 1.75t

    Select with Update = 1.75t

    If x = the percentage of rows being inserted that require an UPDATE:

    100 rows total time with select = ( 1 - x ) * 1.75t + x * 1.75t = 1.75t

    100 rows total time without select = ( 1 - x ) * 1t + x * 3t = t + 2xt

    Solving 1.75t = t + 2xt gives x = 0.375 meaning the two approaches are equally efficient when 37.5% of the rows require an update. Any percentage lower than that would perform better using the failed insert approach.

    Since I made some major guesses earlier on that number isn't necessarily very accurate and the relative numbers would depend upon the DDL, amount of data, and distribution of data. I suspect that I chose pessimistic numbers and that the equality percentage is actually higher.

    If anyone has actually determined realistic numbers for a real world situation I'd be interested in them. I'd also be interested in knowing what steps SQL Server takes to implement an INSERT (will it cause a page split before identifying a constraint violation).

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply