The best way to load big chunks of rolling data

Question

The best way to load big chunks of rolling data

valeryk2000

SSCarpal Tunnel

Points: 4245
More actions
April 20, 2018 at 8:50 am

#344839

We have monthly loads of a THREE month of data: say, in April it's Jan, Feb, and Mar, in May - Feb, Mar, and Apr. This is because at the day of loads not all data for the latest month are complete, some records are missing, other may be updated, and so on (these are medical data on patient visits, procedures, lab, etc.). We need to add these data to the master table. Actually there are more than 20 table to be loaded
The strategy is as follows:
1. we find the minimum date in the 'load' file
2. delete all records from the 'master' table after this date - it may be from200,000 records up to 1,000,000.
3. insert the entire load table into the 'master' table
HOWEVER deleting will add a lot to the log.
Another approach:
1. we find the minimum date in the 'load' file
2. insert all records before this date from the 'master' to #temp
3. truncate the 'master' table
4. insert into master table everything from both '#temp' and 'load' tables
While this plan looks better for the log file we do not feel easy about truncating the 'master' table.
Can you suggest the best way?
Thanks
.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004432 More actions · Answer 1

valeryk2000 - Friday, April 20, 2018 8:50 AM
We have monthly loads of a THREE month of data: say, in April it's Jan, Feb, and Mar, in May - Feb, Mar, and Apr. This is because at the day of loads not all data for the latest month are complete, some records are missing, other may be updated, and so on (these are medical data on patient visits, procedures, lab, etc.). We need to add these data to the master table. Actually there are more than 20 table to be loaded
The strategy is as follows:
1. we find the minimum date in the 'load' file
2. delete all records from the 'master' table after this date - it may be from200,000 records up to 1,000,000.
3. insert the entire load table into the 'master' table
HOWEVER deleting will add a lot to the log.
Another approach:
1. we find the minimum date in the 'load' file
2. insert all records before this date from the 'master' to #temp
3. truncate the 'master' table
4. insert into master table everything from both '#temp' and 'load' tables
While this plan looks better for the log file we do not feel easy about truncating the 'master' table.
Can you suggest the best way?
Thanks
.

Partition the table by month. Use SWITCH to nearly instantly drop the old monthly partitions and add the new ones.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 2

We had, until recently, a weekly load of around 50 million rows into a table of 12,000,000,000 rows. It was taking about 8 hours, much of which was rebuilding the two ordinary indexes. Changing this to partition switching brought the time down to about 20 minutes, including rebuilding the indexes on the partition.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Jeff Moden SSC Guru Points: 1004432 More actions · Answer 3

ChrisM@Work - Friday, April 20, 2018 9:48 AM
We had, until recently, a weekly load of around 50 million rows into a table of 12,000,000,000 rows. It was taking about 8 hours, much of which was rebuilding the two ordinary indexes. Changing this to partition switching brought the time down to about 20 minutes, including rebuilding the indexes on the partition.

If you were to also make the change from partitioned tables to partitioned views, you could have different indexes based on how old the data was. You can also more easily transfer just a couple of partitions to a test system than you can with partitioned tables (which is impossible to do in a straight forward manner).

Either way, I agree that partitioning is the way to go for something like this and it doesn't just help the load/delete processes. If the partitions become "static" in nature after a month or two, you can set such partitions to READ_ONLY, take one final backup, and never have to back them up again. I've got a single table 1.1TB database with such data in it and it contains monthly partitions since May of 2010. It takes 0-6 minutes to back up depending on where we're at for the current month because I only have to backup the current month and the next month and the next month is always empty. 🙂

This is the table I wish I had used a partitioned view on instead of partitioning it as a table.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)