October 20, 2008 at 11:13 am
I have a situation where I need to move a huge number of records from one set of tables into another set of tables (within the same database). The data was given to me in the form of some huge CSV files and I have now imported the data from the CSVs into holding tables. I now need to get the data from the holding tables into the live tables.
Unfortunately this isn't just a straight movement of records. For each of these tables we are doing Update/Inserts: updating records if they are already present in the destination tables (matching on code or ID fields) and inserting records that aren't present in the destination tables. Some of the update/inserts involve inner joins between two or three tables. Also, because some of the data in the source tables is duplicated (as far as the destination tables are concerned), I'm having to run several of the updates like this:
UPDATE myDestTable
SET
    dField1 = s.sField1,
    dField2 = s.sField2
FROM
    myDestTable
    INNER JOIN
    (SELECT DISTINCT
         itemcode,
         sField1,
         sField2
     FROM
         mySourceTable) AS s
        ON myDestTable.itemcode = s.itemcode
I am running a series of 16 commands and some of these tables have >7 million records, so the script takes ages. My problem is that I get the following message after around 7.5 hours of the script running:
Could not allocate a new page for database 'TEMPDB'
because of insufficient disk space in filegroup 'DEFAULT'.
Create the necessary space by dropping objects in the
filegroup, adding additional files to the filegroup, or
setting autogrowth on for existing files in the filegroup.
I gather that this is the standard message when tempdb fills up, and I guess the reason for this is that my commands are moving so much data about. The drive that holds the tempdb.mdf file has 66GB free before I start the operation, but is out of space when it fails.
I am at the testing stage at the moment, but once I've got this working it'll be a one-off operation in the live environment. Could someone give me some advice on this? Should I just break the operation down into smaller sets of commands, shrinking down the tempdb log files after each one completes? As you can tell I don't have a great deal of experience with SQL operations of this magnitude... Thanks,
Tom
October 20, 2008 at 11:43 am
You may have some better luck if you remove the derived tables. If I remember correctly, when you use a derived table in your join the way you are in the example code, once it gets to the point where it can't be held in memory any longer it spills into tempdb. Why not just do an inner join directly to the table, but only use the columns you need? Also remember that tempdb handles all of your sorts, ORDER BYs and such. If you can reduce that a bit you may find you'll use tempdb less and won't fill your drive.
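For what it's worth, a minimal sketch of that direct join, reusing the table and column names from the example above (note that if mySourceTable has more than one row per itemcode, the destination row will just be updated with whichever matching row the engine happens to pick):
UPDATE d
SET    dField1 = s.sField1,
       dField2 = s.sField2
FROM   myDestTable AS d
       INNER JOIN mySourceTable AS s
           ON d.itemcode = s.itemcode;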
Of course, since this is a one-off import, breaking it up into chunks that can easily be processed will most likely yield better results. Import, say, 200,000 records at a time. I've seen examples where importing 1 million rows in ten segments of 100,000 yielded much better results than doing the whole million at the same time. There are a number of articles on this site which would point you in the right direction for altering your procedure in that manner.
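Something along these lines might work for batching the insert half, assuming itemcode identifies a row and each itemcode maps to a single set of values in the source (names are reused from the example above and the batch size is just an illustration):
DECLARE @batchSize INT;
SET @batchSize = 200000;

WHILE 1 = 1
BEGIN
    -- insert the next batch of source rows that aren't in the destination yet
    INSERT INTO myDestTable (itemcode, dField1, dField2)
    SELECT TOP (@batchSize)
           s.itemcode, s.sField1, s.sField2
    FROM  (SELECT DISTINCT itemcode, sField1, sField2
           FROM mySourceTable) AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM myDestTable AS d
                      WHERE d.itemcode = s.itemcode);

    -- stop once a batch comes back short
    IF @@ROWCOUNT < @batchSize BREAK;
END;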
-Luke.
October 21, 2008 at 4:46 am
Hi Luke,
Thanks for your advice. I think I need the derived tables because there are sometimes 30 or more rows in the source table where I need just a single row in the destination table. So I need to use the DISTINCT command to ensure that I don't import duplicates. But I'm sure you're right that this is the area that is causing the problem.
I'm currently trying to break the full script down into smaller chunks, monitoring the disk space as I go. I guess another approach would be to delete the duplicate rows from the source table first, then I could do away with the DISTINCT subquery.
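If it comes to that, one way to clear the dups (assuming SQL 2005 or later, and that rows count as duplicates when itemcode, sField1 and sField2 all match) would be something like:
WITH numbered AS
(
    SELECT ROW_NUMBER() OVER
               (PARTITION BY itemcode, sField1, sField2
                ORDER BY itemcode) AS rn
    FROM mySourceTable
)
DELETE FROM numbered
WHERE rn > 1;   -- keeps one row from each identical set
Cheers,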
Tom
October 21, 2008 at 4:58 am
Another good practice when doing frequent imports of that size is:
First, check which target rows have to be changed and which have to be inserted
(push them into a worktable via
select tgt.PrimaryKeyColumn, src.anyColumnsYouNeed
into workTable
from sourcetable src
left outer join targettable tgt
    on tgt.someColumn = src.someColumn
where
    (src.SomeOtherColumn <> tgt.SomeOtherColumn)
    OR (src.SomeEvenOtherColumn <> tgt.SomeEvenOtherColumn)
    OR (tgt.someColumn is NULL)   -- no match in the target, so this is a new row
)
Then update only the target rows that have to be changed
and insert the new ones from the worktable.
This prevents unnecessary updates on the majority of target rows,
which makes even more sense if there are DML triggers firing
on the target table.
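A rough sketch of those two steps, reusing the placeholder names from the query above and assuming the worktable also carries the source columns you need (someColumn, SomeOtherColumn, SomeEvenOtherColumn):
-- 1) update target rows that already exist but have changed
UPDATE tgt
SET    tgt.SomeOtherColumn     = w.SomeOtherColumn,
       tgt.SomeEvenOtherColumn = w.SomeEvenOtherColumn
FROM   targettable tgt
       INNER JOIN workTable w
           ON w.PrimaryKeyColumn = tgt.PrimaryKeyColumn;

-- 2) insert the rows that had no match in the target (their PrimaryKeyColumn came back NULL)
INSERT INTO targettable (someColumn, SomeOtherColumn, SomeEvenOtherColumn)
SELECT w.someColumn, w.SomeOtherColumn, w.SomeEvenOtherColumn
FROM   workTable w
WHERE  w.PrimaryKeyColumn IS NULL;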
developing robust and performant database applications with Microsoft SQL Server
October 21, 2008 at 6:28 am
Removing the dups would probably be a good idea, as that's that many fewer rows you have to deal with. You could even do something like a SELECT DISTINCT INTO a worktable, along the lines of what Age2uN said about splitting the inserts from the updates... If you have up to 30 dups per distinct row, even with a conservative estimate you're talking about dealing with something like 20-30 times more rows than you need to. That in and of itself should gain you a large amount of performance and lessen the tempdb usage.
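For example, something along these lines (dbo.sourceDeduped is just a made-up name):
SELECT DISTINCT itemcode, sField1, sField2
INTO   dbo.sourceDeduped   -- one row per distinct combination, stored in your own DB rather than tempdb
FROM   mySourceTable;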
-Luke.