October 25, 2010 at 8:48 am
Situation: ProdDB contains the past three years of data; ArchiveDB contains everything else. A monthly process moves the oldest month of data from Prod to Arch. This is a base set of, on average, 20,000 cases plus their child entries. Some of those tables/datasets get quite large (one set is at least 966,948,920 rows). The initial process moved the records case by case, which of course was incredibly slow. I've refactored the process so it now runs in batches. It runs significantly faster, but I'm running into issues with tempdb.
ProdDB runs around 300 GB in size. TempDB is taking up over 100 GB, yet I am still getting warnings that we are exceeding the max size / running out of space on the disk. The only thing I can figure is that the space used by one batch is not being freed for use by the next batch. Any ideas?
Here's a sample query I'm using for the move:
INSERT INTO CFARCH01.ArchDB.cf.table (col1, col2, col3)
SELECT DISTINCT col1, col2, col3
FROM ProdDB.cf.table2 as chp
INNER JOIN ProdDB.cf.table1 as crp
ON chp.correspondence_id = chp.correspondence_id
INNER JOIN ProdDB.cf.case_archive as cap
ON crp.case_number = cap.case_number
AND cap.case_int BETWEEN 501 and 1000
(case_archive is populated at the beginning of the process to indicate which cases need to be moved)
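It's populated with something roughly like this (table and column names here are placeholders, not the vendor's actual schema):

-- Placeholder names only; case_int is just a sequential number used later for batching
INSERT INTO ProdDB.cf.case_archive (case_number, case_int)
SELECT c.case_number,
       ROW_NUMBER() OVER (ORDER BY c.case_number)
FROM ProdDB.cf.cases AS c                                   -- hypothetical source table
WHERE c.case_closed_date < DATEADD(MONTH, -36, GETDATE())   -- hypothetical date column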
There are 29 tables with data that gets moved, not all of them that large. Some tables have IMAGE columns. (it's a 3rd party app but the archiving process is semi-custom)
October 25, 2010 at 10:16 am
I take it you weren't running into issues before you started processing on a batch level?
What else is running when you're doing this?
TempDB holds the result sets of your batches while it spools them to the new database. If you're having I/O issues, the data will keep "holding" in TempDB until it gets fed to the new DB. Are both databases on the same instance? Have you checked your PerfMon lately?
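One quick way to see what is actually eating tempdb (a rough check, not a full diagnosis):

-- Rough breakdown of tempdb space by type (page counts are 8 KB pages)
SELECT SUM(user_object_reserved_page_count) * 8 / 1024.0     AS user_objects_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024.0 AS internal_objects_mb,
       SUM(version_store_reserved_page_count) * 8 / 1024.0   AS version_store_mb,
       SUM(unallocated_extent_page_count) * 8 / 1024.0       AS free_mb
FROM tempdb.sys.dm_db_file_space_usage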
October 25, 2010 at 10:29 am
You could have a look here: http://msdn.microsoft.com/en-us/library/ms176029.aspx
October 25, 2010 at 10:31 am
Are both databases, or even all three databases, on the same drives?
Are your log files and data files separated onto different drives?
Do you have enough RAM to run this process?
October 25, 2010 at 10:48 am
Brandie:
Correct, before the switch to a batch process we did not have tempdb issues.
There are possibly some reports running (originating on Arch server but querying Prod - that's changing) but that would be about it.
ProdDB and ArchDB are on two separate VMWare servers. The issue is with the tempdb on the Prod server.
Ninja:
We are running in a SAN environment with balancing on the SAN; the log and data files are assigned to different "drives" within it.
The two virtual servers are isolated from the rest of the VM machines. Prod has 8 GB RAM allocated, Arch has 14 GB RAM defined.
October 25, 2010 at 10:57 am
Pam,
Just a note. Virtual servers make no difference if they're using the same SAN setup. You need to find out if they are.
500 records should not be causing 100 GB of growth in any database. This is why I asked if you'd checked your PerfMon (Performance Monitor) for I/O issues lately. Ninja is also correct in that you should verify your RAM. PerfMon can also monitor that.
Look at the default trace from when this was run. See if you can find what was happening at the instant TempDB got those errors. Also look at the execution plan for your query. Is something getting looped?
This strikes me as a possible infinite loop situation, but you have no loop.
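If it helps, here's one way to read the default trace around the time of the errors (just a sketch; which events matter will vary, and it assumes the default trace hasn't been turned off):

-- Find the default trace file for this instance and read it
DECLARE @path NVARCHAR(260)

SELECT @path = path
FROM sys.traces
WHERE is_default = 1

SELECT StartTime, EventClass, DatabaseName, ApplicationName, TextData
FROM sys.fn_trace_gettable(@path, DEFAULT)
ORDER BY StartTime DESC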
October 25, 2010 at 12:28 pm
Brandie, the initial set is 500 cases, but for those cases there are 29 tables, and some of those tables hold millions of entries per table. Granted, that's still too much growth given the size of the entire DB, but not quite as alarming as it seems.
I can get through two batches of 200 cases each just fine; it's filling up / erroring out on the third batch. The insert statement showing up in the error is the one above (obviously names changed, but that's it). For that particular table and batch, 5,895,720 rows must be moved.
Our monitoring tools are showing no pegging of RAM or CPU. On the Prod server, highest CPU is 40%, highest RAM is around 70%. Arch server CPU max 27.5%, RAM max 15%.
There is a loop on the batch to get through the full 20,000+ cases / month, but this is not an infinite loop situation. This code is tested and QA'd (was just on a smaller data set so we didn't have the tempdb issues).
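For reference, the batch loop is roughly this shape (variable names are made up, not the actual code):

DECLARE @first INT, @last INT, @batchSize INT, @maxCase INT
SET @batchSize = 200
SET @first = 1
SELECT @maxCase = MAX(case_int) FROM ProdDB.cf.case_archive

WHILE @first <= @maxCase
BEGIN
    SET @last = @first + @batchSize - 1
    -- ... run the 29 INSERT ... SELECT statements for cases
    --     with case_int BETWEEN @first AND @last ...
    SET @first = @last + 1
END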
October 25, 2010 at 1:01 pm
Hi,
Are you moving your data through a linked server?
Are you using dynamic SQL to execute the batches?
Have you examined the tables inside tempdb?
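For example, something as simple as this shows which user tables currently exist in tempdb (just a starting point):

-- List user objects sitting in tempdb, newest first
SELECT name, create_date
FROM tempdb.sys.tables
ORDER BY create_date DESC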
Greets
pitcher
[font="Arial"]Kind regards,
Patrick Fiedler
Consultant
October 25, 2010 at 1:07 pm
Pitcher:
Yes, it's a linked server. Query runs on Prod to gather the data, inserts into Arch via linked server.
Yes, it's dynamic SQL for the inserts. Terribly ugly, I know, but unavoidable since I'm dealing with third-party software / database design.
No, I've not (yet) looked at the tables/data within tempdb. As this is a production system with high visibility and a high demand for uptime, I need to tread as lightly as possible.
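For what it's worth, each dynamic statement is roughly this shape (same placeholder names as the sample query above; the variables here are made up):

-- Sketch of one dynamic insert over the linked server; names are placeholders
DECLARE @tableName SYSNAME, @first INT, @last INT, @sql NVARCHAR(MAX)
SET @tableName = N'table2'
SET @first = 501
SET @last = 1000

SET @sql = N'INSERT INTO CFARCH01.ArchDB.cf.' + QUOTENAME(@tableName) + N' (col1, col2, col3)
SELECT DISTINCT src.col1, src.col2, src.col3
FROM ProdDB.cf.' + QUOTENAME(@tableName) + N' AS src
INNER JOIN ProdDB.cf.case_archive AS cap
ON src.case_number = cap.case_number
WHERE cap.case_int BETWEEN @first AND @last'

EXEC sp_executesql @sql, N'@first INT, @last INT', @first = @first, @last = @last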
October 25, 2010 at 1:35 pm
One idea would be to pull the data into Arch through a linked server defined on Arch that points back to Prod. Then you can watch whether tempdb on Prod stops growing; if tempdb on Arch starts growing instead, you could analyse the individual tables in tempdb on Arch.
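Roughly like this (CFPROD01 is a made-up name for a linked server defined on Arch that points to Prod; just a sketch with the same placeholder names):

-- Same kind of insert, but run on the Arch server and pulling from Prod
INSERT INTO ArchDB.cf.some_table (col1, col2, col3)
SELECT DISTINCT src.col1, src.col2, src.col3
FROM CFPROD01.ProdDB.cf.some_table AS src
INNER JOIN CFPROD01.ProdDB.cf.case_archive AS cap
ON src.case_number = cap.case_number
WHERE cap.case_int BETWEEN 501 AND 1000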
Dynamic SQL is not bad in every case. You might try dynamic SQL when the execution plan inside a proc is suboptimal, since the statements inside the dynamic SQL get a new execution plan each time.
Greets
pitcher
[font="Arial"]Kind regards,
Patrick Fiedler
Consultant
October 25, 2010 at 11:11 pm
Pam Brisjar (10/25/2010)
Here's a sample query I'm using for the move:
INSERT INTO CFARCH01.ArchDB.cf.table (col1, col2, col3)
SELECT DISTINCT col1, col2, col3
FROM ProdDB.cf.table2 as chp
INNER JOIN ProdDB.cf.table1 as crp
ON chp.correspondence_id = chp.correspondence_id
INNER JOIN ProdDB.cf.case_archive as cap
ON crp.case_number = cap.case_number
AND cap.case_int BETWEEN 501 and 1000
Two questions:
1) Is it a typo in the code you posted, or do you really have the query cross joining table2 and table1? Should "ON chp.correspondence_id = chp.correspondence_id" really be "ON chp.correspondence_id = crp.correspondence_id"?
2) Did you add DISTINCT in to try to solve the cartesian problem?
It seems to me like that is likely your problem, but I would hope that something that simple didn't get overlooked in the QA and testing process.
October 26, 2010 at 5:03 am
Pam,
Is there a reason you can't use SSIS to do this move?
October 26, 2010 at 5:07 am
pitcher166 (10/25/2010)
Dynamic SQL is not bad in every case. You might try dynamic SQL when the execution plan inside a proc is suboptimal, since the statements inside the dynamic SQL get a new execution plan each time.
Actually, I disagree with that statement, precisely because a new execution plan is generated every time the SQL runs, and it can only be generated on the fly when the statement runs, not ahead of time.
Depending on how the dynamic SQL is written, you could be hosing the database with just that statement. The more I think about it, the more I think you need to get away from the dynamic SQL. Especially if it's the same 29 tables each time, there's no reason why you can't make your code non-dynamic and work from there.
What side of the "equation" is the code running on? Prod or Arch? [EDIT] Nevermind. I just realized you're running it on the Prod side. Have you tried running the code on the Arch side? It'll require a small rewrite and would be interesting to see if the rewrite resolves the issue.
October 26, 2010 at 7:32 am
Brandie,
If I were writing this from scratch with control of the DDL, etc., I would most definitely be writing it as an SSIS package with no dynamic SQL. I am aware of how awful it is. However, this process archives data off of a vendor's application, so I am running with some serious constraints on what liberties I can take with the code. In fact, what I have done is just a simple rewrite of the vendor's original script (well, actually not that simple, but I'm not letting on how much I've actually done, as I was only allocated a certain number of hours for this...). To do anything more would require an official "project", and I'm already way over-allocated on those through 2011.
UMG Developer,
You know, in the back of my mind I suspected it was something simple. You were right - there's a typo in the code that got completely missed. Now to do another re-re-re check to see if I missed anything else. Then I'll re-run this weekend and hope it works.
October 26, 2010 at 7:49 am
Well, I'm glad you found what could be the error.
I'm a little confused as to why you need to take liberties with the code for SSIS. Usually, you just need source and destination and maybe a query you write up yourself. But you know your system better than I do.