September 27, 2007 at 11:44 am
Really sounding like Analysis Services is the way to go here. It is REALLY good at preaggregating/precomputing and it does allow for updates to values too.
Sometimes we really can't make things run faster - there is just too much data. 🙁 In such cases we simply fall back to the "throw more hardware at it" resolution.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on Google's mail service
September 29, 2007 at 10:03 am
Hi Bob...
Just confirming... how many rows does the following update?
4) The _Change fields would be the difference between the day before and the current day for the specific type. This can be quite painful.
;WITH X (RN, MainSeriesID, theDate, Weekly_Type1) AS (
    SELECT ROW_NUMBER() OVER (ORDER BY MainSeriesID, theDate) AS RN,
           MainSeriesID, theDate, Weekly_Type1
    FROM #ComplexFilter
    WHERE WEEKLY = 1
)
UPDATE #TempTable
SET Weekly_Type1_Change =
    CASE
        WHEN Newer.Weekly_Type1 > Older.Weekly_Type1 THEN 1
        WHEN Newer.Weekly_Type1 < Older.Weekly_Type1 THEN -1
        ELSE 0
    END
FROM #TempTable
INNER JOIN X AS Newer
        ON Newer.MainSeriesID = #TempTable.MainSeriesID
       AND Newer.theDate = #TempTable.theDate
INNER JOIN X AS Older
        ON Older.RN + 1 = Newer.RN
       AND Older.MainSeriesID = Newer.MainSeriesID
--Jeff Moden
Change is inevitable... Change for the better is not.
September 29, 2007 at 3:51 pm
This part of it can vary widely. In most cases it will update 100% of the rows that are extracted to the temp table. The temp table will hold anywhere from 500K rows (the update takes a few seconds at most) to 20M rows (which takes a few minutes to run).
Also, if they are looking at daily, weekly, and monthly series, I have to run 3 virtually identical versions of that. Obviously the weekly rows would be roughly 1 in 5 and the monthly roughly 1 in 20. If it were just weekly and monthly, it would be different.
Now, to let me go back and see that, for example, the deltas for daily_type1 and weekly_type1 both exceeded 5, I need to backfill the weekly value for all days.
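Something along these lines is what I have in mind for the backfill (whether it performs is another matter, and the table/column names are just taken from the snippets above):
-- Rough, untested sketch: carry the most recent weekly value forward onto each
-- daily row. Table and column names are assumed from the earlier snippets.
UPDATE d
SET    Weekly_Type1 =
       (SELECT TOP (1) w.Weekly_Type1
        FROM   #ComplexFilter w
        WHERE  w.MainSeriesID = d.MainSeriesID
          AND  w.WEEKLY = 1
          AND  w.theDate <= d.theDate
        ORDER BY w.theDate DESC)
FROM   #ComplexFilter d
WHERE  d.WEEKLY = 0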
September 29, 2007 at 4:20 pm
Thanks, Bob...
First, updating 500K or even a million rows in a single update probably isn't a problem... 20 million is. If you do some testing, I think you'll find a million-row update only takes about 2.7 minutes (assuming roughly 6K rows per second). But if you try 20 million rows, instead of taking only 20 times longer, it can take 200 or even 2000 times longer! Why? Hell, I dunno... all I know is that's what happens in the tests I've done. Some of it is because the LDF may need to grow. Some of it may be because TempDB needs to grow. Some of it might be because the swap file comes into play for updates that big. I don't know for sure, though.
Anyway, you need to split the update into more manageable chunks of a quarter million or so rows.
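Just to show the skeleton I'm talking about (untested; the real join/CASE logic would go where the constant is, and the IS NULL check is only one way of marking a row "done"):
-- Skeleton only: chunked update driven by a "not done yet" filter
DECLARE @Rows INT
SET @Rows = 1

WHILE @Rows > 0
BEGIN
   -- the real CASE/join logic would replace the constant here
   UPDATE TOP (250000) #TempTable
      SET Weekly_Type1_Change = 0
    WHERE Weekly_Type1_Change IS NULL

   SET @Rows = @@ROWCOUNT
END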
The other thing is, updates with a join can be a bugger... if you do it wrong, one row gets updated and then the join has to be 100% recalculated, over and over and over. Guess how much time that little gem will take? And, I'm thinking that's what's happened in your code... you're updating columns that appear in the CTE... I just can't help thinking that's a bad idea.
Getting ready to go to a movie with my sweetie... if I think of it, I'll try to write an example of how you might be able to do this so it runs in seconds instead of hours...
--Jeff Moden
Change is inevitable... Change for the better is not.
September 29, 2007 at 9:15 pm
I've had great luck with batching updates into smaller chunks. One long update might take an hour, but split into quarters, I might get all four to run in 10 minutes total.
October 1, 2007 at 6:24 am
I should be able to do some tests. I'll start with something like the following.
DECLARE @Ctr INT
SELECT @Ctr = 1
WHILE (@Ctr > 0)
BEGIN
    ;WITH X (RN, MainSeriesID, theDate, Weekly_Type1) AS (
        SELECT TOP (250000)
               ROW_NUMBER() OVER (ORDER BY MainSeriesID, theDate) AS RN,
               MainSeriesID, theDate, Weekly_Type1
        FROM #ComplexFilter
        WHERE WEEKLY = 1
          AND Weekly_Type1_Change IS NULL  -- only rows not yet processed
    )
    UPDATE #TempTable
    SET Weekly_Type1_Change =
        CASE
            WHEN Newer.Weekly_Type1 > Older.Weekly_Type1 THEN 1
            WHEN Newer.Weekly_Type1 < Older.Weekly_Type1 THEN -1
            ELSE 0
        END
    FROM #TempTable
    INNER JOIN X AS Newer
            ON Newer.MainSeriesID = #TempTable.MainSeriesID
           AND Newer.theDate = #TempTable.theDate
    INNER JOIN X AS Older
            ON Older.RN + 1 = Newer.RN
           AND Older.MainSeriesID = Newer.MainSeriesID

    SELECT @Ctr = @@ROWCOUNT
END
October 1, 2007 at 7:07 am
Without indexes this has no hope of performing better. In my other tests the indexes were no help, because they slowed the insert down significantly and cost more there than they gained on the update, which is why I removed them. This, however, might still work. I'll post an update as soon as I can get a better test in.
October 1, 2007 at 9:09 am
Formatting in this new site is less than adequate.
I tried to post the table, but that didn't work so well.
Short answer: adding a PK to the process for a 3.3M-row result set slowed the insert to roughly 103% of the no-index time, while the chunked update came in at about 97% of the time for updating all rows at once.
However, after several runs I found there is enough background load on my dev system to swing performance by as much as an additional 2%.
End result: on average, the chunked updates (which needed the index) still came out worse overall than the single batch update without indexes, once the slower insert is taken into account.
FYI, the big all-at-once update WITH an index took about 111% as long as the chunked update, so there does appear to be a penalty for big updates when indexes are present; but in this case going without the index worked better overall.
So Jeff, you were right, but in this case I have the option to NOT use an index, and that performed better.
Oh, by the way, the performance:
Insert of 3.3M rows:
Best run (no index): 270 seconds
Worst run (with index): 283 seconds
Update:
200K chunks: 166 seconds
All at once (no index): 173 seconds
All at once (with index): 185 seconds
October 1, 2007 at 11:34 am
Jeff Gray (9/26/2007)
Another thing you might try is to use the READ UNCOMMITTED transaction isolation level. That will reduce overhead to some degree.
I have now done about 10 runs, and it looks like this does help. The average improvement appears to be somewhere around 6%.
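For reference, the change amounts to one line at the top of the batch (assuming dirty reads are acceptable for this extract):
-- Allow dirty reads for everything that follows in this session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
-- The per-table equivalent would be a WITH (NOLOCK) hint on the source tables in the SELECT.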
October 1, 2007 at 2:36 pm
I've had to deal with very long-running updates and processor-intensive stored procedures that have been able to bring my dual quad-core with 8 GB to its knees.
If off-hours processing is not available, I've been able to run an intensive process on a separate box that has a copy of the production database as a hot spare.
I have also written a specially designed process that throws a bunch of jobs on the job queue and divides the work up amongst them. I have one such process that spawns 8 separate jobs, and I can get 24 hours of work done in 2. The premise is to break the workload into smaller, more efficient chunks.
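Roughly, the driver looks something like the sketch below; the control table and job names are made up for illustration, and the worker jobs themselves have to be created ahead of time.
-- Hypothetical sketch: seed a work queue with key ranges, then start pre-created
-- SQL Agent jobs that each process one slice.
INSERT INTO dbo.WorkQueue (SliceNo, FromID, ToID)
SELECT 1,        1, 25000000 UNION ALL
SELECT 2, 25000001, 50000000   -- ...and so on for the remaining slices

EXEC msdb.dbo.sp_start_job @job_name = N'WorkerJob01'  -- each job runs the same proc
EXEC msdb.dbo.sp_start_job @job_name = N'WorkerJob02'  -- against its own slice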
So there are ways to divide & conquer to get it done.
Kurt
DBA
RHWI, Inc
Poughkeepsie, NY
October 1, 2007 at 7:01 pm
Kurt W. Zimmerman (10/1/2007)
I've had to deal with very long-running updates and processor-intensive stored procedures that have been able to bring my dual quad-core with 8 GB to its knees.
My word... Don't you think someone needs to fix that :blink:? Maybe a rewrite? 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
October 1, 2007 at 8:21 pm
Bob Fazio (10/1/2007)
Formatting in this new site is less than adequate.
I absolutely agree!
Insert of 3.3M rows:
Best run (no index): 270 seconds
Worst run (with index): 283 seconds
Update:
200K chunks: 166 seconds
All at once (no index): 173 seconds
All at once (with index): 185 seconds
Not sure what's going on... I've got a single CPU 1.8Ghz box with IDE drives and 2 GB ram with Developer's Edition of SQL Server 2000 SP 4...
I inserted 3.3 million rows into a new table with a clustered PK in only 42 seconds.
I inserted 3.3 million "interleaved" rows (according to PK) in 156 seconds.
I inserted 3.3 million non-interleaved rows in 72 seconds.
Something else is (maybe dreadfully) wrong but I just don't have SQL Server 2k5 to figure out what. It may be that the datatypes for the columns in the various tables/CTE aren't the same as they are in the target tables... perhaps there are some triggers on the tables (if they're not temp tables, of course)... perhaps you have missing or damaged stats... perhaps the indexes are fragged... perhaps, etc... dunno for sure.
Also, I updated 3.3 million rows (single batch, NOT the primary key) in only 26 seconds. Yeah, it was only a single column, but still... something else is going on and I'm just not sure what it is...
Here's the code I tested with (in the absence of your actual data)...
--===== Create and populate a 3,333,333 row test table.
-- Column "RowNum" is an IDENTITY starting at 0 and stepping by 10 (unique values)
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeString" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeNumber" has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Takes a bit over 3 minutes to build this initial table
SELECT TOP 3333333
RowNum = IDENTITY(INT,0,10),
SomeInt = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT),
SomeString = CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
+ CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeNumber = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY),
SomeDate = CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2 --Lack of join criteria makes this a CROSS-JOIN
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (RowNum)
GO
--===== Create a "target" table with a Clustered PK same as the source table
CREATE TABLE [dbo].[JBMTest1] (
[RowNum] [int] NOT NULL,
[SomeInt] [int] NULL ,
[SomeString] [char] (2) NULL ,
[SomeCSV] [varchar] (80) NULL ,
[SomeNumber] [money] NULL ,
[SomeDate] [datetime] NULL ,
PRIMARY KEY CLUSTERED
(
[RowNum]
)
)
GO
--===== Straight insert, no data in table
INSERT INTO jbmTest1 --(50 seconds)
(RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate)
SELECT RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate
FROM jbmTest
--===== Insert Data 100% interleaved in clustered index
INSERT INTO jbmTest1 --(156 seconds)
(RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate)
SELECT RowNum+5, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate
FROM jbmTest
--===== Insert data into the "end" of the table (non-interleaved)
INSERT INTO jbmTest1 --(72 seconds)
(RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate)
SELECT RowNum+33333326, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate
FROM jbmTest
--===== Table size is almost 10 million rows, now. Update 3.3 million of them
-- (26 seconds)
UPDATE jbmTest1
SET SomeDate = 0
WHERE RowNum <= 33333326/2
Like I said previously, dunno for sure what the problem is and can't check because I don't have 2k5... but now you have some code you can check your box with.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 1, 2007 at 8:51 pm
My results on Express SP2
INSERT INTO jbmTest1 --(120 seconds)
(RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate)
SELECT RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate
FROM jbmTest
--===== Insert Data 100% interleaved in clustered index
INSERT INTO jbmTest1 --(79 seconds)
(RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate)
SELECT RowNum+5, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate
FROM jbmTest
--===== Insert data into the "end" of the table (non-interleaved)
INSERT INTO jbmTest1 --(129 seconds)
(RowNum, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate)
SELECT RowNum+33333326, SomeInt, SomeString, SomeCSV, SomeNumber, SomeDate
FROM jbmTest
--===== Table size is almost 10 million rows, now. Update 3.3 million of them
-- (90 seconds)
UPDATE jbmTest1
SET SomeDate = 0
WHERE RowNum <= 33333326/2
October 1, 2007 at 9:13 pm
When I get some time tomorrow, I'll try your tests. But this goes back to the initial point of this post.
The insert is the pivot of the 1-billion-row table (actually it is a union of two 250-million-row tables and a view over a 500-million-row table)... so ignore the total insert time; most of it comes from the query, not the insert. The index does impact the insert performance, though.
As for the update: the example I gave, and the one you tested, was for one column. The test run I timed was actually updating 4 columns (Type1, 2, 3, and 4) in one pass, using CASE statements.
October 1, 2007 at 9:16 pm
Jeff,
The update I am doing calculates the delta from row N-1 to row N and stores it as an INT (1 for rising, 0 for same, -1 for falling). That is why this is hard/painful.
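Incidentally, the CASE expressions in the snippets above collapse to SIGN(), which returns exactly 1, 0, or -1 (same X CTE as before; this assumes the type columns aren't NULL):
-- Same logic as the CASE: 1 = rising, 0 = unchanged, -1 = falling
UPDATE #TempTable
SET    Weekly_Type1_Change = SIGN(Newer.Weekly_Type1 - Older.Weekly_Type1)
FROM   #TempTable
INNER JOIN X AS Newer
        ON Newer.MainSeriesID = #TempTable.MainSeriesID
       AND Newer.theDate = #TempTable.theDate
INNER JOIN X AS Older
        ON Older.RN + 1 = Newer.RN
       AND Older.MainSeriesID = Newer.MainSeriesID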