December 14, 2010 at 2:36 pm
Alrighty. So, status update on the processing so far.
I am still doing an all file load to the _staging table. The reason for this necessity is some business rules that need to get applied across multiple files to deal with duplicated records between batches. Wayne: I haven't tried the other index because I completely redid the way I was running the proc, so I no longer need lookbacks to PSA_stg.
I've cured the majority of the disparities in the DDL:
USE [PHX_Cello]
GO
/****** Object: Table [dbo].[PortSecAgg] Script Date: 12/14/2010 13:45:47 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[PortSecAgg](
[PortfolioID] [int] NOT NULL,
[SecurityID] [int] NOT NULL,
[AsOfDate] [datetime] NOT NULL,
[IsShortPosition] [bit] NOT NULL,
[PortfolioCode] [varchar](10) NULL,
[Symbol] [varchar](20) NULL,
[Quantity] [numeric](18, 4) NULL,
[SecTypeCode] [varchar](5) NULL,
[Cusip] [varchar](15) NULL,
[UnitCost] [money] NULL,
[TotalCost] [money] NULL,
[Price] [money] NULL,
[MarketValue] [money] NULL,
[AccruedInterest] [money] NULL,
[UnrealizedGL] [money] NULL,
[IsSubAccount] [bit] NULL,
CONSTRAINT [PK_PortSecAgg] PRIMARY KEY CLUSTERED
(
[AsOfDate] ASC,
[PortfolioID] ASC,
[SecurityID] ASC,
[IsShortPosition] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [PHX_Cello]
GO
/****** Object: Table [dbo].[PortSecAgg_staging] Script Date: 12/14/2010 13:45:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[PortSecAgg_staging](
[PortfolioID] [int] NOT NULL,
[SecurityID] [int] NOT NULL,
[AsOfDate] [datetime] NOT NULL,
[IsShortPosition] [bit] NOT NULL,
[PortfolioCode] [varchar](10) NULL,
[Symbol] [varchar](20) NULL,
[Quantity] [numeric](18, 4) NULL,
[SecTypeCode] [varchar](5) NULL,
[Cusip] [varchar](15) NULL,
[UnitCost] [money] NULL,
[TotalCost] [money] NULL,
[Price] [money] NULL,
[MarketValue] [money] NULL,
[AccruedInterest] [money] NULL,
[UnrealizedGL] [money] NULL,
[IsSubAccount] [bit] NULL,
[LoadFileName] [varchar](75) NULL,
[BatchGroup] [varchar](20) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
/****** Object: Index [idx_PortSecAgg_staging] Script Date: 12/14/2010 13:45:53 ******/
CREATE CLUSTERED INDEX [idx_PortSecAgg_staging] ON [dbo].[PortSecAgg_staging]
(
[AsOfDate] ASC,
[PortfolioID] ASC,
[SecurityID] ASC,
[IsShortPosition] ASC,
[BatchGroup] ASC
)
GO
/****** Object: Index [idx_PortSecAgg_staging_3] Script Date: 12/14/2010 13:45:53 ******/
CREATE NONCLUSTERED INDEX [idx_PortSecAgg_staging_3] ON [dbo].[PortSecAgg_staging]
(
[BatchGroup] ASC,
[AsOfDate] ASC,
[PortfolioID] ASC,
[SecurityID] ASC,
[IsShortPosition] ASC
) ON [PRIMARY]
My SSIS package now handles the majority of the work. For the curious, attached are two screen shots of the processing. One is the Control Flow, the other is the single Data Flow element.
This processes around 260 files, each containing ~175k rows, in 39 minutes, not including the 11 minute index rebuild that occurs after the file insertion. Just shy of 50 million records in 50 minutes, avg. 1million per minute... I can live with that.
Next was post-import processing off staging. This is where I'm in the middle of working and I'm looking into the partition swapping that Paul discussed in detail earlier. However, I wanted to bring to light some information regarding the loop delete. As Paul discussed, it is doing range deletes without a spool much cleaner than the #table delete. It also takes a LOT longer, depending on the # of PortfolioID/AsOfDate combinations exist.
For reference:
With this code, which is the loop-range delete:
DECLARE DeleteCursor CURSOR FAST_FORWARD FOR
SELECTPortfolioID, AsOfDate
FROM#DelPortList
OPEN DeleteCursor
FETCH NEXT FROM DeleteCursor INTO @PortfolioID, @AsOfDate
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE --psa
FROM
PortSecAgg --AS psa WITH (TABLOCKX)
--JOIN
--#DelPortList AS dpl WITH (TABLOCKX)
--ONpsa.PortfolioID = dpl.PortfolioID
--AND psa.AsOfDate = dpl.AsOfDate
WHERE
PortfolioID = @PortfolioID
AND AsOfDate = @AsOfDate
FETCH NEXT FROM DeleteCursor INTO @PortfolioID, @AsOfDate
END
CLOSE DeleteCursor
DEALLOCATE DeleteCursor
I received these results:
WITH the loop for the delete:
Step - Find duplicated AsOfDate/PortIDs between batches: 10 seconds.
Step - Remove duplicated AsOfDate/PortIDs between batches: 0 seconds.
Step - Declare and open cursor: 0 seconds.
Step - Insert into #DelPortList for Batch 0001: 3 seconds.
Step - Delete from PortSecAgg for Batch 0001: 1161 seconds.
Step - Insert New to PortSecAgg for Batch 0001: 574 seconds.
Step - Insert into #DelPortList for Batch 0002: 3 seconds.
Step - Delete from PortSecAgg for Batch 0002: 1102 seconds.
Step - Insert New to PortSecAgg for Batch 0002: 528 seconds.
Step - Insert into #DelPortList for Batch 0003: 3 seconds.
Step - Delete from PortSecAgg for Batch 0003: 1178 seconds.
Step - Insert New to PortSecAgg for Batch 0003: 502 seconds.
Step - Insert into #DelPortList for Batch 0004: 4 seconds.
-> I stopped it manually here.
With this code:
DELETE psa
FROM
PortSecAgg AS psa WITH (TABLOCKX)
JOIN
#DelPortList AS dpl WITH (TABLOCKX)
ONpsa.PortfolioID = dpl.PortfolioID
AND psa.AsOfDate = dpl.AsOfDate
I got these results:
Step - Find duplicated AsOfDate/PortIDs between batches: 3 seconds.
Step - Remove duplicated AsOfDate/PortIDs between batches: 0 seconds.
Step - Declare and open cursor: 0 seconds.
Step - Insert into #DelPortList for Batch 0001: 3 seconds.
Step - Delete from PortSecAgg for Batch 0001: 2 seconds.
Step - Insert New to PortSecAgg for Batch 0001: 685 seconds.
Step - Insert into #DelPortList for Batch 0002: 3 seconds.
Step - Delete from PortSecAgg for Batch 0002: 204 seconds.
Step - Insert New to PortSecAgg for Batch 0002: 626 seconds.
Step - Insert into #DelPortList for Batch 0003: 2 seconds.
Step - Delete from PortSecAgg for Batch 0003: 97 seconds.
Step - Insert New to PortSecAgg for Batch 0003: 503 seconds.
Step - Insert into #DelPortList for Batch 0004: 4 seconds.
Step - Delete from PortSecAgg for Batch 0004: 99 seconds.
Step - Insert New to PortSecAgg for Batch 0004: 550 seconds.
Step - Insert into #DelPortList for Batch 0005: 4 seconds.
Step - Delete from PortSecAgg for Batch 0005: 97 seconds.
Step - Insert New to PortSecAgg for Batch 0005: 572 seconds.
So, the looped-range delete may be faster, but it's going to depend more on the # of loops. If you've only got a small subsection of combinations to remove, I'm sure it is quicker. The .sqlplan agrees. The volume of iterations though seemed to kill it completely.
One of the interesting things is that it doesn't always generate the spool. Attached are two query plans from the second set. Not that for batch 0001, it didn't generate a spool. For Batch 0002, it did. If I can figure out what the difference between the two are (the statements are the same), I'll be sitting pretty.
Now I'm up to the insert. It's not exactly a horrible process, and since the clustered on the two tables match (although, _staging has an extra column in the B-Tree, and two additional columns), I believe I'm just dealing with allocation time. I've attached the current script as is, along with a sqlplan for one of the batch's insert. (Batch method for sp_update_psa.txt).
Of note, this insert into ignores the IDX_3 (which uses the BatchGroup as the header), and doesn't care about the indexing in psa_stg. I'm going to try a swap on the clustered to the BatchGroup column (the missing index previously discussed that Wayne caught), and completely destroy idx_3, see where that gets me. If it fails to arrive at a satisfactory result... Partitioning with the swappable table shall be heavily investigated. 🙂
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
December 14, 2010 at 3:19 pm
Ignore this, bad post.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
December 14, 2010 at 3:33 pm
Any chance to post the index def for #DelPortList and PortSecAgg? It seems like the clustered indexes don't match.
And now I'm totally guessing: batch001 doesn't have a spool since there are no matching rows (check the input arrow into the NestedLoop). it seems like the index statistics of both tables indicate that the data don't match at all so SQL Server is not even trying to find a match (I'm sure Paul will come up with a valid reasoning for or against it...). If you look at the actual number of rows you'll see a zero all the way down to the index delete...
For batch002 on the other side it seems like index statistics will result in a probable match. But since the indices don't seem to match you end up with a scan on both sides.
Again: plain made-up theory... :blush:
Btw: do you have any stats available how long the process took at the very beginning compared to where you're at right now? (just rough figures to -hopefully- keep us motivated 😀 )
December 14, 2010 at 3:37 pm
JAWDROP.
Step - Find duplicated AsOfDate/PortIDs between batches: 75 seconds.
Step - Remove duplicated AsOfDate/PortIDs between batches: 0 seconds.
Step - Declare and open cursor: 0 seconds.
Step - Insert into #DelPortList for Batch 0001: 16 seconds.
Step - Delete from PortSecAgg for Batch 0001: 89 seconds.
Step - Insert New to PortSecAgg for Batch 0001: 68 seconds.
Step - Insert into #DelPortList for Batch 0002: 15 seconds.
Step - Delete from PortSecAgg for Batch 0002: 90 seconds.
Step - Insert New to PortSecAgg for Batch 0002: 95 seconds.
Step - Insert into #DelPortList for Batch 0003: 15 seconds.
Step - Delete from PortSecAgg for Batch 0003: 153 seconds.
Step - Insert New to PortSecAgg for Batch 0003: 78 seconds.
Step - Insert into #DelPortList for Batch 0004: 15 seconds.
Step - Delete from PortSecAgg for Batch 0004: 20 seconds.
Step - Insert New to PortSecAgg for Batch 0004: 70 seconds.
Step - Insert into #DelPortList for Batch 0005: 16 seconds.
Step - Delete from PortSecAgg for Batch 0005: 92 seconds.
Step - Insert New to PortSecAgg for Batch 0005: 92 seconds.
Seriously?! The BatchGroup as the ONLY piece of the clustered index did THIS?!
My current index specification:
USE [PHX_Cello]
GO
/****** Object: Index [idx_PortSecAgg_staging] Script Date: 12/14/2010 15:33:07 ******/
CREATE CLUSTERED INDEX [idx_PortSecAgg_staging] ON [dbo].[PortSecAgg_staging]
(
[BatchGroup] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
.sqlplan attached. It's doing a clustered seek instead of a scan now. Mind you, there's a total of 5 different batch groups in PSA_stg. That's ~20% selectivity. It's still seeking. I didn't even know it could do that at that horrendous level of selectivity.
Side note: None of the DELETE statements spooled, either.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
December 14, 2010 at 3:45 pm
LutzM (12/14/2010)
Any chance to post the index def for #DelPortList and PortSecAgg? It seems like the clustered indexes don't match.
They don't, not directly.
#DelPortList's index is in the proc, listed here for easy reference:
CREATE TABLE #DelPortList
(PortfolioID INT NOT NULL,
AsOfDateDATETIME NOT NULL,
PRIMARY KEY ( AsOfDate, PortfolioID) )
Clustered on PortSecAgg:
USE [PHX_Cello]
GO
/****** Object: Index [PK_PortSecAgg] Script Date: 12/14/2010 15:39:48 ******/
ALTER TABLE [dbo].[PortSecAgg] ADD CONSTRAINT [PK_PortSecAgg] PRIMARY KEY CLUSTERED
(
[AsOfDate] ASC,
[PortfolioID] ASC,
[SecurityID] ASC,
[IsShortPosition] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
They match via the first two fields, the leading edge(s), since I don't need the rest in this instance. The SecID/IsShort gets used in other queries later, and is also the PK, so I left it all in one component as the PK constraint/Clustered Index.
And now I'm totally guessing: batch001 doesn't have a spool since there are no matching rows (check the input arrow into the NestedLoop). it seems like the index statistics of both tables indicate that the data don't match at all so SQL Server is not even trying to find a match (I'm sure Paul will come up with a valid reasoning for or against it...). If you look at the actual number of rows you'll see a zero all the way down to the index delete...
Even when they match, now they don't spool. No, I don't know why.
Btw: do you have any stats available how long the process took at the very beginning compared to where you're at right now? (just rough figures to -hopefully- keep us motivated 😀 )
When I started this thread, these 50 million rows took 8+ hours to compute. I am down to ~65 minutes. 39 minutes for the file load, which handles a lot of the cleanup work now. 11 minutes for the dual index build (I haven't retested since I modded these indexes). I'm assuming with only one index on the BatchGroup that should go to ~14 minutes, which is what CREATE CLUSTERED INDEX idx_PortSecAgg_staging ON PortSecAgg_staging (BatchGroup) took from the SSMS. No, I don't know why it took longer than two indexes, but I'll get back to you on that.
The rest of the process took 16 minutes. I expect the first component, the duplicate removal part, will take longer when it hits than it used to because of the change in clustered index on PSA_stg. I'll keep an eye on that and possible re-introduce a second NC index.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
December 14, 2010 at 11:16 pm
Craig,
For the deletes, use this:
DELETE psa
FROM #DelPortList AS dpl WITH (TABLOCKX)
JOIN PortSecAgg AS psa WITH (TABLOCKX)
ON psa.PortfolioID = dpl.PortfolioID
AND psa.AsOfDate = dpl.AsOfDate
OPTION (FORCE ORDER, LOOP JOIN);
The plan is a loop over PortfolioID/AsOfDate combinations (look at the outer references on the loop join) - exactly the same idea as the cursor. The important thing is avoiding the scan, hash join, sort, and spool. The hints just force the cursor-equivalent plan.
There's no Halloween protection with the loop join plan for reasons that should be obvious to anyone that knows what Halloween protection is 😉 😎
The hash plan doesn't need the table spool because the sort is a blocking operator and so provides the Halloween protection - SQL 2005 doesn't seem to realise this. The same tables on 2008 produce the same plan but without the spool.
You commented out the TABLOCKX hint on the cursor plan, which won't have helped at all! Nevertheless, the loop join version above is exactly equivalent, shorter, and a bit more efficient. Go with it, with the hints.
Why does the INSERT plan make your jaw drop? The seek fetches the batch group (what plan alternative is there?) and there's a parallel minor sort to get the rows in destination clustered index order. What were you expecting here? I'm a bit confused at this point.
Perhaps you are surprised that DELETE followed by INSERT is so much faster than your original process? Think about it!
Paul
edit: sorry for the short reply, I'm pressed for time right now
December 15, 2010 at 1:38 am
Paul White NZ (12/14/2010)
...There's no Halloween protection with the loop join plan for reasons that should be obvious to anyone that knows what Halloween protection is 😉 😎
...
What would it help in Craigs scenario to lock the door, dim the light and ignore the ringing bell? :unsure: 😀
December 15, 2010 at 1:50 am
LutzM (12/15/2010)
What would it help in Craigs scenario to lock the door, dim the light and ignore the ringing bell? :unsure: 😀
The last thing Craig needs right now is kids playing tricks on his package. :laugh:
December 15, 2010 at 2:26 am
Paul White NZ (12/15/2010)
LutzM (12/15/2010)
What would it help in Craigs scenario to lock the door, dim the light and ignore the ringing bell? :unsure: 😀The last thing Craig needs right now is kids playing tricks on his package. :laugh:
Understood. So, Halloween protection is important regardless of the interpretation.
I step back then... 😉
December 15, 2010 at 7:17 am
Paul White NZ (12/15/2010)
LutzM (12/15/2010)
What would it help in Craigs scenario to lock the door, dim the light and ignore the ringing bell? :unsure: 😀The last thing Craig needs right now is kids playing tricks on his package. :laugh:
Yeah, but we're not kids, and he's a jokester, so it's all fair game! 😀
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
December 15, 2010 at 10:54 am
Paul White NZ (12/14/2010)
Craig,For the deletes, use this:
DELETE psa
FROM #DelPortList AS dpl WITH (TABLOCKX)
JOIN PortSecAgg AS psa WITH (TABLOCKX)
ON psa.PortfolioID = dpl.PortfolioID
AND psa.AsOfDate = dpl.AsOfDate
OPTION (FORCE ORDER, LOOP JOIN);
The plan is a loop over PortfolioID/AsOfDate combinations (look at the outer references on the loop join) - exactly the same idea as the cursor. The important thing is avoiding the scan, hash join, sort, and spool. The hints just force the cursor-equivalent plan.
I'll take a look at it. The Cursor plan itself wasn't so joyous, so I'll get some benchmarks (my DBAs have been messing with files on me, I haven't actually completed a benchmark in a day...) when I can and then test that out.
There's no Halloween protection with the loop join plan for reasons that should be obvious to anyone that knows what Halloween protection is 😉 😎
Which I apparently don't, even after going hunting around the web a bit. (Note, a lot of blogs are blocked by websense from work, so if the answers are hiding in those...). My understanding for Halloween Protection is that it's Update Based, so that when you change a value in an index it doesn't move further along and get changed a second time, so all the reads must occur first. Most of them reference the IBM salary incident.
Since there's no update anymore, I didn't think you'd need Halloween Protection here.
The hash plan doesn't need the table spool because the sort is a blocking operator and so provides the Halloween protection - SQL 2005 doesn't seem to realise this. The same tables on 2008 produce the same plan but without the spool.
You commented out the TABLOCKX hint on the cursor plan, which won't have helped at all! Nevertheless, the loop join version above is exactly equivalent, shorter, and a bit more efficient. Go with it, with the hints.
:blush: Whoops! I'll give it a try.
Why does the INSERT plan make your jaw drop? The seek fetches the batch group (what plan alternative is there?) and there's a parallel minor sort to get the rows in destination clustered index order. What were you expecting here? I'm a bit confused at this point.
The SEEK being faster isn't what made me amazed. The SEEK occuring at all did. The selectivity is horrendous. I've apparently forgotten or lost something very important at the basic levels of indexing and selectivity and need to go back to ground zero on this for a bit.
edit: sorry for the short reply, I'm pressed for time right now
Your short is everyone else's thorough. Thanks for the notes. 🙂
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
December 15, 2010 at 10:56 am
WayneS (12/15/2010)
Paul White NZ (12/15/2010)
LutzM (12/15/2010)
What would it help in Craigs scenario to lock the door, dim the light and ignore the ringing bell? :unsure: 😀The last thing Craig needs right now is kids playing tricks on his package. :laugh:
Yeah, but we're not kids, and he's a jokester, so it's all fair game! 😀
Heheh, at the rate this thread is making me feel like I don't know what the hell I'm doing in SQL Server, I'm starting to think I should have created a new logon under the name 'Headless Horseman'! At least I'd be anonymous! I wonder if I can just wear a pumpkin on my head...
Thanks to all, as usual. 😀
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
December 15, 2010 at 11:24 am
Craig Farrell (12/15/2010)
WayneS (12/15/2010)
Paul White NZ (12/15/2010)
LutzM (12/15/2010)
What would it help in Craigs scenario to lock the door, dim the light and ignore the ringing bell? :unsure: 😀The last thing Craig needs right now is kids playing tricks on his package. :laugh:
Yeah, but we're not kids, and he's a jokester, so it's all fair game! 😀
Heheh, at the rate this thread is making me feel like I don't know what the hell I'm doing in SQL Server, I'm starting to think I should have created a new logon under the name 'Headless Horseman'! At least I'd be anonymous! I wonder if I can just wear a pumpkin on my head...
Thanks to all, as usual. 😀
I don't know if you noticed this, but we're all paying attention to Paul also. He can definitely make me feel like I don't know what I'm doing either!
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
Viewing 13 posts - 61 through 72 (of 72 total)
You must be logged in to reply to this topic. Login to reply