Please Help - trying to eliminate "used tempdb to spill data"

Question

Please Help - trying to eliminate "used tempdb to spill data"

JohnG69

Hall of Fame

Points: 3114
More actions
July 10, 2015 at 7:05 am

#302343

Hi All,
I've been trying to make the following query more performant by breaking it up into smaller pieces.
SELECT MT.A3+MT.A4 AS A34,MT.A3
-- ,M.*
FROM
Master_TAB M
JOIN (SELECT M.A1,t3.A3,t3.A4,M.A6,M.A2,ROW_NUMBER() OVER (PARTITION BY A1,A6,A3,A4 ORDER BY A5 DESC) AS rownum
FROM Master_TAB M
JOIN TABle2 t2 ON M.A2=t2.A2
JOIN Table3 t3 ON t2.A2=t3.A2
) MT ON M.A1=MT.A1 AND M.A2=MT.A2 AND MT.rownum=1
ORDER BY MT.A3+MT.A4,MT.A1;
I know that the Spill is caused by the Sort but I can't remove the sort (sort can't be done in front end).
my master table had 1.7 million rows and almost 200 columns (bad design? I know but can't be changed as there's too much that would be affected)
every row is little over 1KB
here's my attempt...
-- MASTER_TAB has 1.7 million rows and 50 columns
CREATE TABLE [dbo].[tmp_ABC](
[A1] [varchar](13) NOT NULL,
[A2] [varchar](5) NOT NULL,
[A3] [varchar](4) NOT NULL,
[A4] [varchar](4) NOT NULL,
[A5] [int] NULL
) ON [PRIMARY]
-- inserted data into tmp_ABC from master Table MASTER_TAB (PK= A1,A2,A3)
create index IDX_tmp_ABC on [tmp_all_price_Target] (A1, A3, A4, A5) INCLUDE (A2)
CREATE TABLE [dbo].[tmp_DEF](
[A1] [varchar](13) NOT NULL,
[A2] [varchar](5) NOT NULL,
[A3] [varchar](4) NOT NULL,
[A4] [varchar](4) NOT NULL,
[A5] [int] NULL
) ON [PRIMARY]
insert into tmp_DEF
Select p.A1, t.A2, p.A3, p.A4,p.A5
from [tmp_ABC] t
join (SELECT A1, A3, A4, max(A5) as A5
FROM tmp_ABC
group by A1, A3, A4) pon t.A1 = p.A1 and t.A3 = p.A3 and t.A4=p.A4 and t.A5 = p.A5
create index IDX_tmp_DEF on [tmp_all_UPC_Sto_Loc_for_pri] (A1,A2, A3, A4, A5)
create index IDX_tmp_DEF_sort on [tmp_all_UPC_Sto_Loc_for_pri] (A3,A4) INCLUDE (A1,A2, A5)
-- this is the Query that is causing the Spill (in reality I'm supposed to bring back all 200 columns fro the master table but for debug purposes I limited the columns)
Select c.A3+c.A4 as A34, c.A3, c.A1
-- M.*
from tmp_DEF c
join MASTER_TAB M on M.A1 = c.A1 and M.A2 = c.A2
order by c.A3, C.A4
if I just run the following I get no spill:
Select c.A3+c.A4 as A34, c.A3, c.A1
from tmp_DEF c
order by c.A3, C.A4
as soon as I add the Master table as a Join I get the Spill...
I read many articles, tried many suggested things (creating indexes... clustered, non-clustered) without success.
Maybe I'm totally in Left Field and should enhance the performance going another route?
I've been breaking my head for the past 3 days and can't seem to figure it out... I had to turn for help here...
Thank you in advanced
JG

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply

sgmunson SSC Guru Points: 110651 More actions · Answer 1

Just curious as to whether you might be allowed to add the ROW_NUMBER() column to the table as a computed, persisted column... Might simplify things quite a bit, although you might pay for it on INSERTs. Just a thought...

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

JohnG69 Hall of Fame Points: 3114 More actions · Answer 2

Hi sgmunson,

I can't touch the Master Table...

but a computed persistent column is virtual and calculated so it wouldn't affect the inserts or updates which might be a good idea...

but how will it affect the performance?

row numbers can change with every insert... not sure how this will work...

maybe explain more what and how you see it...

I appreciate this very much.

thanks

JG

Jonathan AC Roberts SSCoach Points: 18408 More actions · Answer 3

I think you can optimise this by using a CROSS APPLY and TOP(1). Doing it with this method you should be able to remove the master table from the subquery and join to the outermost Master_TAB.

SELECT MT.A3+MT.A4 AS A34,

MT.A3

-- ,M.*

FROM Master_TAB M

CROSS APPLY (SELECT TOP(1)

t3.A3,

t3.A4

FROM TABle2 t2

INNER JOIN Table3 t3

ON t2.A2=t3.A2

WHERE t2.A2 = M.A2

ORDER BY A5 DESC

) AS MT

ORDER BY MT.A3 + MT.A4, M.A1;

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 4

-- The outer query makes little sense, it only adds duplicate rows to a set which you're eliminating rows from

SELECT MT.A3+MT.A4 AS A34, MT.A3

-- ,M.*

FROM Master_TAB M

JOIN (

SELECT M.A1, t3.A3, t3.A4, M.A6, M.A2, ROW_NUMBER() OVER (PARTITION BY A1, A6, A3, A4 ORDER BY A5 DESC) AS rownum

FROM Master_TAB M

JOIN TABle2 t2

ON M.A2 = t2.A2

JOIN Table3 t3

ON t2.A2 = t3.A2

) MT

ON M.A1=MT.A1 AND M.A2=MT.A2 AND MT.rownum=1

ORDER BY MT.A3+MT.A4,MT.A1;

-- Try variations on this

SELECT MT.A3 + MT.A4 AS A34, MT.A3

-- ,M.*

FROM (

SELECT

--M.A1,

t3.A3,

t3.A4,

--M.A6, M.A2,

ROW_NUMBER() OVER (PARTITION BY A1, A6, A3, A4 ORDER BY A5 DESC) AS rownum

FROM Master_TAB M

JOIN TABle2 t2

ON M.A2=t2.A2

JOIN Table3 t3

ON t2.A2=t3.A2

) MT

WHERE MT.rownum=1

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

JohnG69 Hall of Fame Points: 3114 More actions · Answer 5

normally it would work, but the t2 table is a table that contains locations and groups of locations and the master table is the item master which contains pricing for groups of locations and single locations... depending on the priority which is A5 the price goes to a group or a location... each location can only get one item price.

the t3 table is the links between the locations and groups of locations...

the priority is set at the location and group level...

if Location X is part of 5 groups + itself and an item is only defined for 3 groups, the query should return the highest group of the 3 groups (not 6)

I hope this clarifies things...

JohnG69 Hall of Fame Points: 3114 More actions · Answer 6

I should add that the final result has to be the item, the location and the price

example if Loc1, Loc2, and Loc3 are part of Group1 and the item price is for Group1,

although the master table contains

Item X, Group1

the final query should return:

item X, Loc1, priceY

item X, Loc2, priceY

item X, Loc3, priceY

and the sort is to sort by location.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 7

Couple of things. First, it would help if you could post the DDL for the table(s) involved, some sample data for the table(s) that is representative of the problem domain (again, sample data not production data), and the expected results of the query.

All of this should be readily consumable (in other words all we need to do to is cut/paste/execute in SSMS).

Regarding the DDL for the table(s), you only need to provide the columns needed for the query, not all 200 or how ever many there are if only 50 to 10 are actually used. Also need the indexes on the tables, and of course this may mean adding additional columns to the DDL for the table(s).

Second, how often is this query run and is there any filtering done to reduce the data returned?

JohnG69 Hall of Fame Points: 3114 More actions · Answer 8

I will try to work on sample data and DDL info.... and post it tonight.

thanks

JG

JohnG69 Hall of Fame Points: 3114 More actions · Answer 9

Here it is... This script creates and populates the necessary tables to optimize the final query.

The final Query spills data to Tempdb which I'm trying to avoid.

CREATE TABLE [dbo].[Master_TAB](

[UPC] [varchar](13) NOT NULL,

[LocGroup] [varchar](5) NOT NULL,

[priceLvl] [int] NOT NULL,

[Price] [money] NULL,

CONSTRAINT [Master_TAB_P01] PRIMARY KEY CLUSTERED

([UPC] ASC,

[LocGroup] ASC,

[PriceLvl] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [MASTER_TAB_LocGroup] ON [dbo].[Master_TAB]

([LocGroup] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

CREATE NONCLUSTERED INDEX [MASTER_TAB_PriceLvl] ON [dbo].[Master_TAB]

([PriceLvl] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

GO

insert into [Master_TAB] values ('ABC','ALL',1,10.25)

insert into [Master_TAB] values ('ABC','00002',1,10.55)

insert into [Master_TAB] values ('ABC','Grp1',1,10.75)

insert into [Master_TAB] values ('DEF','ALL',1,10.25)

insert into [Master_TAB] values ('DEF','00001',1,10.55)

insert into [Master_TAB] values ('DEF','Grp2',1,10.75)

-- insert 10000 skus

Declare @Cnt int

Set @Cnt = 0

while @cnt <10000

begin

insert into [Master_TAB]

Select 'ABC' + convert(varchar(10),@cnt), 'ALL',1,10.25

set @cnt = @cnt + 1

end

CREATE TABLE [dbo].[GroupLoc](

[GroupLoc] [varchar](5) NOT NULL,

[Priority] [int] NULL,

CONSTRAINT [GroupLoc_GroupLoc] PRIMARY KEY CLUSTERED

([GroupLoc] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

insert into GroupLoc values ('00001', 10)

insert into GroupLoc values ('00002', 10)

insert into GroupLoc values ('00003', 10)

insert into GroupLoc values ('00004', 10)

insert into GroupLoc values ('All', 5)

insert into GroupLoc values ('grp1', 15)

insert into GroupLoc values ('grp2', 11)

insert into GroupLoc values ('grp3', 13)

CREATE TABLE [dbo].GroupLocLink(

[GroupLoc] [varchar](5) NOT NULL,

[Loc] [varchar](4) NOT NULL,

[Terminal] [varchar](4) NOT NULL,

CONSTRAINT [GroupLocLink_GroupLoc] PRIMARY KEY CLUSTERED

([GroupLoc] ASC,

[Loc] ASC,

[Terminal] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [GroupLocLink_Loc] ON [dbo].GroupLocLink

([Loc] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

CREATE NONCLUSTERED INDEX [GroupLocLink_Term] ON [dbo].GroupLocLink

([Terminal] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

insert into GroupLocLink values ('00001','0001','0001')

insert into GroupLocLink values ('00001','0001','0002')

insert into GroupLocLink values ('00002','0002','0001')

insert into GroupLocLink values ('00003','0003','0001')

insert into GroupLocLink values ('00004','0004','0001')

insert into GroupLocLink values ('grp1','0001','0001')

insert into GroupLocLink values ('grp1','0001','0002')

insert into GroupLocLink values ('grp1','0003','0001')

insert into GroupLocLink values ('grp2','0002','0001')

insert into GroupLocLink values ('grp2','0004','0001')

insert into GroupLocLink values ('grp3','0002','0001')

insert into GroupLocLink values ('grp3','0003','0001')

insert into GroupLocLink values ('grp3','0004','0001')

insert into GroupLocLink values ('All','0001','0001')

insert into GroupLocLink values ('All','0001','0002')

insert into GroupLocLink values ('All','0002','0001')

insert into GroupLocLink values ('All','0003','0001')

insert into GroupLocLink values ('All','0004','0001')

SELECT MT.loc+MT.terminal AS LocTerm,MT.Loc ,[priority], M.*

FROM

Master_TAB M JOIN

(SELECT M.UPC,L.Loc,L.terminal,M.PriceLvl,L.GroupLoc, [priority], ROW_NUMBER() OVER (PARTITION BY UPC,PriceLvl,L.LOC,terminal ORDER BY [Priority] DESC) AS rownum

FROM Master_TAB M

JOIN GroupLoc G ON M.LocGroup=G.GroupLoc

JOIN GroupLoclink L ON G.GroupLoc = L.GroupLoc

) MT ON M.UPC=MT.UPC AND M.PriceLvl=MT.PriceLvl AND M.LocGroup=MT.GroupLoc AND MT.rownum=1

ORDER BY MT.loc,MT.terminal,M.UPC;

/* CLean Up

drop table [Master_TAB]

drop table [GroupLoc]

drop table GroupLocLink

*/

drew.allen SSC Guru Points: 77014 More actions · Answer 10

sgmunson (7/10/2015)
Just curious as to whether you might be allowed to add the ROW_NUMBER() column to the table as a computed, persisted column... Might simplify things quite a bit, although you might pay for it on INSERTs. Just a thought...

I don't believe that windowed functions can be used in computed columns, although I can't find anything specifically stating that.

When I tried to create one, I got an error stating that windowed functions can only be used in a SELECT or an ORDER BY clause.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

sgmunson SSC Guru Points: 110651 More actions · Answer 11

drew.allen (7/10/2015)
sgmunson (7/10/2015)
Just curious as to whether you might be allowed to add the ROW_NUMBER() column to the table as a computed, persisted column... Might simplify things quite a bit, although you might pay for it on INSERTs. Just a thought...
I don't believe that windowed functions can be used in computed columns, although I can't find anything specifically stating that.
When I tried to create one, I got an error stating that windowed functions can only be used in a SELECT or an ORDER BY clause.
Drew

Can your column = (SELECT ROW_NUMBER() OVER(PARTITION BY somefield ORDER BY someotherfield) FROM thistablename) ?

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Jacob Wilkins One Orange Chip Points: 27976 More actions · Answer 12

If my memory is correct, it will complain about not being able to use subqueries in that context, and requiring a scalar expression. It makes sense, since a query like that could (and likely would) return many rows.

It is a bit strange that it isn't just like other places where subqueries can be used with expressions that expect one value, and fail only if they return more than 1 value. Probably a choice they made for simplicity's sake.

Cheers!

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 13

Firstly, ROW_NUMBER() doesn't necessarily invoke a sort. Compare the execution plan of these two queries:

SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY [UPC], [LocGroup] ORDER BY PriceLvl) FROM #Master_TAB

SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY [LocGroup], [UPC] ORDER BY PriceLvl) FROM #Master_TAB

In your case the parameters passed to ROW_NUMBER are not sufficiently aligned with an existing

index sort order for SQL Server to avoid a sort.

Secondly, the sort order of the ROW_NUMBER is different to the sort order of the outer query. Even

if the sort order of ROW_NUMBER() was aligned with an index to make a sort operation unnecessary,

you would still have the sort on the outer query. This outer query sort cost is reported in the plan to

be about 2%. If you hardcode rownum as 1, the sort to support ROW_NUMBER is eliminated from the plan

but the cost of the sort on the outer query rises to about 94%, so the gain in performance is way less

than you would expect.

Thirdly, a spill occurs when the actual number of rows being sorted exceeds the estimate. SQL Server

relies on statistics to derive an excution plan. Try updating statistics on the tables used by the query

to see of the plan changes. You could also consider alternative shapes for the query which might assist

the optimiser choose a better plan. Here's an example:

SELECT *

FROM (

SELECT

l.loc + l.terminal AS LocTerm,

l.Loc,

l.[priority],

l.terminal,

m.*,

rownum = ROW_NUMBER() OVER (PARTITION BY l.LOC, l.terminal, m.UPC, m.PriceLvl ORDER BY l.[Priority] DESC)

FROM #Master_TAB M

CROSS APPLY (

SELECT l.loc, l.terminal, g.[priority]

FROM #GroupLoc G

INNER JOIN #GroupLoclink L

ON M.LocGroup = L.GroupLoc

WHERE M.LocGroup = G.GroupLoc

) l

) d

WHERE rownum = 1

ORDER BY d.loc, d.terminal, d.UPC

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

JohnG69 Hall of Fame Points: 3114 More actions · Answer 14

Thanks for your response...

1. the execution plans of the Row_number for:

SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY [UPC], [LocGroup] ORDER BY PriceLvl) FROM #Master_TAB

SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY [LocGroup], [UPC] ORDER BY PriceLvl) FROM #Master_TAB

Tells me that the sort is invoked in the second query due to the fact that there's no index that covers [LocGroup], [UPC]...

The first of the 2 queries doesn't invoke a sort due to the fact that [UPC], [LocGroup] is covered by the PK .

2. correct me if I'm wrong or if I misunderstood, Although the execution plans of the queries are relative, For a query cost, 2 sort spills will have 2 lower costs overall, compared to 1 spill that will have 1 higher %.

conclusion is that sometimes you can't avoid the spills but you need to manage them. right?

3. RE:"Thirdly, a spill occurs when the actual number of rows being sorted exceeds the estimate."

That's one case... In my case the statistics are up to date and the estimated rows match the actual rows.

Based on the articles I read and this thread, We have to try to avoid Spills but sometimes it's just not feasible and you need to minimize the number of spills...

Is this a good conclusion?

thank you

JG