Query Performance on huge view

  • Hi,

    I have some tables that are used as fact tables for a SSAS cube.

    For space reasons, the fact data is normalized into two fact tables plus two dimension tables that are referenced via foreign keys.

    All tables have one column 'Version' that contains an integer that is unique.

    Both fact tables have only one clustered index that contains only this version field.

    The dimension tables have at least one key where the version is the first field.

    The Cube uses a view to access the data that is a JOIN between all fact and dimension tables mentioned above. For selection, the Cube uses a 'where Version = x' restriction on the view.

    The result from the view contains about 4 million records for each version, whereas the two fact tables have approx. 60k records each and the dimension tables have 4k and 60k records, respectively.

    My problem is that the view is extremely slow when it comes to queries like

    SELECT COUNT(1) FROM View WHERE Version=@version

    The estimated execution plan tells me that it would use only index seeks, parallelism, bitmaps and hash matches.

    But when it comes to the actual execution plan, it shows nested loops and an enormous number of intermediate rows (up to 30 billion).

    What can I do to improve the view's performance?

    Guenter from Frankfurt / Germany

  • Are you able to post the code for the view and the actual execution plan?

    How many distinct values of version make up the clustered key on the fact table?

    Also, using COUNT(1) serves no advantage, as the query optimizer converts it to COUNT(*).

  • Hi Steve,

    all tables for the view contain about 100 distinct versions; all tables are loaded together, and versions are always loaded in full.

    The view is (about) as follows:

    select
        sen.DatumID
        ,sen.PositionID
        ,sen.Riskofaktor1ID
        ,sen.Riskofaktor2ID
        ,pos.InstrumentID
        ,sen.GeschaeftID as RisikofaktorGeschaeftID
        ,pos.GeschaeftID
        ,pos.KategorieID
        ,pos.PortfolioID
        ,sen.SensiTypID
        -- more columns from sen
        ,pa.AggregationID
    from dbo.FaktSensitivitaet AS sen
    inner join dbo.FaktPosition pos on sen.PositionID = pos.PositionID and sen.DatumID = pos.DatumID
    inner join dbo.ZuoPositionAggregation pa on pos.PositionID = pa.PositionID and pos.DatumID = pa.DatumID
    inner join dbo.ZuoRisikofaktorAggregation ra on sen.Riskofaktor1ID = ra.RisikofaktorID and sen.DatumID = ra.DatumID
        and ra.AggregationID = pa.AggregationID and ra.DatumID = pa.DatumID

    The underlying tables contain roughly the following number of records per version (DatumID):

    • FaktPosition 65,000
    • FaktSensitivitaet 400,000
    • ZuoPositionAggregation 3.8 million
    • ZuoRisikofaktorAggregation 1.8 million

    The primary use for the view is for Cube partition processing, where the partition query statement is something like

    select * from dbo.VWFaktSensitivitaetAggregation where DatumID <= 2010053199 and DatumID > 2010052700

    This query starts delivering records within the first second.

    To examine if there is any data in the partition, I take the query from the cube definition and put a wrapper around it:

    select top 1 1 from
    (
        select * from dbo.VWFaktSensitivitaetAggregation where DatumID=2010053100
    ) X

    This produces the first execution plan (Execution Plan ViewSensiAgg TOP 1.sqlplan). What is strange to me is that it takes about 3 minutes before it delivers its single "1" as a result.

    Even stranger: if I modify my query with a "TOP 1000"

    select top 1000 1 from
    (
        select * from dbo.VWFaktSensitivitaetAggregation where DatumID=2010053100
    ) X

    it is significantly faster and produces the second execution plan (Execution Plan ViewSensiAgg TOP 1000.sqlplan).

    Unfortunately, it seems that as the data in the tables grows, even the "top 1000" query takes minutes to respond...

    Guenter

  • The problem here seems to be caused by bad statistics.

    The seek from ZuoPositionAggregation has estimated rows = 18, but actual rows = 113,566.

    The seek from ZuoRisikofaktorAggregation has estimated rows = 1.8 M, but actual rows = 14.5 M.

    This results in a very bad estimate on the number of rows on the bottom input to the nested loop:

    estimated rows = 1.8 M, actual rows = 2,165 M

    So, to get a better execution plan you should make sure that the statistics on ZuoPositionAggregation and ZuoRisikofaktorAggregation are fully updated on the DatumID column.
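
    A quick way to check when those statistics were last updated (using sys.stats and STATS_DATE; swap in the other table name as needed):

    SELECT s.name AS stats_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID('dbo.ZuoPositionAggregation');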

    Have you updated the statistics?

    I guess you need to update the statistics for

    ZuoPositionAggregation (3.8 million rows per version)

    ZuoRisikofaktorAggregation (1.8 million rows per version)

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

    Thanks for your responses.

    I have seen the bad statistics you pointed me to.

    Until now I believed I had updated the statistics; between every load into the database and the processing of the cube, the following stored procedure is called:

    ALTER PROCEDURE [dbo].[up_UpdateStatisticsOnFactTables]
    with execute as self -- needs elevated rights, so use the installer user
    AS
    BEGIN
        update statistics dbo.ZuoPositionAggregation WITH SAMPLE 30000 ROWS, ALL, NORECOMPUTE
        update statistics dbo.ZuoRisikofaktorAggregation WITH SAMPLE 30000 ROWS, ALL, NORECOMPUTE
        update statistics dbo.FaktPosition WITH SAMPLE 30000 ROWS, ALL, NORECOMPUTE
        update statistics dbo.FaktSensitivitaet WITH SAMPLE 30000 ROWS, ALL, NORECOMPUTE
        -- and all the other fact tables
    END

    I hoped that would do the job...

    What's wrong with it?

  • gtschech (6/15/2010)


    I hoped that would do the job... What's wrong with it?

    It is possible that sample 30000 rows is not enough.

    Try using FULLSCAN instead and see if it helps.

    The tables are fairly small so the time required for a FULLSCAN should probably be acceptable.
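
    For example, just for the two tables with the worst estimates (repeat for the other fact tables as needed):

    UPDATE STATISTICS dbo.ZuoPositionAggregation WITH FULLSCAN;
    UPDATE STATISTICS dbo.ZuoRisikofaktorAggregation WITH FULLSCAN;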

  • Stefan and Bhuvnesh,

    your guess was right. I did an UPDATE STATISTICS WITH FULLSCAN and the query is fast again.

    Thanks for the hint - now I know that a sample size of 30,000 is obviously too small.

    Unfortunately, Stefan's remark "the tables are fairly small" misses the point - the Zuo tables are 40 and 30 GB, respectively, containing 735 million and 350 million records, so the update took about one hour...

    So I have to find a way to update the statistics more quickly, but with the same quality.
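
    One idea might be to run the full scan only against the statistics that lead on DatumID instead of ALL columns - a sketch, where the index names are only placeholders for whatever index or statistics object actually leads on DatumID:

    UPDATE STATISTICS dbo.ZuoPositionAggregation (IX_ZuoPositionAggregation_DatumID) WITH FULLSCAN;
    UPDATE STATISTICS dbo.ZuoRisikofaktorAggregation (IX_ZuoRisikofaktorAggregation_DatumID) WITH FULLSCAN;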

  • I see.

    I never understood what you meant by your "Version".

    I now understand that the number of rows you were talking about originally should be multiplied by the number of versions.

    I suppose you need to find a balance between statistics accuracy and time to update the statistics.

  • Hi,

    I started playing with the sample size. But even a sample size of 30,000 now gives excellent performance for my query...

    It seems I have to clear some SQL Server cache before re-querying?
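
    For re-testing, I guess I could flush the caches between runs on the test system - something like this (definitely not on production):

    CHECKPOINT;               -- write dirty pages to disk first
    DBCC DROPCLEANBUFFERS;    -- empty the buffer pool so data is read from disk again
    DBCC FREEPROCCACHE;       -- throw away all cached execution plans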

    Besides that, I played with the TOP operator.

    It looks very strange...

    When I do a TOP 1000 on the query, the execution plan shows me an expected number of 400 records on the FaktPosition table, whereas the actual number returned is 133,879.

    When I do a TOP 100 on the very same query, the execution plan expects 1,200,000 records, but gets 17,438,000!

    These figures are all the more strange as the query only filters by DatumID, so it should expect AND deliver the very same number of records for both approaches.

    To make this even better: the actual number of rows in the FaktPosition table that match the DatumID used is 6,800,000 ...

    SQL Server seems to pick data randomly :ermm:

    I have attached the SQL plans for the TOP 100 and TOP 1000 queries, respectively.

  • I think that you have to accept that in this case it is very difficult for the server to get good estimates of the number of rows needed.

    When you use TOP 100 the server tries to estimate how many loop iterations will be needed to get 100 rows in the result set. If the guess is wrong, the results might get very bad.

    Estimating the number of rows returned by a single condition is fairly easy with good statistics. It is much more difficult to estimate the number of rows resulting from a join between two large tables. It gets even more difficult when there are more than two tables involved.

    Maybe you should try another approach.

    If I understand you correctly, you are really only trying to detect whether there is data for a certain date in all the tables.

    Perhaps you could just use a count of the number of matching rows based on the date in each source table. At least this will be very fast since you have an index on the DatumID column.

    The problem is caused by the join of three really big tables.
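
    A rough sketch of what I mean by per-table counts, reusing the DatumID value from your earlier example:

    select
        (select count(*) from dbo.FaktPosition               where DatumID = 2010053100) as PositionRows,
        (select count(*) from dbo.FaktSensitivitaet          where DatumID = 2010053100) as SensitivitaetRows,
        (select count(*) from dbo.ZuoPositionAggregation     where DatumID = 2010053100) as PositionAggRows,
        (select count(*) from dbo.ZuoRisikofaktorAggregation where DatumID = 2010053100) as RisikofaktorAggRows;

    If any of these counts is zero, the view cannot return any rows for that DatumID, because it only uses inner joins that all include DatumID.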

  • Stefan,

    I'm afraid you're right.

    The bad thing is that my approach is the only way to stay somewhat generic about the view. I am using this inside a custom SSIS component that creates partitions in a cube. The only information about the underlying data is the partition query of the cube partition, and that is just the SELECT * FROM View. The component has no idea how the view is structured, or even whether it is a view at all; for other facts the respective partition goes directly against a table...

    So it comes down to the old truth - don't be generic when performance matters 🙁

    Nevertheless - many thanks for your suggestions!

  • Günter,

    did you try using an EXISTS query to see if there is any data in your view?

    Something like

    IF EXISTS (select 1 from dbo.VWFaktSensitivitaetAggregation where DatumID=2010053100) ...

    Since all you want to do is to check for existence I'd give it a try and compare the duration with the other options.
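
    To compare the different options you could simply wrap each of them in SET STATISTICS TIME, e.g.:

    SET STATISTICS TIME ON;

    IF EXISTS (select 1 from dbo.VWFaktSensitivitaetAggregation where DatumID=2010053100)
        PRINT 'partition has data';

    SET STATISTICS TIME OFF;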



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

    So you finally got some improvement after updating the statistics, right?

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Yes,

    updating the statistics did help.

    But I wonder whether I should switch to an indexed view to get better retrieval performance from the view.

    Have you ever done that?
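
    From what I have read so far, that would mean recreating the view WITH SCHEMABINDING and putting a unique clustered index on it - a rough sketch, with the column list shortened and the index key columns only my guess at a combination that is unique per row:

    CREATE VIEW dbo.VWFaktSensitivitaetAggregation
    WITH SCHEMABINDING
    AS
    select
        sen.DatumID
        ,sen.PositionID
        ,sen.Riskofaktor1ID
        ,pa.AggregationID
        -- plus the remaining columns, as in the current view
    from dbo.FaktSensitivitaet AS sen
    inner join dbo.FaktPosition pos on sen.PositionID = pos.PositionID and sen.DatumID = pos.DatumID
    inner join dbo.ZuoPositionAggregation pa on pos.PositionID = pa.PositionID and pos.DatumID = pa.DatumID
    inner join dbo.ZuoRisikofaktorAggregation ra on sen.Riskofaktor1ID = ra.RisikofaktorID and sen.DatumID = ra.DatumID
        and ra.AggregationID = pa.AggregationID
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_VWFaktSensitivitaetAggregation
        ON dbo.VWFaktSensitivitaetAggregation (DatumID, PositionID, Riskofaktor1ID, AggregationID);

    The downside I see is that such an index would be maintained on every load into the underlying tables, which would slow down the loads and need extra space.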

    But maybe I should first browse the site 😉

    Thanks for your suggestions!
