May 21, 2013 at 10:19 am
I've got a large sales table (66m rows) that I need to query in an aggregated way to return a bestseller list for specific dates and stores.
I think i'm running into an issue relating to statistics and the distribution of values, but I'm not sure....
Here is the structure of the tables involved in the query:
tblSales table structure:
[date_id] [int] NOT NULL,
[store_id] [int] NOT NULL,
[product_id] [bigint] NOT NULL,
[volume_sold] [int] NOT NULL
date_id is an int representing a specific date, store_id is the id of the store that made the sale, product_id is the id of the product that sold, volume_sold
is the... volume sold of a specific product on a specific date in a specific store. Pretty straightforward.
tblDates:
date_id int,
calendar_date datetime,
period int
calendar_date is the actual date the date_id represents, period is an int that is used to group the date_ids into weeks (week 1 = period 1, week2 = period 2, etc.)
[dbo].tblStores
[dbo].[tblStores_To_Aggregates]
[dbo].[tblAggregates]
these tables are used to assign groups of stores into aggregates that can then be queried on (ex: west coast aggregate, small stores aggregate, ALL aggregate, etc.)
to speed up the query i've created an indexed view on the data like so:
select
tblDates.period,
product_id,
[tblAggregates].aggregate_id,
SUM(volume_sold) as vol_sold,
COUNT_BIG(*) AS FREQUENCY
from
[dbo].[tblSales] sales
INNER JOIN [dbo].tblStores
ON sales.store_id = tblStores.store_id
INNER JOIN [dbo].[tblStores_To_Aggregates]
ON tblStores.store_id = [tblStores_To_Aggregates].store_id
INNER JOIN [dbo].[tblAggregates]
ON [tblStores_To_Aggregates].aggregate_id = [tblAggregates].aggregate_id
INNER JOIN [dbo].tblDates
ON tblDates.date_id = sales.date_id
group by
[tblAggregates].aggregate_id,
tblDates.period,
product_id
And then a clustered index on the view in the form:
[aggregate_id] ASC,
[period] ASC,
[product_id] ASC
And here is the query:
select top 1000
sales.product_id,
SUM(volume_sold) as vol_sold,
RANK() OVER (ORDER BY SUM(volume_sold) DESC) as product_rank
FROM
[dbo].[tblSales] sales
INNER JOIN [dbo].tblStores
ON sales.store_id = tblStores.store_id
INNER JOIN [dbo].[tblStores_To_Aggregates]
ON tblStores.store_id = [tblStores_To_Aggregates].store_id
INNER JOIN [dbo].[tblAggregates]
ON [tblStores_To_Aggregates].aggregate_id = [tblAggregates].aggregate_id
INNER JOIN [dbo].tblDates
ON tblDates.date_id = sales.date_id
WHERE
period >= 460
and period <= 470
and [tblAggregates].aggregate_id = 1
group by
product_id
order by
product_rank asc
The issue comes when varying the aggregate_id.
The aggregate could contain many stores, or only a few: aggregate_id 1 has 1878 stores. 2 has 93.
When querying on id 2, the query returns in subsecond and 1 takes 30+ seconds.
Looking at the query plan it seems like the issue in caused by a poor estimation of the amount of rows that will come back.
The plan for 2 has an estimation that is close. But for 1 it's way off. This is causing it to spill to tempdb and slow down the query.
My guess is that the problem is related to the statistics (they are up to date btw) and them giving poor cardinality estimations.
I'm kind of at a loss at what to do next. Any advice or suggestions would be appreciated.
May 23, 2013 at 12:25 pm
Have you tried turning this into a stored procedure and including the WITH RECOMPILE option when you create it? This will force the optimiser to recalc the execution plan every time the procedure runs. It may be that the optimiser is currently reusing the same plan (optimised for the smaller query) hence causing the performance hit.
One other option would be to break this one query into separate smaller query steps (if that's an option for you) E.g:
1 Select list of stores into @tablevariable
2 Select list of products by store
3 etc...
For my sins we still use a 2000 server so run into performance issues all the time. RANK is a luxury we don't have 😉
May 23, 2013 at 12:34 pm
Thanks for the reply. I tried SPs before to no avail.
Right now I've seen some performance gains by referring to the indexed view directly in the query like so:
select
product_id,
SUM(vol_sold) as vol_sold,
RANK() OVER (ORDER BY SUM(vol_sold) DESC) as product_rank
from
[dbo].[view_SalesData_Weekly] WITH (NOEXPAND)
WHERE
period >= 460
and period <= 480
and aggregate_id = 1
group by
product_id
Even though SQL was already using the indexed view in the query i posted before, it appears that by specifying the view directly that more accurate statistics are found. I suspect that the joins are what was causing the issue.
May 23, 2013 at 1:02 pm
Ah, didn't realise you weren't already referencing the view directly. The optimiser is smart, but sometimes it does need a helping hand. It usually finds reasons not to use a materialised view (hence the no expand option) rather than always seeking them out.
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply