April 8, 2010 at 9:25 am
I have a reporting table with the previous two years' history of sales for selected markets and products. The table has 125 million rows, and will grow at about 5 million rows per month. All future months will be added at the 'bottom' of the table -- that is, there will be no data for past time periods inserted. At the moment the table structure is as follows:
CREATE TABLE [dbo].[IRI](
[Geography] [varchar](100) NULL,
[Product] [varchar](100) NULL,
[AsOf] [int] NULL,
[UPC_1] [bigint] NULL,
[WtdDist_1] [decimal](16, 8) NULL,
[DSales_1] [decimal](16, 8) NULL,
[CSales_1] [decimal](16, 8) NULL,
[USales_1] [decimal](16, 8) NULL,
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[GEOID] [int] NULL,
[PRODUCTID] [int] NULL,
CONSTRAINT [PK_IRI_ID] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
In an ideal world, GeoID and ProductID would not be nullable, but loading the data took precedence, so I have a problem I will later need to solve. GeoID is fully populated, and used in the query below; ProductID has some null values, and is not used in the query.
I also am not pleased that the clustered index is the identity column.
I think I would prefer it to be a combination of GeoID (the market)/ProductID/AsOf (time period, represented as 20090117).
A typical query is:
SELECT a.Geography
,a.Product
,a.AsOf
,a.UPC_1
,a.WtdDist_1
,a.Dsales_1
,a.Csales_1
,a.Usales_1
,b.Brand
,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2)))
AS Date
FROM IRI a
JOIN #t b ON a.UPC_1=b.UPC_1_A
WHERE a.AsOf IN
(SELECT [dtInt]
FROM [model].[dbo].[Calendar]
WHERE dt BETWEEN @AsOf-30 AND @AsOf+30
)
AND Geography=@Geo
The temp table #t is loaded before with the relevant subset of products, and is typically about 3-20 line items, so is very quick.
I created one index:
CREATE NONCLUSTERED INDEX [InNc_IRI_UPCwAsOfGeoID] ON dbo.IRI
(
[UPC_1] ASC
)
INCLUDE ( [AsOf],[GEOID])
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
My reasoning was UPC_1 has the highest number of unique values, GeoID the next most, then AsOf.
MY QUESTION: Does anyone have a suggestion for an index that would be optimal for this query. Many thanks. D. Lewis
April 8, 2010 at 9:42 am
Have you considered partitioning this table by date? If that's not a option then you will want either a clustered or non-clustered index on the date field. At my place of business, we recently took a table of 500 million records (and growing) and changed the clustered index to the date field which made huge improvements in most queries and didn't have major impact on other queries being run. But, for the long term strategy, we're considering breaking the table into "current data" (most recent 5 years) vs. "historical data" (all history) and then partitioning the current data.
April 8, 2010 at 9:49 am
Yes, I am considering partioning it, but I cannot begin thinking about that for at least a few months. The loading etc. of this table was so stressful (on both people and systems) that I am very reluctant to simply add an index without some good reason. I suspect, too, that an index on the date field would be 'good', but I would be thankful for some specific advice based on the query I posted.
For example, is the index with the included columns really helping things? I don't have the experience to know. Is there a better candidate given the query/table structure I posted? TIA D. Lewis
April 8, 2010 at 11:07 am
You should define a clustered index on GeoID, ProductID, AsOf just as you suggested yourself.
You should also change the query so the SQL server understands that you are really asking for a range in AsOf. The current join with the Calender is bad for perf.
Perhaps you could use:
WHERE a.AsOf BETWEEN @AsOf-30 AND @AsOf+30
This will probably give you pretty good perf as long as you ask for a small set of products.
To get even petter perf you should also partition the table on AsOf. Note that to gain anything from the partitioning it is vital that you change the query as I described above.
If you dont want to rebuild the clustered index right now, you could also create a noncustered index with GeoID, ProductID, AsOf. For queries that return a small number of rows this index will be very good.
April 8, 2010 at 9:34 pm
Stefan_G (4/8/2010)
You should define a clustered index on GeoID, ProductID, AsOf just as you suggested yourself.You should also change the query so the SQL server understands that you are really asking for a range in AsOf. The current join with the Calender is bad for perf.
Perhaps you could use:
WHERE a.AsOf BETWEEN @AsOf-30 AND @AsOf+30
This will probably give you pretty good perf as long as you ask for a small set of products.
To get even petter perf you should also partition the table on AsOf. Note that to gain anything from the partitioning it is vital that you change the query as I described above.
If you dont want to rebuild the clustered index right now, you could also create a noncustered index with GeoID, ProductID, AsOf. For queries that return a small number of rows this index will be very good.
Just a thought... with that particular clustered index, the addition of new rows can and will cause page splits. You sure you want to do a rebuild on 125 million rows to fix that now and again?
To prevent the page splits, save the clustered index for something that matches the temporal nature of the inserts so the new rows are always added near the logical "end" of the table. Use an unclustered index in place of the clustered index you propose.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 8, 2010 at 9:51 pm
Jeff,
I just read your response and I appreciate you input.
Regards,
Welsh
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
April 9, 2010 at 8:46 am
I wondered about that (page splits) last night as I couldn't sleep. Thanks for confirming it.
In the meantime, an index on AsOf (the date INT field) improved the query performance immensely, so I have a little breathing room to work out a better table and index structure.
Thanks. D. Lewis
April 9, 2010 at 9:39 am
Have you also considered limitingthe dates via a join instead of the where criteria? A join is typically more powerful than a between or in list.
SELECT a.Geography
,a.Product
,a.AsOf
,a.UPC_1
,a.WtdDist_1
,a.Dsales_1
,a.Csales_1
,a.Usales_1
,b.Brand
,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2) ))
AS Date
FROM IRI a
inner join (Select [dtInt] from model.dbo.Calendar where dt between @AsOf-30 AND @AsOf+30) dt
on a.AsOf = dt.dtInt
JOIN #t b ON a.UPC_1=b.UPC_1_A
WHERE Geography=@Geo
April 9, 2010 at 10:00 am
Thanks. This morning's task is to refine that date lookup method. I am going to test your join suggestion vs a CTE vs my original between. D. Lewis
April 9, 2010 at 11:42 am
Results of performance testing with and without a CTE:
Without (the question was whether a BETWEEN clause for the date would slow down things):
SELECT a.Geography
,a.Product
,a.AsOf
,a.UPC_1
,a.WtdDist_1
,a.Dsales_1
,a.Csales_1
,a.Usales_1
,b.Brand
,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2)))
AS Date
FROM IRI a
JOIN #t b ON a.UPC_1=b.UPC_1_A
WHERE a.AsOf BETWEEN @dmin AND @dmax
AND a.GeoID=@Geography
resulted in:
Table 'IRI'. Scan count 6, logical reads 477, physical reads 4, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
----------------------------------------------
With CTE:
WITH asof_cte(dtInt)
AS
(
SELECT [dtInt]
FROM [model].[dbo].[Calendar]
WHERE dt BETWEEN @AsOf-30 AND @AsOf+30
)
SELECT a.Geography
,a.Product
,a.AsOf
,a.UPC_1
,a.WtdDist_1
,a.Dsales_1
,a.Csales_1
,a.Usales_1
,b.Brand
,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2)))
AS Date
FROM IRI a
JOIN #t b ON a.UPC_1=b.UPC_1_A
JOIN asof_cte c ON a.AsOf=c.dtInt
WHERE a.Geoid=@Geography
resulted in
Table 'IRI'. Scan count 6, logical reads 3500, physical reads 8, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
So in this instance, the BETWEEN clause is a clear winner.
April 9, 2010 at 11:44 am
Awesome! Thanks for posting!
April 9, 2010 at 5:17 pm
Welsh Corgi (4/8/2010)
Jeff,I just read your response and I appreciate you input.
Regards,
Welsh
Thanks, Welsh.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 9, 2010 at 5:20 pm
Hey folks... as a sidebar to all of this, since the Integer date is in the yyyymmdd ISO format, the formula to convert the Integer date to the DATETIME datatype needn't be so complicated. The following will do the trick just fine...
DECLARE @IntDate INT
SELECT @IntDate = 20090117
SELECT CAST(CAST(@IntDate AS CHAR(8)) AS DATETIME)
There's also a very high speed method that uses no character based conversions but I can't put my hands on it just now (still looking for it).
--Jeff Moden
Change is inevitable... Change for the better is not.
April 12, 2010 at 8:48 am
Ah, yes. Thanks. I suspected there was a better way...
April 12, 2010 at 8:54 am
Jeff Moden (4/9/2010)
There's also a very high speed method that uses no character based conversions but I can't put my hands on it just now (still looking for it).
Please post that method if you find it. Sounds interesting.
/SG
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply