Index suggestion for VERY large table

Question

Index suggestion for VERY large table

DavidL

Hall of Fame

Points: 3582
More actions
April 8, 2010 at 9:25 am

#213619

I have a reporting table with the previous two years' history of sales for selected markets and products. The table has 125 million rows, and will grow at about 5 million rows per month. All future months will be added at the 'bottom' of the table -- that is, there will be no data for past time periods inserted. At the moment the table structure is as follows:
CREATE TABLE [dbo].[IRI](
[Geography] [varchar](100) NULL,
[Product] [varchar](100) NULL,
[AsOf] [int] NULL,
[UPC_1] [bigint] NULL,
[WtdDist_1] [decimal](16, 8) NULL,
[DSales_1] [decimal](16, 8) NULL,
[CSales_1] [decimal](16, 8) NULL,
[USales_1] [decimal](16, 8) NULL,
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[GEOID] [int] NULL,
[PRODUCTID] [int] NULL,
CONSTRAINT [PK_IRI_ID] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
In an ideal world, GeoID and ProductID would not be nullable, but loading the data took precedence, so I have a problem I will later need to solve. GeoID is fully populated, and used in the query below; ProductID has some null values, and is not used in the query.
I also am not pleased that the clustered index is the identity column.
I think I would prefer it to be a combination of GeoID (the market)/ProductID/AsOf (time period, represented as 20090117).
A typical query is:
SELECT a.Geography
,a.Product
,a.AsOf
,a.UPC_1
,a.WtdDist_1
,a.Dsales_1
,a.Csales_1
,a.Usales_1
,b.Brand
,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2)))
AS Date
FROM IRI a
JOIN #t b ON a.UPC_1=b.UPC_1_A
WHERE a.AsOf IN
(SELECT [dtInt]
FROM [model].[dbo].[Calendar]
WHERE dt BETWEEN @AsOf-30 AND @AsOf+30
)
AND Geography=@Geo
The temp table #t is loaded before with the relevant subset of products, and is typically about 3-20 line items, so is very quick.
I created one index:
CREATE NONCLUSTERED INDEX [InNc_IRI_UPCwAsOfGeoID] ON dbo.IRI
(
[UPC_1] ASC
)
INCLUDE ( [AsOf],[GEOID])
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
My reasoning was UPC_1 has the highest number of unique values, GeoID the next most, then AsOf.
MY QUESTION: Does anyone have a suggestion for an index that would be optimal for this query. Many thanks. D. Lewis

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

epriddy Old Hand Points: 343 More actions · Answer 1

Have you considered partitioning this table by date? If that's not a option then you will want either a clustered or non-clustered index on the date field. At my place of business, we recently took a table of 500 million records (and growing) and changed the clustered index to the date field which made huge improvements in most queries and didn't have major impact on other queries being run. But, for the long term strategy, we're considering breaking the table into "current data" (most recent 5 years) vs. "historical data" (all history) and then partitioning the current data.

DavidL Hall of Fame Points: 3582 More actions · Answer 2

Yes, I am considering partioning it, but I cannot begin thinking about that for at least a few months. The loading etc. of this table was so stressful (on both people and systems) that I am very reluctant to simply add an index without some good reason. I suspect, too, that an index on the date field would be 'good', but I would be thankful for some specific advice based on the query I posted.

For example, is the index with the included columns really helping things? I don't have the experience to know. Is there a better candidate given the query/table structure I posted? TIA D. Lewis

Stefan_G SSCertifiable Points: 6609 More actions · Answer 3

You should define a clustered index on GeoID, ProductID, AsOf just as you suggested yourself.

You should also change the query so the SQL server understands that you are really asking for a range in AsOf. The current join with the Calender is bad for perf.

Perhaps you could use:

WHERE a.AsOf BETWEEN @AsOf-30 AND @AsOf+30

This will probably give you pretty good perf as long as you ask for a small set of products.

To get even petter perf you should also partition the table on AsOf. Note that to gain anything from the partitioning it is vital that you change the query as I described above.

If you dont want to rebuild the clustered index right now, you could also create a noncustered index with GeoID, ProductID, AsOf. For queries that return a small number of rows this index will be very good.

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 4

Stefan_G (4/8/2010)
You should define a clustered index on GeoID, ProductID, AsOf just as you suggested yourself.
You should also change the query so the SQL server understands that you are really asking for a range in AsOf. The current join with the Calender is bad for perf.
Perhaps you could use:
WHERE a.AsOf BETWEEN @AsOf-30 AND @AsOf+30
This will probably give you pretty good perf as long as you ask for a small set of products.
To get even petter perf you should also partition the table on AsOf. Note that to gain anything from the partitioning it is vital that you change the query as I described above.
If you dont want to rebuild the clustered index right now, you could also create a noncustered index with GeoID, ProductID, AsOf. For queries that return a small number of rows this index will be very good.

Just a thought... with that particular clustered index, the addition of new rows can and will cause page splits. You sure you want to do a rebuild on 125 million rows to fix that now and again?

To prevent the page splits, save the clustered index for something that matches the temporal nature of the inserts so the new rows are always added near the logical "end" of the table. Use an unclustered index in place of the clustered index you propose.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Welsh Corgi SSC Guru Points: 116520 More actions · Answer 5

Jeff,

I just read your response and I appreciate you input.

Regards,

Welsh

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

DavidL Hall of Fame Points: 3582 More actions · Answer 6

I wondered about that (page splits) last night as I couldn't sleep. Thanks for confirming it.

In the meantime, an index on AsOf (the date INT field) improved the query performance immensely, so I have a little breathing room to work out a better table and index structure.

Thanks. D. Lewis

epriddy Old Hand Points: 343 More actions · Answer 7

Have you also considered limitingthe dates via a join instead of the where criteria? A join is typically more powerful than a between or in list.

SELECT a.Geography

,a.Product

,a.AsOf

,a.UPC_1

,a.WtdDist_1

,a.Dsales_1

,a.Csales_1

,a.Usales_1

,b.Brand

,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2) ))

AS Date

FROM IRI a

inner join (Select [dtInt] from model.dbo.Calendar where dt between @AsOf-30 AND @AsOf+30) dt

on a.AsOf = dt.dtInt

JOIN #t b ON a.UPC_1=b.UPC_1_A

WHERE Geography=@Geo

DavidL Hall of Fame Points: 3582 More actions · Answer 8

Thanks. This morning's task is to refine that date lookup method. I am going to test your join suggestion vs a CTE vs my original between. D. Lewis

DavidL Hall of Fame Points: 3582 More actions · Answer 9

Results of performance testing with and without a CTE:

Without (the question was whether a BETWEEN clause for the date would slow down things):

SELECT a.Geography

,a.Product

,a.AsOf

,a.UPC_1

,a.WtdDist_1

,a.Dsales_1

,a.Csales_1

,a.Usales_1

,b.Brand

,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2)))

AS Date

FROM IRI a

JOIN #t b ON a.UPC_1=b.UPC_1_A

WHERE a.AsOf BETWEEN @dmin AND @dmax

AND a.GeoID=@Geography

resulted in:

Table 'IRI'. Scan count 6, logical reads 477, physical reads 4, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

----------------------------------------------

With CTE:

WITH asof_cte(dtInt)

AS

(

SELECT [dtInt]

FROM [model].[dbo].[Calendar]

WHERE dt BETWEEN @AsOf-30 AND @AsOf+30

)

SELECT a.Geography

,a.Product

,a.AsOf

,a.UPC_1

,a.WtdDist_1

,a.Dsales_1

,a.Csales_1

,a.Usales_1

,b.Brand

,CONVERT(DATETIME,(SUBSTRING(CONVERT(CHAR(8),a.AsOf),5,2)+'/'+RIGHT(CONVERT(CHAR(8),a.AsOf),2)+'/'+LEFT(CONVERT(CHAR(8),a.AsOf),2)))

AS Date

FROM IRI a

JOIN #t b ON a.UPC_1=b.UPC_1_A

JOIN asof_cte c ON a.AsOf=c.dtInt

WHERE a.Geoid=@Geography

resulted in

Table 'IRI'. Scan count 6, logical reads 3500, physical reads 8, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So in this instance, the BETWEEN clause is a clear winner.

epriddy Old Hand Points: 343 More actions · Answer 10

epriddy

Old Hand

Points: 343

April 9, 2010 at 11:44 am

#1149296

Awesome! Thanks for posting!

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 11

Welsh Corgi (4/8/2010)
Jeff,
I just read your response and I appreciate you input.
Regards,
Welsh

Thanks, Welsh.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 12

Hey folks... as a sidebar to all of this, since the Integer date is in the yyyymmdd ISO format, the formula to convert the Integer date to the DATETIME datatype needn't be so complicated. The following will do the trick just fine...

DECLARE @IntDate INT

SELECT @IntDate = 20090117

SELECT CAST(CAST(@IntDate AS CHAR(8)) AS DATETIME)

There's also a very high speed method that uses no character based conversions but I can't put my hands on it just now (still looking for it).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

DavidL Hall of Fame Points: 3582 More actions · Answer 13

Ah, yes. Thanks. I suspected there was a better way...

Stefan_G SSCertifiable Points: 6609 More actions · Answer 14

Jeff Moden (4/9/2010)
There's also a very high speed method that uses no character based conversions but I can't put my hands on it just now (still looking for it).

Please post that method if you find it. Sounds interesting.

/SG