current and time-expired rows.

  • In the topic:

    overlapping date ranges

    ChrisM@Work suggested 'starting a new thread'; this is that thread.

    With the following data:

    --=====================================================================================================================

    -- Create a 10 million row test table.

    --=====================================================================================================================

    --===== If the test table exists, drop it to make reruns in SSMS easier

    SET NOCOUNT ON;

    IF OBJECT_ID(N'TestTable') IS NOT NULL

    DROP TABLE TestTable

    ;

    GO

    --===== Create and populate a 10 million row test table on the fly.

    -- 1,000,000 random IDs with ~10 random date spans of 0 to 14 days each

    Declare @widestring varchar(4000) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    SET @widestring = @widestring+@widestring+@widestring+@widestring

    SET @widestring = @widestring+@widestring+@widestring+@widestring

    -- print datalength(@widestring)

    ;

    WITH

    cteGenDates AS

    (

    SELECT TOP 10000000 -- 10E6 rows

    SubscriberID = ABS(CHECKSUM(NEWID()))%1000000+1

    ,StartDate = DATEADD(dd,ABS(CHECKSUM(NEWID()))%DATEDIFF(dd,'2000','20160310'),'2000')

    ,Span = ABS(CHECKSUM(NEWID()))%(15) -- random span of 0 to 14 days.

    ,realism = ABS(CHECKSUM(NEWID()))%(36) -- random offset, used only to shuffle the insert order below.

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    )

    SELECT SubscriptionID = IDENTITY(INT,1,1)

    ,SubscriberID

    ,StartDate

    ,EndDate = DATEADD(dd,Span,StartDate)

    ,substring(@widestring,ABS(CHECKSUM(NEWID()))%400, 999) wide -- variable-width filler column

    INTO TestTable

    FROM cteGenDates

    ORDER BY (StartDate + realism) -- for a bit of realism

    ;

    update TestTable set EndDate = NULL where EndDate > '20160301' and SubscriberID%3 = 1

    -- The basis for this data was supplied by Jeff Moden.

    -- Changes:

    -- Entries are in the past or the near future.

    -- The period of a subscription is shortened.

    -- A field (wide) is added to make each row much wider.

    -- A 'small' number of (recent) rows have the EndDate set to NULL.

    -- This is test data. In the real data the average length of a period is about the same, but the spread of the periods is far greater and the maximum length of a period can be years.

    -- Also some 'older' entries have a NULL for the EndDate.

    -- Actual data has a bit more variation.

    The basis for this data was supplied by Jeff Moden.

    In real life the data has more variation (e.g. much longer periods).

    The query to optimise is:

    select sum(datalength(wide)) Some_field_information from testtable

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    -- Some operation on a field from the main table.

    -- Important part of the query is the selection on start and end.

    -- In real life there is more complexity. (More tables, more fields etc.)

    But also:

    select sum(datalength(wide)) Some_field_information from testtable

    where StartDate < '20060301' and (EndDate >= '20060227' or EndDate is NULL)

    Performance is measured with two metrics: the total time, and the amount of cache used to satisfy the query.

    Code I used for this was:

    SET STATISTICS TIME,IO ON;

    dbcc dropcleanbuffers -- Clear the data cache.

    dbcc freeproccache -- Clear the plan cache.

    DECLARE @starttime datetime = getdate()

    -----------------------------------The query to optimize.--------------------------------------

    select sum(datalength(wide)) Some_field_information from testtable

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    --------------------------------------------------------------------------------------------------

    -- Show time and the use of cache.

    SELECT '--' [--]

    ,DB_NAME() AS [Database Name]

    ,CAST(COUNT(*) * 8/1024.0 AS DECIMAL (10,2)) AS [Cached Size (MB)]

    ,convert(float,(GETDATE() - @starttime))*60*60*24 duration

    FROM sys.dm_os_buffer_descriptors WITH (NOLOCK)

    WHERE DB_NAME(database_id) = DB_NAME()

    GROUP BY DB_NAME(database_id)

    ORDER BY [Cached Size (MB)] DESC OPTION (RECOMPILE);

    Without indexes the query takes about 30 seconds and the cache is filled in my environment.

    With indexes the time gets shorter and the amount of used cache becomes less.

    I am still working on variations with indexes and variations of the query, to see the effects of this. I'll publish the results from this a bit later on.

    In the real database, most tables have a clustered index, often starting with the subscriber.

    This clustered index is used a lot. Sometimes there are extra indexes as well, so often another clustered index is not an option.

    Question:

    What are good queries/techniques/indexes for a table with start and end dates, where the selection is usually around the current date and occasionally a period in the past?

    Ben

  • Preliminary results:

    create index ix_helper01 ON #TestTable (StartDate,EndDate) INCLUDE (wide)

    create index ix_helper02 ON #TestTable (EndDate,StartDate) INCLUDE (wide)

    DECLARE @Now DATETIME = GETDATE()

    select sum(datalength(wide)) Some_field_information

    from #testtable WITH (INDEX (ucx_Stuff)) -- ucx_Stuff: presumably the clustered index from the original thread's table

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    SELECT DATEDIFF(MILLISECOND, @Now,GETDATE()) -- 390

    GO

    DECLARE @Now DATETIME = GETDATE()

    select sum(datalength(wide)) Some_field_information

    from #testtable WITH (INDEX (ix_helper01))

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    SELECT DATEDIFF(MILLISECOND, @Now,GETDATE()) -- 310

    GO

    DECLARE @Now DATETIME = GETDATE()

    select sum(datalength(wide)) Some_field_information

    from #testtable WITH (INDEX (ix_helper02))

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    SELECT DATEDIFF(MILLISECOND, @Now,GETDATE()) -- 16

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Created and used indexes.

    After each test with an index, the index is removed.

    1 create index IX_forced ON TestTable (StartDate, EndDate) INCLUDE (wide)

    2 create index IX_forced ON TestTable (EndDate,StartDate) INCLUDE (wide)

    3 create index IX_forced ON TestTable (EndDate,StartDate)

    4 CREATE UNIQUE INDEX IX_forced ON TestTable(EndDate, StartDate ,SubscriberID, SubscriptionID);

    5 CREATE CLUSTERED INDEX IX_forced ON dbo.TestTable(EndDate, StartDate, SubscriberID, SubscriptionID)

    6 CREATE UNIQUE INDEX IX_forced ON TestTable(EndDate, StartDate ,SubscriberID, SubscriptionID)

    No  Build time (s)  2016 cache (MB)  2016 time (s)  2006 cache (MB)  2006 time (s)  Reserved (KB)
    1   193.377         2480.93          29.74          946.78           13.59          5249184
    2   187.013         8.55             0.18           1542.70          17.07          5249120
    3   30.606          19.76            0.43           203.57           2.37           3004376
    4   44.746          7.02             0.36           233.62           3.09           3082968
    5   190.77          8.94             0.22           1577.47          20.19          2594832
    6   44.66           6.86             0.37           233.45           2.76           2960536

    Build time is the time to build the index, in seconds.

    2016/2006 cache is the amount of cache used after running the query, in MB.

    2016/2006 time is the time used to execute the query, in seconds.

    Reserved is the amount of space reserved for the table, in KB.

    Wide field.

    This was added to the 'original' table to get a wider (more realistic) row size.

    In the selection it was used so that the query has to read the main rows.

    (The remark about this was lost in editing the first message. Oops, sorry.)

    2016

    The selection in 2016 is near the end of all periods. This represents a large number of queries.

    The majority of the queries will be done around the current date.

    The performance of this query will probably suffer if the startdate can be NULL.

    2006

    It covers the same length of period as the 2016 selection, but some time in the past.

    It represents any query about a period in the past. This can be a few days, a week, a month.

    The performance of this query suffers because the end of the period can be NULL.

    The performance of this query will also suffer if long subscriptions are possible.

    A full covering index, or a clustered table.

    Specifically, a covering index or a clustered table will result in more cache usage for the 2006 queries.

    If similar queries are present but use a different index, there will probably be even more cache usage.

    Usage of cache was stated as an important metric.

    Number 6

    The index for number 6 is the same as for number 4, but the query was 'divided' into two distinct queries.

    select sum(datalength(wide)) Some_field_information from testtable WITH (INDEX (IX_forced))

    where StartDate < '20060301' and (EndDate >= '20060227')

    --

    select sum(datalength(wide)) Some_field_information from testtable WITH (INDEX (IX_forced))

    where StartDate < '20060301' and (EndDate is NULL)

    This split did not make a significant difference.
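
    One variation that might be worth adding to the test matrix (a sketch only; the index name is made up): a filtered index, available from SQL Server 2008 onwards, that holds just the open-ended rows, so the EndDate IS NULL half of the split becomes a seek over a very small structure.

    -- Hedged sketch: a filtered index for the open-ended subscriptions only.
    -- In the test data the NULL EndDates are a subset of the recent rows,
    -- so this index stays far smaller than a full (EndDate, StartDate) index.
    CREATE NONCLUSTERED INDEX IX_OpenEnded
    ON dbo.TestTable (StartDate)
    INCLUDE (wide)
    WHERE EndDate IS NULL;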

    Selections over a period in the past remain 'slow' (?).

    Are there improvements to make the selection over a period in the past faster?

    Ben

  • The fastest index for the 2006 query was index 3, which isn't covering. Can you post the execution plan you get for this?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Here is the 'showplan', BUT, see below:

    StmtText

    |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1013]=(0) THEN NULL ELSE [Expr1014] END))
         |--Stream Aggregate(DEFINE:([Expr1013]=COUNT_BIG([Expr1005]), [Expr1014]=SUM([Expr1005])))
              |--Compute Scalar(DEFINE:([Expr1005]=datalength([TestDb].[dbo].[TestTable].[wide])))
                   |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000], [Expr1012]) WITH UNORDERED PREFETCH)
                        |--Sort(ORDER BY:([Expr1011] ASC))
                        |    |--Compute Scalar(DEFINE:([Expr1011]=BmkToPage([Bmk1000])))
                        |         |--Index Scan(OBJECT:([TestDb].[dbo].[TestTable].[IX_forced]))
                        |--RID Lookup(OBJECT:([TestDb].[dbo].[TestTable]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

    I did not add the :( smileys; those should be read as ': ('. 😛

    Because of further testing, the metrics have changed.

    2016 is now 862.25 MB and 38.69 seconds.

    2006 is now 976.39 MB and 39.5067 seconds.

    Why?

    In the first message I wrote:

    In the real database, most tables have a clustered index, often starting with the subscriber

    -- Some operation on a field from the maintable.

    So I tried to emulate that with a clustered index on SubscriberID, SubscriptionID.

    The timings for this were 'bad'. Even after removing the clustered index the timings remained bad.

    So I am looking to improve these results. How?

    Suggestions are welcome.

    I'll rebuild the 'original' table I tested with for situation 3, and show the plan for that.

    I am also looking for improvements for a 'clustered' table, but I need a test set where the spread of the data is more like the real data set. (I need some time to think of something clever to get a more realistic spread of the subscriber data over time.)

    Ben

    Our situation:

    Most operations on the database are subscriber 'centric'. Once a subscriber enters our facility he/she gets a number of subscriptions. So there is a natural clustering of subscriptions on time and subscriber. So we cluster all tables on subscriber (and subscription or another identity).

    Most operations touch only a limited number of subscribers, so clustering on subscriber works very well. Because data is clustered on subscriber, the data for a subscriber is 'compact' and not spread over the table, resulting in a small number of disk accesses and a small cache footprint for almost all operations.

    So Friday I was getting very hopeful results, up to the moment I introduced the clustering. Because the clustering works so well for most of our operations, we do not want to leave this technique.

    The current test set does not represent the actual situation, because real data is naturally clustered on subscriber and time.

    For clarification only:

    Once a subscriber enters our 'resort', the subscriber uses the facilities. Some subscribers come only once. Some subscribers return every year one or a few times. Some subscribers come for example most weekends (but use less of the resources we offer).

    10 to 20 percent of the subscribers are responsible for over 80 percent of the data.

  • Rebuild the table from scratch.

    Timing with No 3.

    'No 3: create index IX_forced ON TestTable (EndDate, StartDate)' (nonclustered)

    2016

    Cache 20.04 MB

    Time 0.4066 secs

    Showplan (the execution plan is also attached as a file; see below).

    StmtText

    select sum(datalength(wide)) Some_field_information from testtable WITH (INDEX (IX_forced)) where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    StmtText

    |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
         |--Stream Aggregate(DEFINE:([Expr1026]=COUNT_BIG([Expr1005]), [Expr1027]=SUM([Expr1005])))
              |--Compute Scalar(DEFINE:([Expr1005]=datalength([TestDb].[dbo].[TestTable].[wide])))
                   |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000], [Expr1025]) OPTIMIZED WITH UNORDERED PREFETCH)
                        |--Compute Scalar(DEFINE:([Expr1024]=BmkToPage([Bmk1000])))
                        |    |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1018], [Expr1019], [Expr1020]))
                        |         |--Merge Interval
                        |         |    |--Sort(TOP 2, ORDER BY:([Expr1021] DESC, [Expr1022] ASC, [Expr1018] ASC, [Expr1023] DESC))
                        |         |         |--Compute Scalar(DEFINE:([Expr1021]=((4)&[Expr1020]) = (4) AND NULL = [Expr1018], [Expr1022]=(4)&[Expr1020], [Expr1023]=(16)&[Expr1020]))
                        |         |              |--Concatenation
                        |         |                   |--Compute Scalar(DEFINE:([Expr1013]=NULL, [Expr1014]=NULL, [Expr1012]=(60)))
                        |         |                   |    |--Constant Scan
                        |         |                   |--Compute Scalar(DEFINE:([Expr1016]='2016-02-27 00:00:00.000', [Expr1017]=NULL, [Expr1015]=(22)))
                        |         |                        |--Constant Scan
                        |         |--Index Seek(OBJECT:([TestDb].[dbo].[TestTable].[IX_forced]), SEEK:([TestDb].[dbo].[TestTable].[EndDate] > [Expr1018] AND [TestDb].[dbo].[TestTable].[EndDate] < [Expr1019]), WHERE:([TestDb].[dbo].[TestTable].[StartDate]<'2016-03-01 00:00:00.000') ORDERED FORWARD)
                        |--RID Lookup(OBJECT:([TestDb].[dbo].[TestTable]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

    2006

    Cache 203.94 MB

    Time 2.52667 secs

    Showplan:

    StmtText

    select sum(datalength(wide)) Some_field_information from testtable WITH (INDEX (IX_forced)) where StartDate < '20060301' and (EndDate >= '20060227')

    StmtText

    |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [Expr1013]=(0) THEN NULL ELSE [Expr1014] END))
         |--Stream Aggregate(DEFINE:([Expr1013]=COUNT_BIG([Expr1005]), [Expr1014]=SUM([Expr1005])))
              |--Compute Scalar(DEFINE:([Expr1005]=datalength([TestDb].[dbo].[TestTable].[wide])))
                   |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000], [Expr1012]) WITH UNORDERED PREFETCH)
                        |--Sort(ORDER BY:([Expr1011] ASC))
                        |    |--Compute Scalar(DEFINE:([Expr1011]=BmkToPage([Bmk1000])))
                        |         |--Index Seek(OBJECT:([TestDb].[dbo].[TestTable].[IX_forced]), SEEK:([TestDb].[dbo].[TestTable].[EndDate] >= '2006-02-27 00:00:00.000'), WHERE:([TestDb].[dbo].[TestTable].[StartDate]<'2006-03-01 00:00:00.000') ORDERED FORWARD)
                        |--RID Lookup(OBJECT:([TestDb].[dbo].[TestTable]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

    My current problem is that these results do not reproduce when the table is clustered on SubscriberID and SubscriptionID. For normal operation this clustering does produce good results.

    Now I need 'good' performance on queries equivalent to the 2016 query.

    (It would be nice to have 'good' performance on queries equivalent to the 2006 query as well.)

    Materialised view

    Still thinking about materialised views which only contain 'recent' subscriptions, so that a 2016 query can be performed fast. (Probably not covering, because then we would have to include all columns; we cannot predict which columns will be used.)
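
    For what it's worth, an indexed view can itself carry a WHERE clause, so a 'recent subscriptions only' view is possible in principle. The catch is that the predicate must be deterministic, so the cutoff has to be a literal that is moved forward by periodically redefining the view. A sketch under those assumptions (view and index names invented; CONVERT with style 112 keeps the date literal deterministic):

    CREATE VIEW dbo.Vtesttable_current
    WITH SCHEMABINDING AS
    SELECT SubscriptionID, SubscriberID, StartDate, EndDate, wide
    FROM dbo.testtable
    WHERE EndDate >= CONVERT(datetime, '20160101', 112) -- fixed cutoff, must be a literal
       OR EndDate IS NULL;
    GO
    CREATE UNIQUE CLUSTERED INDEX IDX_Vcurrent
    ON dbo.Vtesttable_current (EndDate, StartDate, SubscriberID, SubscriptionID);

    Here wide is carried along only to mirror the test query; as noted above, in production the useful columns cannot be predicted, which limits this approach.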

    Distinct queries for the 2016 and the 2006 case are not preferable, but if that is the only solution it would be acceptable.

    Ben

    (Edit: execution plan added as a file.)

    (Current testing environment:

    -- SQL Server 10.50.2550.0 - SP1 (Developer Edition (64-bit))

    )

  • Hi Ben

    A few questions for you based on the information you've posted so far.

    The model system you have posted - does this model a production system or a system in development?

    Assuming it's in development:

    How representative of the queries in the system is the standard query you're using for testing? This one:

    select sum(datalength(wide)) Some_field_information from testtable

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    I'm asking this because you mention in the text that subscriberid will almost always be one of the searched columns, yet it's missing from the standard query.

    There's a somewhat "shotgun" approach to indexing; with and without a clustered index, non-covering ordinary indexes, etc. If you don't already have a clustered index on the table, then create one and leave it while you experiment with non-clustered indexes. Your choice of keys for the clustered index is important. If the table has an ID column for this purpose (SubscriptionID?) then a relatively safe option would be to use it for the cluster key. If not (or if you prefer natural keys) then a likely candidate based upon your posts so far might be subscriberid, startdate, enddate or subscriberid, enddate, startdate.
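
    For concreteness, the options described above would look something like this against the test table (a sketch; index names invented, only one natural-key variant shown):

    -- Option A: ID-based cluster key - a relatively safe default.
    CREATE UNIQUE CLUSTERED INDEX CX_TestTable
    ON dbo.TestTable (SubscriptionID);

    -- Option B: natural key matching the subscriber-centric access pattern.
    -- Appending SubscriptionID keeps the key unique explicitly, instead of
    -- relying on the hidden uniquifier SQL Server would otherwise add.
    -- CREATE UNIQUE CLUSTERED INDEX CX_TestTable
    -- ON dbo.TestTable (SubscriberID, StartDate, EndDate, SubscriptionID);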

    Assuming that your table is in a production db, then there are better options than testing a random query against an artificial copy of it: you can examine the usage of existing indexes and the logging of missing indexes from the production db using some very standard methods. Armed with the information they provide, your indexing strategy is a lot less hit or miss and probably much faster too. Here's a great place to start.
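
    The 'very standard methods' mentioned are presumably the index DMVs. A minimal sketch of the missing-index side (treat the output as per-query hints to evaluate, not as a design):

    -- Hedged sketch: missing-index suggestions logged since the last restart.
    SELECT d.statement AS table_name,
           d.equality_columns, d.inequality_columns, d.included_columns,
           s.user_seeks, s.avg_total_user_cost, s.avg_user_impact
    FROM sys.dm_db_missing_index_details AS d
    JOIN sys.dm_db_missing_index_groups AS g ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
    WHERE d.database_id = DB_ID()
    ORDER BY s.avg_total_user_cost * s.user_seeks DESC;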


    Low-hanging fruit picker and defender of the moggies

    For better assistance in answering your questions, please read this.

    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • We have a number of 'models':

    Legacy: Operating system and DBMS built in-house, running since 1973.

    SQL-server 1: An in-house designed/built database (from the second half of the '90s).

    SQL-server 2: Designed and built outside 'our' (my) control.

    Legacy.

    (Started in 1973 on a PDP-11, then supporting 400 concurrent users. Later on a VAX, now under Linux)

    Still in production; we still maintain the operating system and DBMS. There has not been much development on them for years, and although the system works extremely efficiently it is 'becoming' outdated. (The flexibility of the system is limited.)

    SQL-server 1.

    Has been designed from the start with clustered indexes on the subscriberid, and this has been built upon. This system still works very efficiently on small systems with a large number of concurrent users. We have never had to tune the database for performance.

    So we want to keep the clustered indexes on the subscriberid, because the performance is really good and the usage of the cache is really efficient.

    SQL-server 2.

    Is not under our control; we have little influence on this design. Almost all subscriber tables are implemented as heaps, and depending on location size they can be partitioned. As said, we have very limited influence on the design and the implementation; even getting an extra index is often a problem.

    All three systems are designed with OLTP in mind. Most operations, both selection and DML, are geared towards the subscriberid being known.

    But requirements are changing.

    Users want more reports and worklists. Worklists require current data and real-time generation. Often the selection is on a period.

    Mostly a recent period, sometimes a period in the past.

    The model came from another thread:

    overlapping date ranges

    This model has some similarities with our systems/requirements/etc., more specifically for the reports and worklists which have to be created.

    It does have a subscriber, and it does have begin and end dates. In reality the complexity is far greater, with some reports accessing something like 10 tables, where some tables are accessed more than once (sometimes up to six times).

    Part of the selection is often the date(time) range, most often current or fairly recent. The model used came 'close' to the situation where we want to select on periods: a fast selection on a period, on a table (situation 1 a clustered table, situation 2 a heap), where the table cannot be changed but an index or an indexed view is a possibility.

    This is a requirement I have seen (and tuned for) in the past and that will occur again in the future. So although the model was only a very simplified version of ours, it was a good 'in vitro' technique to gain some knowledge for future tuning efforts.

    On a growing system, a full table scan does 'invalidate' the cache and might take a long time. The table might even be larger than the cache is.

    Selections for worklists quite often result in a full table scan, not only performing slowly for the worklist itself, but, because of the large cache footprint, often slowing down other queries as well.

    As for now, for the heap table some tested indexes work fairly well.

    But for the clustered table, in this model none of the indexes did well. To be fair, in the model the generated data was spread independently over time and subscribers, where in real life this data is far from random.

    Building a model with a realistic spread of data is time consuming. I am thinking about it, but it is not easy.

    With this exercise I hope to get some insight into how I can build this type of query with at least acceptable performance.

    Ben

    Note.

    On our legacy system (on the PDP-11) we were extremely careful with resources. Data was kept extremely compact, and disk accesses were really kept to a minimum. For example, getting to the main 'row' of a subscriber took only one disk access. Getting all the data from about 200 tables for one subscriber took only 1 to 10 disk accesses in total. Reporting was done at night, because reporting over all subscribers was less efficient.

    Designing the database in an RDBMS, we used clustering on subscriberID to keep the number of disk accesses needed to get the information of a subscriber to a minimum. This still works very well. But for worklists and reports we need some 'NEW' efficiency.

  • Just to clarify, Ben - it's this system you are attempting to tune:

    ben.brugman (2/23/2016)


    SQL-server 1.

    Has been designed from the start with clustered indexes on the subscriberid, and this has been built upon. This system still works very efficiently on small systems with a large number of concurrent users. We have never had to tune the database for performance.

    So we want to keep the clustered indexes on the subscriberid, because the performance is really good and the usage of the cache is really efficient.

    If so, here are a couple of hints.

    Keep the clustered indexes as they are because they are known to support your current workload, and examine usage (and non-usage) of the non-clustered indexes using the index DMVs. Invest some time into interpreting and understanding the results and if necessary resort to an industrial-strength indexing reference work such as this one. I'd recommend you brush up on covering indexes to begin with. You may find any or all of the following: ordinary indexes which are near duplicates where one can be dropped; ordinary indexes which are not used (or rarely used - but check impact); missing indexes which can be met by adding a column or two to INCLUDEs; missing indexes which have to be created. This single exercise will give you a broad picture of how your indexing strategy works for the majority of the queries hitting the server now - and will ensure that you are sufficiently well informed to tackle changes as they arise.
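
    A minimal sketch of that usage check (the counters are cumulative since the last restart, so judge them over a representative period):

    -- Hedged sketch: reads versus writes per index. Indexes with many
    -- user_updates but no seeks/scans/lookups are candidates for review.
    SELECT OBJECT_NAME(i.object_id) AS table_name,
           i.name AS index_name,
           u.user_seeks, u.user_scans, u.user_lookups, u.user_updates
    FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS u
           ON u.object_id = i.object_id
          AND u.index_id = i.index_id
          AND u.database_id = DB_ID()
    WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    ORDER BY ISNULL(u.user_seeks, 0) + ISNULL(u.user_scans, 0) + ISNULL(u.user_lookups, 0);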

    Forget about the odd focus on "usage of the cache" and focus instead on physical and logical reads - both of these metrics are available on a live system without poking around with server-wide settings and are well understood and documented. You'll get much the same information but with much finer granularity and you won't have to worry about interference from other queries. Single-query metrics are easy using SET STATISTICS IO ON / OFF before and after a query.
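
    For example, per-query metrics for the thread's standard query could be captured like this (just the familiar pattern applied to the test query, nothing more):

    SET STATISTICS IO, TIME ON;

    SELECT SUM(DATALENGTH(wide)) AS Some_field_information
    FROM dbo.testtable
    WHERE StartDate < '20160301'
      AND (EndDate >= '20160227' OR EndDate IS NULL);

    SET STATISTICS IO, TIME OFF;
    -- The Messages tab then reports logical, physical and read-ahead reads per
    -- table, plus CPU and elapsed time, with no server-wide cache flushing.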

    Finally, get used to examining and interpreting execution plans.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@home (2/22/2016)


    Hi Ben

    The model system you have posted - does this model a production system or a system in development?

    Assuming it's in development:

    The model (from another thread) is to learn what to do in specific situations.

    How representative of the queries in the system is the standard query you're using for testing? This one:

    select sum(datalength(wide)) Some_field_information from testtable

    where StartDate < '20160301' and (EndDate >= '20160227' or EndDate is NULL)

    I'm asking this because you mention in the text that subscriberid will almost always be one of the searched columns, yet it's missing from the standard query.

    The period selection is a report/worklist query. This does not represent most queries; it is an 'exception'. But it can cause some problems: one is the runtime of the selection, but very important is the footprint it leaves in the cache, which can invalidate a large part or all of the cached data.

    There's a somewhat "shotgun" approach to indexing; with and without a clustered index, non-covering ordinary indexes, etc.

    This is definitely a "shotgun" approach in a 'lab' environment: try as many different approaches (with educated designs and less educated designs) and see how they perform, to gain knowledge about the 'inner' workings and to learn which approaches can work.

    If you don't already have a clustered index on the table, then create one and leave it while you experiment with non-clustered indexes. Your choice of keys for the clustered index is important. If the table has an ID column for this purpose (SubscriptionID?) then a relatively safe option would be to use it for the cluster key. If not (or if you prefer natural keys) then a likely candidate based upon your posts so far might be subscriberid, startdate, enddate or subscriberid, enddate, startdate.

    With our approach, the subscriberid (or actually a derivative of it) works very well for the actual OLTP actions.

    Assuming that your table is in a production db, then there are better options than testing a random query against an artificial copy of it: you can examine the usage of existing indexes and the logging of missing indexes from the production db using some very standard methods. Armed with the information they provide, your indexing strategy is a lot less hit or miss and probably much faster too.

    For our situation I do not agree with the above. From our experience with the legacy DB, we have knowledge about the data and how it is used. Designing and building the database and indexes based on that information has given us far better results than the 'missing indexes' suggestions have ever given us. And yes, this thread is a 'shotgun' approach to see if we can use techniques which improve the report/worklist queries without affecting the OLTP actions. This forum has already been very valuable to me, sometimes by showing solutions which were new to me, but also often by showing that I should be content with the solution I have. And I am still hoping for a 'brilliant' solution I haven't thought of yet. Although I have tested the indexed view on the 'random' model and that did not give satisfactory results, I hope that for a set of real data it will give satisfactory results.

    Thanks for your time and attention.

    Ben

  • ChrisM@Work (2/23/2016)


    Just to clarify, Ben - it's this system you are attempting to tune:

    No, not limited to this system. More to get a good grip on range queries. (With variations.)

    this one.

    Most of this is familiar, but for the given problem I should have a closer look at 'Stairway to SQL Server Indexes: Level 7, Filtered Indexes'. Thanks for the link.

    And thanks for the link:

    'Exploring Recursive CTEs by Example Dwain Camps'

    Ben

  • Found a solution with an indexed view.

    Both selections on a recent period and on a period in the past perform fast.

    The indexed view:

    CREATE VIEW dbo.Vtesttable_recent

    WITH SCHEMABINDING AS

    SELECT

    SubscriptionID

    ,SubscriberID

    ,StartDate

    ,EndDate

    ,wide

    ,convert(bit, CASE

    when convert(int,EndDate-StartDate)< 30 then 0

    else 1

    end) period_length

    FROM dbo.testtable

    go

    CREATE UNIQUE CLUSTERED INDEX IDX_V

    ON dbo.Vtesttable_recent(period_length,enddate,startdate,subscriberID,SubscriptionID)
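
    One hedged usage note: on Enterprise-class editions (including the Developer Edition used for the tests below) the optimizer can match the indexed view automatically, but on Standard Edition the view's index is only used when the view is referenced with NOEXPAND, along these lines:

    -- Hedged sketch: force the view's clustered index on editions without
    -- automatic indexed-view matching (dates from the 2016 test window).
    SELECT SUM(DATALENGTH(wide)) AS Some_field_information
    FROM dbo.Vtesttable_recent WITH (NOEXPAND)
    WHERE period_length = 0
      AND EndDate >= '20160227' AND EndDate <= DATEADD(dd, 30, CONVERT(datetime, '20160301'))
      AND StartDate <= '20160301';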

    The code to be tested:

    dbcc dropcleanbuffers

    dbcc freeproccache

    DECLARE @starttime2 datetime = getdate()

    DECLARE @startofperiod datetime = '20160227'

    DECLARE @endofperiod datetime = '20160301'

    DECLARE @endofscan datetime = @endofperiod+30

    SELECT @startofperiod startofperiod, @endofperiod endofperiod, @endofscan endofscan

    select sum(datalength(wide)) Some_field_information from Vtesttable_recent

    where (period_length =0 and enddate <= @endofscan and enddate >= @startofperiod and startdate <= @endofperiod)

    UNION

    select sum(datalength(wide)) Some_field_information from Vtesttable_recent

    where (period_length = 1 and startdate <= @endofperiod and (enddate >= @startofperiod or enddate is null))

    UNION

    select sum(datalength(wide)) Some_field_information from Vtesttable_recent

    where ((@startofperiod+30) <= @endofperiod) and

    (startdate <= @endofperiod and (enddate >= @startofperiod or enddate is null))

    --

    -- Part 1

    -- Takes care of the bulk of the rows. All subscriptions with a period length of less than 30 days are handled here.

    -- This takes care of most of the subscriptions.

    -- Part 2

    -- Takes care of all subscriptions with a period length of 30 days or more.

    -- This takes care of the open-ended subscriptions as well.

    -- Part 3

    -- Takes care of the situation where the selection period is longer than 30 days.

    -- Be careful: this overlaps the other two queries.

    -- This should be solved according to what is needed.

    --

    -- Metrics: both timing and cache use are determined.

    SELECT getdate(), '-- 2016' [--], DB_NAME() AS [Database Name]

    ,CAST(COUNT(*) * 8/1024.0 AS DECIMAL (10,2)) AS [Cached Size (MB)]

    ,convert(float,(GETDATE() - @starttime2))*60*60*24 duration

    ,convert(varchar(30),@startofperiod,126) startofperiod,convert(varchar(30),@endofperiod,126) endofperiod,''

    FROM sys.dm_os_buffer_descriptors WITH (NOLOCK)

    WHERE DB_NAME(database_id) = DB_NAME()

    GROUP BY DB_NAME(database_id)

    ORDER BY [Cached Size (MB)] DESC OPTION (RECOMPILE);

    Timings.

    Timings for the -- 2016 selection are around 0.25 to 0.45 seconds (with a cache usage of 9.38 MB).

    Timings for the -- 2006 selection are around 0.50 to 0.75 seconds (with a cache usage of 21.49 MB).

    Longer Periods.

    Selections over a longer period (more than 30 days) became fast as well, both recent and further in the past.

    A selection of just over a month took around 0.50 to 0.80 seconds.

    A selection of just over a year took around 2.8 seconds.

    Using an OR instead of the UNION

    A similar query with an OR instead of the UNION took about 33 seconds (with a cache usage of 2525 MB, almost all of the cache on the test machine).

    So here replacing the OR with a UNION does improve the performance significantly.

    Remark: the UNION does not give exactly the same result. This is not addressed here. (I am lazy, and it complicates the query a bit.)
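
    Since parts 1 and 2 are disjoint (period_length 0 versus 1), one way to get an exact total for selection periods of up to 30 days is to add the partial sums with UNION ALL instead; a sketch, not tested against the data above:

    DECLARE @startofperiod datetime = '20160227'
    DECLARE @endofperiod datetime = '20160301'
    DECLARE @endofscan datetime = @endofperiod+30
    -- UNION ALL keeps both partial sums even when they happen to be equal;
    -- the outer SUM then adds them into one exact total.
    SELECT SUM(part_sum) AS Some_field_information
    FROM (
        SELECT SUM(DATALENGTH(wide)) AS part_sum
        FROM Vtesttable_recent
        WHERE period_length = 0 AND EndDate >= @startofperiod
          AND EndDate <= @endofscan AND StartDate <= @endofperiod
        UNION ALL
        SELECT SUM(DATALENGTH(wide))
        FROM Vtesttable_recent
        WHERE period_length = 1 AND StartDate <= @endofperiod
          AND (EndDate >= @startofperiod OR EndDate IS NULL)
    ) AS parts;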

    Conclusion: even with a selection on two 'parameters', it is possible, with some effort, to get good performance.

    Everybody, thanks for stimulating me in finding this solution. :-):-D:-)

    Ben

    Reasoning

    Most subscriptions are less than 30 days in length.

    For these subscriptions only a 'limited' scan of the index is needed. The search period has to be 'extended' by 30 days to cover all subscriptions that are less than 30 days long. In the query this is taken care of by @endofscan: if the EndDate is after that value, the row cannot be a subscription of less than 30 days that falls in the selected period.

    With the indexed view, the clustered index seek covers only a small part of the total index (30 to 60 days out of a total of 20 years).

    A limited number of subscriptions are longer than 30 days. A fair estimate for our current data is that this is less than 1 percent.

    The second part of the UNION takes care of these, so less than 1 percent of the total index gets scanned.

    The above does not take care of selections over a period of more than 30 days; the third part handles that. For this part we are less interested in performance.

    (Performance is good there too, which surprises me. I haven't studied this in detail.)

    Motivation

    A worklist which took minutes to complete and which 'invalidated' the complete cache. This worklist was produced several times each day and affected the performance of the other queries. Now I have one more tool to solve this type of problem.

    Then you should try columnstore indexes too - another weapon in your armoury. I think your sample query could benefit a great deal from a columnstore index.
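
    For reference, on SQL Server 2012 or later (the test instance above is 2008 R2, so not there yet) a nonclustered columnstore index would be created along these lines; note that on 2012 the table becomes read-only while the index exists (a sketch, index name invented):

    -- Hedged sketch: batch-mode scans over just the referenced columns can
    -- make an aggregate-over-date-range query like this one much cheaper.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_TestTable
    ON dbo.TestTable (SubscriberID, StartDate, EndDate, wide);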

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
