Querying a large table with an odd criteria

Question

Querying a large table with an odd criteria

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 1

Craig Farrell (3/31/2011)
opc.three (3/31/2011)
You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior.
OPC, you are aware that an un-specified Primary Key definition is the clustered index by default, correct? It's PRIMARY KEY CLUSTERED. You only add on NONCLUSTERED when you want it differently. Thus, you're both saying the same thing, just very different approaches and I fear yours might confuse the newbie.

Yes 😛 I do, please try the code sample in my last post, it illustrates the exact point you and Wayne were both picking up on. We are all on the same page.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 2

opc.three (3/31/2011)
Craig Farrell (3/31/2011)
opc.three (3/31/2011)
You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior.
OPC, you are aware that an un-specified Primary Key definition is the clustered index by default, correct? It's PRIMARY KEY CLUSTERED. You only add on NONCLUSTERED when you want it differently. Thus, you're both saying the same thing, just very different approaches and I fear yours might confuse the newbie.
Yes 😛 I do, please try the code sample in my last post, it illustrates the exact point you and Wayne were both picking up on. We are all on the same page.

Okay, just making sure. The wording in the second post I wasn't sure if you meant the Primary Key would be used instead of the non-existant clustered index as the 'default behavior', or that it IS the clustered index as the 'default behavior'.

That and I've been working for 11 days straight, forgive me if I'm a bit dense or missed significant nuances... but at the moment I'm probably understanding as much as the poor new guy. :Whistling:

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

tfifield SSCrazy Eights Points: 9655 More actions · Answer 3

Giles,

As Kevin pointed out you're probably not going to get anything better than a table scan from the way your data is distributed in the table. You want to make that one pass as optimum as possible and never touch upon any row in the table more than once.

Something like this should give you all contacts that have only Email sent as their activity:

; WITH CTEContacts AS

( SELECT ContactID

, SUM(CASE WHEN AuditActivityName = 'Email Sent' THEN 1 ELSE 0 END) AS EMailCount

, SUM(CASE WHEN AuditActivityName <> 'Email Sent' THEN 1 ELSE 0 END) AS OtherCount

FROM Audits

WHERE g4c_auditdate > dateadd(yy,-1,getdate())

GROUP BY ContactID

)

SELECT ContactID

FROM CTEContacts

WHERE EMailCount > 0

AND OtherCount = 0

Todd Fifield

WayneS SSC Guru Points: 95461 More actions · Answer 4

opc.three (3/31/2011)
Wayne, I hear what you're saying, and you are absolutely correct, but it's not exactly relevant to my comment. I concede I could have worded my post better but you took my words out of context. Please re-read my post. You'll find that I was commenting on the DDL in the initial post creating the Audits table. You'll notice that a clustered index was not defined as part the CREATE TABLE statement and I was pointing out that in the absence of a, [explicitly] defined clustered index SQL Server will use the defined PRIMARY KEY, in this case AuditID, as the clustering index which is default behavior. I pointed this out in case it was not intended to have a column with data type UNIQUEIDENTIFIER as a single-column clustered index on the table...this was before I knew it was not the original poster's table design.

Okay, I see what you're saying and where you're coming from. To me, it wasn't that obvious - and part of that might have been that I wasn't reading everything in fine detail. After reading the rest of this thread, I do think that we're all on the same page.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 5

TheSQLGuru (3/31/2011)
3) no index will be helpful here if you plan on checking 80% of the data in the table (50M/62M) UNLESS the number of email sent rows is VERY LOW percentage of total. Very low in this case meaning like 1% or less. Outside of that, table scan will be (properly) chosen by the optimizer.

I did some quick tests using Red Gate data generator...Granted the count is non-trivial, however still not in nearing the 62MM we're discussing with the OP, but I wanted to see if I could get a general idea of what was happening with the estimated plans and maybe take away some general understanding from the analysis. Hopefully the comparison is not complete garbage....

I created an Audit table using the DDL in the OP and poopulated it with data using Data Generator fitting these stats:

> count: 1,550,200 (enough to get comparable plans with OP dataset?)

>count where AuditDate > 1 year: 352,233 (approx % of OP: 1550/352 ~= 62MM/12MM)

>count where AuditDate < 1 year: 1,197,967 (approx % of OP: 1550/1197 ~= 62MM/50MM)

>count where AuditDate < 1 year and Activity = 'Email Sent': 100,007 (OP said there were 12 activities so guessed 1/12 of set but this may be the biggest unknown, i.e. the % of rows newer than a year with 'Email Sent')

I then pulled some estimated plans and the query using the CTE was less costly by ~2/1 over the NOT EXISTS...learned something there for sure. I then added the covering index in my post above and the estimated plans then showed that the optimizer chose a plan that utilized the covering index making the NOT EXISTS outperform the CTE by ~2/1, which is what I was expecting to see.

<just_trying_to_learn>How would a complete table scan be better than a partial index scan on a covering non-clustered index that is more narrow than the table? Are you saying for large sets the optimizer will produce a plan that bypasses the index and will just do a full scan? I have heard of full table scans being picked when querying small lookup tables even when an index exists because the choice is negligible, however not on a very large table unless it's a SELECT * with no filter.</just_trying_to_learn>

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 6

Added attachments.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 7

opc.three (4/1/2011)
TheSQLGuru (3/31/2011)
3) no index will be helpful here if you plan on checking 80% of the data in the table (50M/62M) UNLESS the number of email sent rows is VERY LOW percentage of total. Very low in this case meaning like 1% or less. Outside of that, table scan will be (properly) chosen by the optimizer.
I did some quick tests using Red Gate data generator...Granted the count is non-trivial, however still not in nearing the 62MM we're discussing with the OP, but I wanted to see if I could get a general idea of what was happening with the estimated plans and maybe take away some general understanding from the analysis. Hopefully the comparison is not complete garbage....
I created an Audit table using the DDL in the OP and poopulated it with data using Data Generator fitting these stats:
> count: 1,550,200 (enough to get comparable plans with OP dataset?)
>count where AuditDate > 1 year: 352,233 (approx % of OP: 1550/352 ~= 62MM/12MM)
>count where AuditDate < 1 year: 1,197,967 (approx % of OP: 1550/1197 ~= 62MM/50MM)
>count where AuditDate < 1 year and Activity = 'Email Sent': 100,007 (OP said there were 12 activities so guessed 1/12 of set but this may be the biggest unknown, i.e. the % of rows newer than a year with 'Email Sent')
I then pulled some estimated plans and the query using the CTE was less costly by ~2/1 over the NOT EXISTS...learned something there for sure. I then added the covering index in my post above and the estimated plans then showed that the optimizer chose a plan that utilized the covering index making the NOT EXISTS outperform the CTE by ~2/1, which is what I was expecting to see.
<just_trying_to_learn>How would a complete table scan be better than a partial index scan on a covering non-clustered index that is more narrow than the table? Are you saying for large sets the optimizer will produce a plan that bypasses the index and will just do a full scan? I have heard of full table scans being picked when querying small lookup tables even when an index exists because the choice is negligible, however not on a very large table unless it's a SELECT * with no filter.</just_trying_to_learn>

1) Estimated plans are just that - estimates. The ACTUAL rownumbers and costs of those plans when executed can be many orders of magnitude off. I encourage you to actually run them and see if that happens. The ACTUAL plan can turn out to be disastrously bad if poor stats/estimates got that plan generated but the data didn't reflect those estimates.

2) If a COVERING nonclustered index exists and the optimizer has enough information to make a choice to use it (usually the case) then my comment about the optimizer switching from index seek/bookmark lookup for more than a VERY low percentage of the table is moot: there is no bookmark lookup in a covering index scenario.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service