group by optimize

Question

Post reply

group by optimize

jraha

SSC Eights!

Points: 817
More actions
June 18, 2003 at 3:07 pm

#106578

I have a table, tShortTerm, that I need to store a crunched down version of for the long term, tLongTerm.
So the query that crunches it down looks like this:
insert into tLongTerm
select userId, max(date) from tShortTerm
group by userId
On tShortTerm I have a unique clustered index on userId and date desc but looking at the execution plan, I can see that it always does a scan (not a seek) on the clustered index.
I thought that if the unique clustered index was on userId and date desc that maybe query optimizer would be smart enough not to search further once it found the first row with a new userId since that row would be the row with the max(date) for that userID??
How can I make a query with a group by do a seek?
Thanks,
-J

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

mworku Ten Centuries Points: 1311 More actions · Answer 1

In most cases, Index selection is based on SARGs

which is determined by your [WHERE] clause.

The rule is that for an index to be considered by the optimizer, at least

the first column of the index must match the SARG

In your case you can write it the query

select userId, max(date)

from tShortTerm

Where userid IS NOT NULL -- SARG

group by userId

MW

stax68 SSChampion Points: 11711 More actions · Answer 2

The covering index (all fields required are in the index) will be helping performance compared to a table scan, especially if the table is wide. This is because the query does not need to referenec the data pages individually, but ony the (more compact)information in the index, which is also sorted ready for the aggregating operation.

But you can't expect an index seek, as the query requires that every date value be inspected for every userid.

You could use a trigger to maintain your summary table, but this is very likely to impact the performance of updates.

This query doesn't look like a killer. It also looks like the kind of thing you might schedule to run at a quiet time (if there is one).

Tim

Tim Wilkinson

"If it doesn't work in practice, you're using the wrong theory"
- Immanuel Kant

Tab Alleman SSCommitted Points: 1590 More actions · Answer 3

quote:
In most cases, Index selection is based on SARGs
which is determined by your [WHERE] clause.

Forgive the newbility, but WTF is a SARG? I've never seen it in WROX, and BOL is giving me a blank stare.

NPeeters SSChampion Points: 12580 More actions · Answer 4

From the Sybase site.

quote:
SARG (search argument) the query's qualifying clause(s).

In other words, the WHERE clause ...

mworku Ten Centuries Points: 1311 More actions · Answer 5

quote:
Tim:
Index covering is for non-clustered indexes, but in this case the aggregate
function would force index covering b/c the optimizer in almost all cases
would resort to index covering whenever it encounters aggregates
quote:
talleman:
but WTF is a SARG?
What is the point

MW

Edited by - mworku on 06/19/2003 08:26:10 AM

Edited by - mworku on 06/19/2003 08:26:42 AM

MW

jraha SSC Eights! Points: 817 More actions · Answer 6

The results are in:

quote:
WTF is a SARG?

That's a searchable argument I think... when you look at the execution plan at the bottom of the bubbles that pop up over each icon there is a section that says arguments. It first lists OBJECT indicating what index it used. Then if a SARG is present it may list SEEK: for instance where userID = @userid. Based off of the where clause, the optimizer will guess which index to use.

I ran the following three statements:

insert into tImpressionViewThroughBuilder

select userID, max(time)

from tImpressionWithID1

where userId is not null

group by userID

truncate table tImpressionViewThroughBuilder

insert into tImpressionViewThroughBuilder

select userID, max(time)

from tImpressionWithID1

group by userID

There are no null's in the userId column so the where clause has no effect on the actual data returned however, it had a big effect on the execution plans.

A clustered index seek was performed in the first block. 58% of the query runtime

A clustered index scan was performed in the second block. 83% of the query runtime

The first then used parallelism when performing the stream aggregate (4 processors on this machine) while the second didn't use parallelism.

At the end the runtime percentage was 41% vs 59% of the total time which cut the query from 3 minutes 43 seconds to 2 minutes 7 seconds... (I know the math doesn't work out, but there are some other things going on on the server)

Thanks all, but I'll be posting some more. Hopefully, my learning can help you all out aswell.

-J

Edited by - jraha on 06/19/2003 08:52:42 AM

Calvin Lawson SSChampion Points: 11030 More actions · Answer 7

jraha's query forces the index to be used, but you may want to rethink that composite index.

For example, tShortTerm has a one to many relationship with tLongTerm, but as long as new rows are inserted into tShortTerm sequentially by date then you should add an identity column to tShortTerm and use that instead.

For example, say you have a composite clustered index on identity (column stid) and userid. (having stid first will speed up your inserts to tLongTerm, but having userid first would probably be more useful when selecting.)

insert into tLongTerm

Selectst1.userid, st1.date

From

(

select userId, max(stid) as maxStid

from tShortTerm (nolock)

Where stid is not null--only necessary if stid if 1st in index

Group by userid

) st

JOINtShortTerm st1 (nolock) on st.maxStid= st1.stid

Having a nonclustered index on date is most likely a good idea as well, but experiment. Stay away from max(date) as often as you can!

cl

Signature is NULL

clemoni SSC Veteran Points: 249 More actions · Answer 8

Thank you NPeters,

People love to use big vocabulary when it is unecessary. I am with you.

What a stupidity! Let's get real and let's speak in English! Only DBA with complex will start talking with weird language.

clemoni SSC Veteran Points: 249 More actions · Answer 9

Hey guys,

This is very interesting. The where statement influencing the quesry optimization.

Let's get real!!!! MIcrosoft is very stupid... they can have had forecasted that; adding a where statement just to improve the performance is a non-sense but well, I guess it pays...sometimes Microsoft gives me a big laugh

Tab Alleman SSCommitted Points: 1590 More actions · Answer 10

quote:
but as long as new rows are inserted into tShortTerm sequentially by date then you should add an identity column to tShortTerm and use that instead.

Good point! unless he also frequently has to query on a date-range, in which case I'd think you would want the clustered index on the date column, but a non-clustered index on an identity column would probably still be a good idea for the max() query even then.

Thanks for the idea!

mworku Ten Centuries Points: 1311 More actions · Answer 11

quote:
Hey guys,
This is very interesting. The where statement influencing the quesry optimization.
Let's get real!!!! MIcrosoft is very stupid... they can have had forecasted that; adding a where statement just to improve the performance is a non-sense but well, I guess it pays...sometimes Microsoft gives me a big laugh

Come down bro.

How would you do it ?

MW

Edited by - mworku on 06/20/2003 11:19:24 AM

MW

group by optimize

Cookies on SQLServerCentral