September 2, 2020 at 12:14 pm
Oh yeah, I forgot about the filter. It's actually the same filter as is used in my original query, but I forgot to put it back in when testing this stuff. It does include a lot of extra junk, though. Hang on...
Just tried it - good call. As soon as I added the condition in the filter, the query went to an index scan.
select EvidenceLetter , EvidenceNumber , EvidenceExtra
from (
select EvidenceLetter , EvidenceNumber , EvidenceExtra, row_number() over(partition by EvidenceLetter ORDER BY EvidenceNumber DESC, EvidenceExtra DESC) AS row_num
from dbo.Podrobnosti
where EvidenceLetter <> '1e'
) as evid_max
where row_num = 1
That index probably needs a look anyway. It's been on there a while and some of the conditions that applied when I first set it up have changed. It certainly no longer needs all those included columns.
It's still collecting and *sorting* all 120,000 rows, though, even from the index. That seems to me like a huge amount of unnecessary effort. The index is already ordered in the order in which I want the data collected. The subquery here has a reverse sort, which I get is necessary for the syntax of getting the 'last' (biggest) number, but it seems to me like there should be a cleaner way of getting to this. In a procedural language, I would simply read the data in the order in which it is already being retrieved, and note the last number when the first letter column changes. That seems like something SQL Server should be able to do, if I can figure out how to specify it properly.
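For what it's worth, the same 'last row per letter' requirement can be phrased a bit more compactly with TOP (1) WITH TIES ordering by the ROW_NUMBER expression - just a sketch of an alternative wording, since it produces essentially the same scan-and-sort plan:
select top (1) with ties EvidenceLetter , EvidenceNumber , EvidenceExtra
from dbo.Podrobnosti
where EvidenceLetter <> '1e'  -- same predicate as the filtered index
order by row_number() over(partition by EvidenceLetter ORDER BY EvidenceNumber DESC, EvidenceExtra DESC);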
September 2, 2020 at 12:26 pm
All the indexes show non-clustered. I'm sure I had a clustered index on this originally, but I've done a good bit of experimenting with this table and probably dropped it somewhere along the way. Do you think it would help? The query still specifies a reverse sort, and all that using a clustered index would do is substitute a table scan for the current index scan, which at least reads less data off disk. I also wonder if specifying this index as descending order would help. I'll give it a try.
September 2, 2020 at 12:31 pm
Jeez - I'm going senile. I read the index name wrong. Here is the correct index definition, the one it is actually using.
CREATE NONCLUSTERED INDEX [ix_AkcDruhaEvidence] ON [dbo].[Podrobnosti]
(
[AkcesAutoID] ASC,
[EvidenceLetter] ASC,
[EvidenceNumber] ASC,
[EvidenceExtra] ASC
)
INCLUDE ( [PodrobnostiAutoID])
WHERE ([EvidenceLetter]<>'1e')
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
GO
It has far fewer columns, but it isn't quite in the order I want - the first field is not in my ordering requirements. I'll try just making a new index from scratch, see if I can make that work properly, and then go back and go through all the indexes on this table. There is almost certainly some ballast that is no longer needed.
September 2, 2020 at 12:42 pm
Here is the latest index, specifically for this query only.
CREATE NONCLUSTERED INDEX [ix_MaxDruhaEvidence] ON [dbo].[Podrobnosti]
(
[EvidenceLetter] DESC,
[EvidenceNumber] DESC,
[EvidenceExtra] DESC
)
WHERE ([EvidenceLetter]<>'1e')
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
GO
It works, the engine picked it up and used it immediately, but the query plan still shows a full scan, reading and sorting all 120,000 rows - the descending ordering didn't help.
There are only 42 rows in the final result. I almost wonder if doing an individual lookup for the max of each letter might not be faster.
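Just to make that idea concrete, an 'individual lookup' for one letter is nothing more than a TOP (1) seek on the index - the letter value 'A' below is hypothetical, and the open question is how to drive one of these per distinct letter:
select top (1) EvidenceLetter , EvidenceNumber , EvidenceExtra
from dbo.Podrobnosti
where EvidenceLetter = 'A'  -- hypothetical letter value
order by EvidenceNumber desc, EvidenceExtra desc;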
September 2, 2020 at 12:58 pm
All the indexes show non-clustered. I'm sure I had a clustered index on this originally, but I've done a good bit of experimenting with this table and probably dropped it somewhere along the way. Do you think it would help? The query still specifies a reverse sort, and all that using a clustered index would do is substitute a table scan for the current index scan, which at least reads less data off disk. I also wonder if specifying this index as descending order would help. I'll give it a try.
A clustered index would only be chosen by the optimiser if it's more efficient to scan it than an index designed to support the query.
https://www.brentozar.com/blitz/heaps-tables-without-primary-key-clustered-index/
September 2, 2020 at 1:00 pm
Oh yeah, I forgot about the filter. It's actually the same filter as is used in my original query, but I forgot to put it back in when testing this stuff. It does include a lot of extra junk, though. Hang on...
Just tried it - good call. As soon as I added the condition in the filter, the query went to an index scan.
select EvidenceLetter , EvidenceNumber , EvidenceExtra
from (
select EvidenceLetter , EvidenceNumber , EvidenceExtra, row_number() over(partition by EvidenceLetter ORDER BY EvidenceNumber DESC, EvidenceExtra DESC) AS row_num
from dbo.Podrobnosti
where EvidenceLetter <> '1e'
) as evid_max
where row_num = 1
That index probably needs a look anyway. It's been on there a while and some of the conditions that applied when I first set it up have changed. It certainly no longer needs all those included columns.
It's still collecting and *sorting* all 120,000 rows, though, even from the index. That seems to me like a huge amount of unnecessary effort. The index is already ordered in the order in which I want the data collected. The subquery here has a reverse sort, which I get is necessary for the syntax of getting the 'last' (biggest) number, but it seems to me like there should be a cleaner way of getting to this. In a procedural language, I would simply read the data in the order in which it is already being retrieved, and note the last number when the first letter column changes. That seems like something SQL Server should be able to do, if I can figure out how to specify it properly.
"You're always going to get a table scan for the first column. SQL Server won't use the index to skip from the first row of a partition to the first row of the next partition, which is I think what you're expecting to see, although there is a clever way of forcing SQL Server to do this (bookmark Paul White superfast distinct)."
September 2, 2020 at 1:08 pm
Here's one way of obtaining the distinct first-column values without performing an index scan.
https://www.sqlservercentral.com/forums/topic/calculating-interest-query/page/4/#post-1244493
September 2, 2020 at 1:42 pm
Here's one way of obtaining the distinct first-column values without performing an index scan.
https://www.sqlservercentral.com/forums/topic/calculating-interest-query/page/4/#post-1244493
I'm looking at that, but it's not going well. I have three columns, not one, and the Min function on the three fields is giving me fits.
September 2, 2020 at 1:47 pm
ChrisM@Work wrote: Here's one way of obtaining the distinct first-column values without performing an index scan.
https://www.sqlservercentral.com/forums/topic/calculating-interest-query/page/4/#post-1244493
I'm looking at that, but it's not going well. I have three columns, not one, and the Min function on the three fields is giving me fits.
Use Paul's method for the first of your three columns because it's the only column you want DISTINCTed, then use the result set (as a CTE, derived table or a #temp table) to probe again into the table for the other two values. Holler if you get stuck.
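As a sketch of that second step - assuming the distinct letters have already been captured somewhere, here in a hypothetical #Letters temp table - the probe back into the main table is one TOP (1) seek per letter via CROSS APPLY:
SELECT l.EvidenceLetter, x.EvidenceNumber, x.EvidenceExtra
FROM #Letters l              -- hypothetical table holding the distinct letters
CROSS APPLY (
    SELECT TOP (1) p.EvidenceNumber, p.EvidenceExtra
    FROM dbo.Podrobnosti p
    WHERE p.EvidenceLetter = l.EvidenceLetter
    ORDER BY p.EvidenceNumber DESC, p.EvidenceExtra DESC
) x;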
September 2, 2020 at 6:23 pm
No luck. I've been beating my head against my desk for hours, trying all sorts of correlated subqueries and CTEs.
Most of them don't work at all, and the few that do are no improvement over the previous version, usually quite a bit worse.
I'm about ready to trade my brain in for a jelly donut.
September 2, 2020 at 6:30 pm
I can't believe this is so damn difficult. In words, "Read this index, in the order in which it already exists, and return each occurrence of the last values of the three fields in the index just before the first field changes." Even the most primitive procedural language can handle that, and languages like RPG have stuff built in that does -exactly- that. It seems impossible that a system as sophisticated as T-SQL, developed expressly for retrieving data, doesn't have a simple way to do this.
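For the record, that procedural description can at least be expressed declaratively with LEAD() (available from SQL Server 2012 on) - a sketch only, because it still reads and windows the whole index instead of skipping ahead:
SELECT EvidenceLetter, EvidenceNumber, EvidenceExtra
FROM (
    SELECT EvidenceLetter, EvidenceNumber, EvidenceExtra,
           -- look at the letter on the next row in index order
           LEAD(EvidenceLetter) OVER (ORDER BY EvidenceLetter, EvidenceNumber, EvidenceExtra) AS NextLetter
    FROM dbo.Podrobnosti
    WHERE EvidenceLetter <> '1e'
) x
WHERE NextLetter IS NULL            -- very last row
   OR NextLetter <> EvidenceLetter; -- last row before the letter changes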
September 2, 2020 at 7:07 pm
I can understand SQL having to scan the table / index, but I don't see why SQL would need to do a sort. Btw, an asc index will do, you don't need a desc one, since SQL has backward pointers in indexes anyway.
I don't have time to research the full details on this now, but maybe try this weird quick-shot variation, although SQL "should" be able to "figure this out" itself anyway:
select EvidenceLetter , EvidenceNumber , EvidenceExtra
from (
select *, row_number() over(partition by EvidenceLetter ORDER BY EvidenceLetter DESC, EvidenceNumber DESC, EvidenceExtra DESC) AS row_num
from dbo.Evid
) as evid_max
where row_num = 1
September 2, 2020 at 9:13 pm
ScottPletcher wrote: I can understand SQL having to scan the table / index, but I don't see why SQL would need to do a sort. Btw, an asc index will do, you don't need a desc one, since SQL has backward pointers in indexes anyway.
I don't have time to research the full details on this now, but maybe try this weird quick-shot variation, although SQL "should" be able to "figure this out" itself anyway:
select EvidenceLetter , EvidenceNumber , EvidenceExtra
from (
select *, row_number() over(partition by EvidenceLetter ORDER BY EvidenceLetter DESC, EvidenceNumber DESC, EvidenceExtra DESC) AS row_num
from dbo.Evid
) as evid_max
where row_num = 1
Oh, it's using the index, all right, has been since you clued me in on the filter condition. But what's driving me batty is that it:
1. returns the entire contents of the index, around 120,000 rows, instead of just the 40 that I want; and
2. sorts the entire set of 120,000 rows before discarding the unwanted ones.
I can sort of see why it's doing the sort, since the partition clause specifies an Order By. But none of this should be necessary. A simple index scan provides everything it needs for the result set, and allowed to go parallel, it should be instantaneous. The best of these still take around a full second to execute - completely ridiculous for the amount of work I actually want it to do.
The query you just sent me does a table scan; when I add the [where EvidenceLetter <> '1e'] condition, it does an index scan. But in both cases it still returns the huge record set and sorts it before discarding the 99.97% that I don't want - and which it should have realized right from the start that I don't want.
I can't help feeling that I'm overlooking something very fundamental and very simple.
September 3, 2020 at 6:49 am
ScottPletcher wrote: I can understand SQL having to scan the table / index, but I don't see why SQL would need to do a sort. Btw, an asc index will do, you don't need a desc one, since SQL has backward pointers in indexes anyway.
I don't have time to research the full details on this now, but maybe try this weird quick-shot variation, although SQL "should" be able to "figure this out" itself anyway:
select EvidenceLetter , EvidenceNumber , EvidenceExtra
from (
select *, row_number() over(partition by EvidenceLetter ORDER BY EvidenceLetter DESC, EvidenceNumber DESC, EvidenceExtra DESC) AS row_num
from dbo.Evid
) as evid_max
where row_num = 1
Oh, it's using the index, all right, has been since you clued me in on the filter condition. But what's driving me batty is that it:
1. returns the entire contents of the index, around 120,000 rows, instead of just the 40 that I want; and
2. sorts the entire set of 120,000 rows before discarding the unwanted ones.
I can sort of see why it's doing the sort, since the partition clause specifies an Order By. But none of this should be necessary. A simple index scan provides everything it needs for the result set, and allowed to go parallel, it should be instantaneous. The best of these still take around a full second to execute - completely ridiculous for the amount of work I actually want it to do.
The query you just sent me does a table scan; when I add the [where EvidenceLetter <> '1e'] condition, it does an index scan. But in both cases it still returns the huge record set and sorts it before discarding the 99.97% that I don't want - and which it should have realized right from the start that I don't want.
I can't help feeling that I'm overlooking something very fundamental and very simple.
I don't think you are. Quoting Paul White from the Superfast Distinct thread: "there was a column with very many duplicate values and it just seemed dumb to me that the optimizer would choose to scan the whole index rather than trying something a little more creative."
September 3, 2020 at 7:17 am
Here's exactly what I meant by using Paul White's method to grab the first column values only, then using that result set as the source for further processing. The first operator in the plan is an index scan, but it only reads a single row from the index - the first. After that it's all index seeks, and it completes in 1ms using my 234,256-row sample data set:
;WITH FirstColumn AS (
    -- Anchor: the lowest EvidenceLetter in the table
    SELECT EvidenceLetter = MIN(e.EvidenceLetter)
    FROM #Evid e
    UNION ALL
    -- Recursive step: seek to the next EvidenceLetter above the previous one.
    -- ROW_NUMBER in a derived table stands in for TOP/MIN, which aren't
    -- allowed in the recursive part of a recursive CTE.
    SELECT f.EvidenceLetter
    FROM (
        SELECT e.EvidenceLetter,
               rn = ROW_NUMBER() OVER (ORDER BY e.EvidenceLetter)
        FROM #Evid e
        INNER JOIN FirstColumn f
            ON f.EvidenceLetter < e.EvidenceLetter
    ) f
    WHERE f.rn = 1
)
-- For each distinct letter, pick up its top (EvidenceNumber, EvidenceExtra)
-- pair with a single seek
SELECT f.EvidenceLetter, c.EvidenceNumber, c.EvidenceExtra
FROM FirstColumn f
CROSS APPLY (
    SELECT TOP (1) e.EvidenceNumber, e.EvidenceExtra
    FROM #Evid e
    WHERE e.EvidenceLetter = f.EvidenceLetter
    ORDER BY e.EvidenceNumber DESC, e.EvidenceExtra DESC
) c
OPTION (MAXRECURSION 0);