April 19, 2013 at 10:15 pm
I've got a 350 million row partitioned table that I'm trying to aggregate. The plan I get uses a HASH Aggregate. I would like to get a query plan that uses a streaming aggregate so results can be returned to the client faster. Here is the original query (Schema is at the end):
SELECT userid, sessionid, username FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear(datetime) = 1 AND sessionid IS NOT null
GROUP BY userid, UserName, SessionID
Schema for the Table and indexes are fairly straight forward. The clustered index is on (userid, datetime). There is a Non-clustered index on (Username, sessionid). Both indexes are partitioned on (DateTime). Since the query is limited to one partition and the columns being grouped on are either in the clustered index or the NC index key, the query optimizer should be able to generate a plan using a streaming aggregate that scans the NC index.
However, I am unable to produce the desired plan. I've tried 4 variations: with/without MAXDOP 1, with/without query "ORDER GROUP" query hint. The plans for these are in the attachment.
--This is the query that should be producing a streaming agg
SELECT userid, sessionid, username FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear(datetime) = 1 AND sessionid IS NOT null
GROUP BY userid, UserName, SessionID
--Still get a Hash Aggregate without parallelism
SELECT userid, sessionid, username FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear(datetime) = 1 AND sessionid IS NOT null
GROUP BY userid, UserName, SessionID OPTION (MAXDOP 1)
--Prevent HASH Aggregation, and we get a distinct sort instead
SELECT userid, sessionid, username FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear(datetime) = 1 AND sessionid IS NOT null
GROUP BY userid, UserName, SessionID OPTION (ORDER GROUP)
--Still get distinct sort
SELECT userid, sessionid, username FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear(datetime) = 1 AND sessionid IS NOT null
GROUP BY userid, UserName, SessionID OPTION (MAXDOP 1, ORDER GROUP)
It there something preventing the use of a streaming aggregate for this query, or is this a bug/limitation of the QO?
Create Schema for the table:
CREATE PARTITION FUNCTION [PartitionByYear](datetime) AS RANGE RIGHT FOR
VALUES (N'2005-01-01T00:00:00.000', N'2006-01-01T00:00:00.000', N'2007-01-01T00:00:00.000',
N'2008-01-01T00:00:00.000', N'2009-01-01T00:00:00.000', N'2010-01-01T00:00:00.000',
N'2011-01-01T00:00:00.000', N'2012-01-01T00:00:00.000', N'2014-01-01T00:00:00.000')
GO
/****** Object: PartitionScheme [PartYearsToArchiveAndPrimary] Script Date: 4/19/2013 11:04:53 PM ******/
CREATE PARTITION SCHEME [PartYearsToArchiveAndPrimary] AS PARTITION [PartitionByYear]
TO ([Archive], [Archive], [Archive], [Archive], [Archive], [Archive], [Archive], [Archive], [PRIMARY], [PRIMARY])
GO
CREATE TABLE [dbo].[AttemptedLogin](
[ObjectId] [int] IDENTITY(1,1) NOT NULL,
[DateTime] [datetime] NOT NULL,
[UserId] [int] NOT NULL,
[UserName] [varchar](30) NULL,
[Password] [varchar](30) NULL,
[SessionID] [varchar](30) NULL,
[IPAddress] [varchar](16) NULL,
[ProxyAddress] [varchar](16) NULL,
[CountryCode] [char](2) NULL,
CONSTRAINT [PK_NewAttemptedLogin] PRIMARY KEY NONCLUSTERED
(
[ObjectId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PartYearsToArchiveAndPrimary]([DateTime])
GO
/****** Object: Index [CIX_AttemptedLogin] Script Date: 4/19/2013 11:02:19 PM ******/
CREATE CLUSTERED INDEX [CIX_AttemptedLogin] ON [dbo].[AttemptedLogin]
(
[UserId] ASC,
[DateTime] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON
) ON [PartYearsToArchiveAndPrimary]([DateTime])
GO
/****** Object: Index [IX_AttemptedLogin_UserNameSessionID] Script Date: 4/19/2013 11:02:20 PM ******/
CREATE NONCLUSTERED INDEX [IX_AttemptedLogin_UserNameSessionID] ON [dbo].[AttemptedLogin]
(
[UserName] ASC,
[SessionID] ASC
)
WHERE ([SessionId] IS NOT NULL)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON,
FILLFACTOR = 90) ON [PartYearsToArchiveAndPrimary]([DateTime])
GO
Thanks!
April 19, 2013 at 10:21 pm
Forgot the include information about the machine:
Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64)
Dec 28 2012 20:23:12
Copyright (c) Microsoft Corporation
Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
April 20, 2013 at 12:05 am
Scanning the nonclustered index, there is an implied first key on the partition ID, so the following can use a Stream Aggregate:
SELECT sessionid, username
FROM dbo.AttemptedLogin
WHERE sessionid IS NOT null
GROUP BY $Partition.PartitionByYear([DateTime]), UserName, SessionID
Similarly, if a single partition is statically specified, the following also results in a Stream Aggregate, since the keys of the nonclustered index are (UserName, SessionID):
SELECT SessionID, UserName FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear([DateTime]) = 1 AND SessionID IS NOT null
GROUP BY UserName, SessionID
The problem in the original code is that it adds UserID to the mix, and fails to use a Stream Aggregate:
SELECT UserId, SessionID, UserName FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear([DateTime]) = 1 AND SessionID IS NOT null
GROUP BY UserId, UserName, SessionID
Now UserID forms part of the key of the nonclustered index since it is not declared as unique, but it comes after the nonclustered keys. Writing the GROUP BY clause to reflect the key order results in the Stream Aggregate you are looking for:
SELECT UserId, SessionID, UserName FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear([DateTime]) = 1 AND SessionID IS NOT null
GROUP BY UserName, SessionID, UserId
Should the order of the GROUP BY keys matter? No. In that sense, it is a limitation, in the same way that writing a query with several window functions might result in an unnecessary sort depending on the written order of the query.
It may be more natural (and easier for future maintenance) if you explicitly include UserID in the nonclustered index definition.
edit: In response to your other Twitter question, yes the same considerations apply to DISTINCT:
-- Sort
SELECT DISTINCT UserId, UserName, SessionID FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear([DateTime]) = 1 AND sessionid IS NOT null
-- No sort
SELECT DISTINCT UserName, SessionID, UserId FROM dbo.AttemptedLogin
WHERE $Partition.PartitionByYear([DateTime]) = 1 AND sessionid IS NOT null
Another option is to keep the GROUP BY or DISTINCT column order as it is, and add a presentation ORDER BY UserName, SessionID, UserId
April 20, 2013 at 6:36 am
I am glad you saw this one Paul! I didn't think anyone else could give as clear of a description of why the OP was getting the observed behavior.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply