Need to group the same valued ID fields to run an aggregate average over them

Question

Post reply

Need to group the same valued ID fields to run an aggregate average over them

Zososql

Ten Centuries

Points: 1218
More actions
October 4, 2024 at 5:00 pm

#4465371

I have data that contains poll results, with the poll_id consistuting a single row of data for each candidate in said poll, so a poll_id will have a minimum of two different rows (not my design, parsed from ABC polling results). Sometimes there are more than 2 rows for each distinct poll_id as the survey may be graded multiple times. Whereas If the poll_id shows up more than twice, I want to take the average of these , the average of the Pct field, to list as the result. I'm wanting to do a derived column for this as truePct , but have a problem with HAVING filter, as it's not looking at the same poll_id, but all the poll_ids in the query (see the Select avg(pct) statement for truePct). ??? Thanks in advance.
SELECT poll_id ,start_date,end_date,
candidate_name, sample_size,
(select avg(pct) from [president_polls] where state = 'nevada' having count(poll_id) > 2) as truePct,
round([pct], 2) as Points
FROM president_polls
WHERE created_at >= '2024-9-1'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')
DDL =
CREATE TABLE [president_polls](
[poll_id] [int] IDENTITY(1,1) NOT NULL,
[state] [nvarchar](50) NULL,
[start_date] [date] NULL,
[end_date] [date] NULL,
[sample_size] [smallint] NULL,
[created_at] [datetime2](7) NULL,
[candidate_name] [nvarchar](50) NULL,
[pct] [float] NULL
)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Zososql Ten Centuries Points: 1218 More actions · Answer 1

CORRECTION:

poll_id is not a primary key or unique value, sits in a local dev table

Ken McKelvey SSCoach Points: 18842 More actions · Answer 2

You have over 1000 points so should know how to post a question by now.

How to post a T-SQL question on a public forum | spaghettidba

ScottPletcher SSC Guru Points: 101096 More actions · Answer 3

Maybe this?:

SELECT  poll_id ,start_date,end_date,
candidate_name, sample_size,
(select avg(pp2.pct) from [president_polls] pp2 
    where pp2.state = pp.state and pp2.poll_id = pp.poll_id 
    having count(*) > 2) as truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

drew.allen SSC Guru Points: 76985 More actions · Answer 4

ScottPletcher wrote:

Maybe this?:

SELECT  poll_id ,start_date,end_date,
candidate_name, sample_size,
(select avg(pp2.pct) from [president_polls] pp2 
    where pp2.state = pp.state and pp2.poll_id = pp.poll_id 
    having count(*) > 2) as truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')

You've been around long enough that you should know that a correlated subquery is never going to perform as well as a windowed function.

SELECT  poll_id ,start_date,end_date,
candidate_name, sample_size,
AVG(pp.pct) OVER(PARTITION BY pp.[state], pp.poll_id) AS truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

ScottPletcher SSC Guru Points: 101096 More actions · Answer 5

My primary goal is to meet the stated business requirements for the code, which I don't believe your code does. You will list the "truePct" for even a single row, which OP explicitly stated -- and showed -- that they did not want to do.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ScottPletcher SSC Guru Points: 101096 More actions · Answer 6

I'm not sure how we'd be able to determine whether more than 2 rows are present with that style of query, in order to meet the stated requirements. Then again, I'll admit I haven't spent much time thinking about that, since I'm very busy with other things. Hence, I tend to stick to a straight-line resolution to the problem presented, sometimes overly so.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

drew.allen SSC Guru Points: 76985 More actions · Answer 7

ScottPletcher wrote:

My primary goal is to meet the stated business requirements for the code, which I don't believe your code does. You will list the "truePct" for even a single row, which OP explicitly stated -- and showed -- that they did not want to do.

I did miss that, but it's easy to fix. This is part of the reason that we ask for sample data and expected results. We have nothing to compare our solutions to, so it's easy to miss criteria.

SELECT  poll_id ,start_date,end_date,
candidate_name, sample_size,
CASE WHEN COUNT(*) OVER(PARTITION BY pp.[State], pp.poll_id) > 1
     THEN AVG(pp.pct) OVER(PARTITION BY pp.[state], pp.poll_id)
     ELSE NULL
 END AS truePct,
round([pct], 2) as Points
FROM president_polls pp
WHERE created_at >= '20240901'
AND state = 'nevada'
AND candidate_name in ('Donald Trump','Kamala Harris')

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

ScottPletcher SSC Guru Points: 101096 More actions · Answer 8

It's "> 2" :-).

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ScottPletcher SSC Guru Points: 101096 More actions · Answer 9

As to performance:

I figure SQL would have to partition the data twice to get the COUNT() and the AVG() from different queries, but I'd need to look at the query plan to be sure, of course.

Overall, though, given the limited number of rows (*), I sincerely doubt you'd see much of a performance difference between methods. Particularly so if the table is properly clustered by ( created_at ) first (assuming all queries generally follow the pattern above). Admittedly that clustering may not be the case, and a nonclustered index may be used instead, which could end up damaging performance regardless of method.

(*) There are only 50 (or so) states and only so many polls you can conduct in a limited time window.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".